Why does BeautifulSoup extract only the first 60-odd lines when parsing a Baidu Tieba page?

import urllib  # Python 2; in Python 3 use urllib.request.urlopen
from bs4 import BeautifulSoup

f = urllib.urlopen(url).read()
soup = BeautifulSoup(f, 'html.parser')

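As in the code above, printing f shows the complete page, several hundred lines long, but printing soup gives only 60-odd lines. Scraping other web pages works fine; this happens only with Baidu Tieba threads. What is the cause?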

12 replies • 2015-07-28 15:54:02 +08:00

WhiteLament

July 27, 2015

Try changing 'html.parser' to 'lxml'?

lingo233

July 27, 2015

As I recall, Tieba only shows one page of content when you are not logged in.

iyaozhen

July 27, 2015

Reply #2 is probably the real answer.

liaipeng

July 27, 2015

@WhiteLament
I get the message below. I'm not familiar with the BeautifulSoup module yet; this is my first time using it.
Couldn't find a tree builder with the features you requested: lxml.parser. Do you need to install a parser library?

liaipeng

July 27, 2015

@lingo233 That's not it. The problem is that soup doesn't even contain the complete opening post.

yappa

July 27, 2015

Change html.parser to lxml or html5lib. Both modules have to be installed first.
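
A minimal sketch of that suggestion, assuming bs4 is installed. The helper name make_soup and its preference order are illustrative, not part of BeautifulSoup; it tries each parser in turn and falls back to the built-in 'html.parser' when lxml and html5lib are missing. Note that the parser names are 'lxml' and 'html5lib', not 'lxml.parser'.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) tuple so the caller can see which
    parser was actually used. `make_soup` is a hypothetical helper.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser
            # library is not installed; try the next candidate.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<p>hello</p>")
print(used, soup.p.get_text())
```

Reporting which parser was used makes it easier to tell whether a truncated result comes from the page or from the parser.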

liaipeng

July 27, 2015

@yappa
OK, I'll give it a try.

liaipeng

July 27, 2015

@yappa It works! Thank you so much. I'd still like to know why this happens. What is different about Tieba threads compared with other web pages?

WhiteLament

July 27, 2015

You haven't installed it.
pip install lxml

yappa

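July 27, 2015

You probably copied the code from the documentation. "html.parser" is the HTML parser built into Python's standard library. You need to pick a parser that suits the page; lxml and html5lib are alternatives you can pass in place of "html.parser".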

WhiteLament

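July 27, 2015

Some pages are not well-formed, and different parsers tolerate that differently, so the results differ.
I've run into this too; switching to another parser fixed it.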
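
To illustrate the point about parsers disagreeing on malformed markup, here is a small sketch, assuming bs4 is installed. The sample string and the helper name parse_with_each are made up for the demonstration; it says nothing about what specifically trips up 'html.parser' on Tieba pages. Parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed: a stray </p> with no matching <p>.
BROKEN = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree} for every installed parser."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Parser library not installed; skip it rather than fail.
            continue
    return results


for name, html in parse_with_each(BROKEN).items():
    print(name, "->", html)
```

Running the same comparison on the raw page (f in the original post) and checking the length of str(soup) for each parser is a quick way to see which parser keeps the whole document.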

liaipeng

July 28, 2015

@WhiteLament
@yappa
Thanks to both of you!