Target URL: https://tieba.baidu.com/p/4959928798 Viewing the page source in Chrome, I see this snippet:
<a class="pb_nameplate j_nameplate j_self_no_nameplate" href="/tbmall/propslist?category=112&ps=24" data-field='{"props_id":"1120050972","end_time":"1512731564","title":"\u6d77\u8d3c\u738b\u7684\u53f3\u624b","optional_word":["\u7684","\u4e4b","\u306e"],"pattern":["1","1","1","2","3","3"]}' target="_blank">海贼王的右手</a>
Based on: class="pb_nameplate j_nameplate j_self_no_nameplate"
I wrote a regex: (?<=pb_nameplate\sj_nameplate\sj_self_nameplate)[\s\S]*?(?=)
When I ran it, it would not match no matter what I tried, so I wrote this:
# -*- coding: utf-8 -*-
__author__ = 'duohappy'

import requests


def get_info_from(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
    }
    web_data = requests.get(url, headers=headers)
    web_data.encoding = 'utf-8'
    content = web_data.text
    with open('./test.txt', 'w') as f:
        f.write(content)


if __name__ == '__main__':
    url = 'http://tieba.baidu.com/p/4959928798'
    get_info_from(url)
Only then did I discover this:
<a class="pb_nameplate j_nameplate j_self_nameplate" href="/tbmall/propslist?category=112&ps=24" data-field='{"props_id":"1120050972","end_time":"1512731564","title":"\u6d77\u8d3c\u738b\u7684\u53f3\u624b","optional_word":["\u7684","\u4e4b","\u306e"],"pattern":["1","1","1","2","3","3"]}' target="_blank">海贼王的右手</a>
class="pb_nameplate j_nameplate j_self_no_nameplate" had become "pb_nameplate j_nameplate j_self_nameplate".
What technique is this, or am I doing something wrong?
1
amustart 2017-03-09 12:31:51 +08:00 1
Regex -> use an HTML parser instead.
(Perhaps the source changed because what you see in Chrome is not the same as what you actually crawled?)
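A minimal sketch of the parser-based approach suggested above, using only the standard library's `html.parser`. It matches the link on the `pb_nameplate` class alone, so it does not care whether the third class is `j_self_nameplate` or `j_self_no_nameplate`. The names `NameplateParser`, `extract_nameplates`, and `SAMPLE` are made up for illustration, and the sample markup is a trimmed copy of the snippet in the post.

```python
import json
from html.parser import HTMLParser

# Trimmed copy of the anchor quoted in the post (href and some JSON keys omitted).
SAMPLE = r"""<a class="pb_nameplate j_nameplate j_self_nameplate" data-field='{"props_id":"1120050972","title":"\u6d77\u8d3c\u738b\u7684\u53f3\u624b"}' target="_blank">海贼王的右手</a>"""


class NameplateParser(HTMLParser):
    """Collects every <a> whose class list contains pb_nameplate."""

    def __init__(self):
        super().__init__()
        self.nameplates = []
        self._current = None  # the nameplate link being read, if any

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "pb_nameplate" in classes:
            # data-field holds JSON; json.loads also decodes the \uXXXX escapes.
            raw = attr_map.get("data-field") or "{}"
            self._current = {"data": json.loads(raw), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.nameplates.append(self._current)
            self._current = None


def extract_nameplates(html_text):
    """Return a list of {"data": dict, "text": str} for each nameplate link."""
    parser = NameplateParser()
    parser.feed(html_text)
    parser.close()
    return parser.nameplates
```

Calling `extract_nameplates(content)` on the text saved by the script in the post should work the same way on either variant of the class attribute, since only `pb_nameplate` is checked.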
4
holyzhou 2017-03-09 12:39:08 +08:00 1
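Heh, I just tried it. Most likely you are logged in in the browser, while the script is not.
Export the request as a curl command line, then compare the response you get with the cookie against the one you get without it.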
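The same comparison can be sketched in Python instead of curl. This is only an illustration under assumptions: `fetch` and `diff_pages` are hypothetical helper names, the `cookie` argument is expected to be the raw `Cookie` header value copied from the browser's developer tools, and the short User-Agent string is a placeholder.

```python
import difflib
import urllib.request


def fetch(url, cookie=None):
    """Fetch a page as text, optionally sending a raw Cookie header value."""
    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder UA, adjust as needed
    if cookie:
        headers["Cookie"] = cookie
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def diff_pages(without_cookie, with_cookie):
    """Return only the removed ('-') and added ('+') lines between two sources."""
    diff = list(
        difflib.unified_diff(
            without_cookie.splitlines(),
            with_cookie.splitlines(),
            fromfile="no_cookie",
            tofile="with_cookie",
            lineterm="",
            n=0,  # no context lines, just the changes
        )
    )
    # The first two entries are the '---' / '+++' file headers; hunk headers start with '@@'.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

A possible use would be `diff_pages(fetch(url), fetch(url, cookie=my_cookie))`, where `my_cookie` is whatever the logged-in browser sends. If the nameplate class shows up in the result, the login state explains the difference.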
5
954880786 2017-03-09 12:39:09 +08:00 via iPhone 1
OP, try making the forged headers more complete. It could also be caused by JS executing dynamically.
6
ljcarsenal 2017-03-09 12:40:34 +08:00 via Android 1
Were you using View Source, or Inspect Element from F12?
7
annielong 2017-03-09 12:45:58 +08:00 1
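When something goes wrong, the usual fix is the clumsy one: first dump the full text you crawled, then write your parsing against that text rather than against what the browser shows.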
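If you do stay with a regex, one way to follow this advice is to write the pattern against the saved dump and tolerate both class variants. A rough sketch, assuming the anchor keeps the shape quoted in the post (class attribute first, no `>` inside the other attributes); `NAMEPLATE_RE` and `find_nameplate_titles` are names invented here.

```python
import re

# (?:no_)? makes the pattern accept j_self_nameplate and j_self_no_nameplate alike.
NAMEPLATE_RE = re.compile(
    r'<a\s+class="pb_nameplate\sj_nameplate\sj_self_(?:no_)?nameplate"[^>]*>(.*?)</a>',
    re.S,
)


def find_nameplate_titles(page_text):
    """Return the link text of every nameplate anchor found in page_text."""
    return NAMEPLATE_RE.findall(page_text)
```

With the script from the post, `page_text` would be the contents of `./test.txt`, so the pattern is tested against what the crawler actually received.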
11
QQ2112755791 2017-03-09 13:47:30 +08:00
Nice code. Could it have something to do with the server?
12
ChangHaoWei 2017-03-09 14:26:18 +08:00
When using Firefox or Chrome, remember to install a JS on/off switch. That way you can see how the page looks without JS modifying the DOM.