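Crawling Baidu Tieba with Scrapy: how do I collect only the posts whose content or title contains certain keywords? The XPath I wrote never picks them out.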

item['title'] = response.xpath('//h1[@style="width: 470px"]/text()').extract()[0].strip()  # post title
item['url'] = response.meta['text_url']  # post URL
item['content'] = response.xpath('//*[starts-with(@id, "post_content_")]/text()').extract()[0].strip()  # post content
item['time'] = response.xpath('//div[@class="l_post j_l_post l_post_bright noborder "]').re(r"\d+-\d+-\d+ \d+:\d+")  # post time
item['click'] = random.randint(0, 20)  # click count, filled with a random value

I tried the two approaches below: check the content first, then decide whether to keep the post.

################ Method 2

okok = response.xpath('//*[starts-with(@id, "post_content_")]/text()').extract()[0].strip()
# keywords: 交友 (making friends), 征婚 (seeking marriage), 美女 (pretty girls)
if '交友' or '征婚' or '美女' in okok:
    item['content'] = response.xpath('//*[starts-with(@id, "post_content_")]/text()').extract()[0].strip()
    item['title'] = response.xpath('//h1[@style="width: 470px"]/text()').extract()[0].strip()
    item['url'] = response.meta['text_url']
    item['time'] = response.xpath('//div[@class="l_post j_l_post l_post_bright noborder "]').re(r"\d+-\d+-\d+ \d+:\d+")
    item['click'] = random.randint(0, 20)
    print(item)
    yield item

Neither of these works. I also tried contains(str1, str2), but I may have used it incorrectly; it never succeeded either.

Is there a way to collect Baidu Tieba posts based on a set of keywords?

Thanks.
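
One likely reason the check in Method 2 lets every post through: Python parses `'交友' or '征婚' or '美女' in okok` as `'交友' or '征婚' or ('美女' in okok)`, and a non-empty string is always truthy, so the `if` never rejects anything. Here is a minimal sketch of a per-keyword test. The names `KEYWORDS`, `always_true_check`, and `matches_keywords` are made up for illustration and are not part of Scrapy.

```python
# Hypothetical keyword list; replace with whatever terms you need.
KEYWORDS = ('交友', '征婚', '美女')


def always_true_check(text):
    # Mirrors the original condition. It is evaluated as
    # '交友' or '征婚' or ('美女' in text), and the first operand
    # is a non-empty string, so the result is truthy for any text.
    return bool('交友' or '征婚' or '美女' in text)


def matches_keywords(text, keywords=KEYWORDS):
    # True only if at least one keyword actually occurs in the text.
    return any(keyword in text for keyword in keywords)
```

In the callback, something like `if matches_keywords(okok):` in place of the original condition would then yield only matching posts. To test the title as well, the same helper can be called on the title string.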

2 replies

byfar

2017-04-27 13:10:41 +08:00

XPath tip: in Chrome, open the Elements panel of the developer tools and press Ctrl+F; the search box supports "find by string, selector, or XPath".

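If nothing is matched, either the element you want is loaded by an asynchronous AJAX request, in which case you can simulate that request,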

or your XPath expression is wrong, which you can check with the method above.

bb2018

2017-04-27 15:02:54 +08:00

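@byfar It's not that nothing is matched. I can get the information for every post; what I can't do is pick out only the posts whose content contains specific keywords such as 交友, 征婚, or 美女.
I'm not sure whether to filter while crawling or when writing to the database. How should I filter out the information I want?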
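
On where to filter: either place can work. Filtering in the spider callback avoids building items that will be discarded, while an item pipeline keeps the spider simple and drops unwanted items just before they are stored. Below is a rough sketch of the pipeline variant, assuming items carry `title` and `content` fields as in the code above. `KeywordFilterPipeline` and `KEYWORDS` are hypothetical names; `process_item` and `DropItem` are the usual Scrapy pipeline hook and exception. The `ImportError` fallback exists only so the sketch can run without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback for running this sketch outside a Scrapy environment.
    class DropItem(Exception):
        pass

# Hypothetical keyword list; replace with whatever terms you need.
KEYWORDS = ('交友', '征婚', '美女')


class KeywordFilterPipeline(object):
    """Keep an item only if its title or content contains a keyword."""

    def process_item(self, item, spider):
        # Combine the fields to search; missing fields count as empty.
        text = (item.get('title') or '') + (item.get('content') or '')
        if any(keyword in text for keyword in KEYWORDS):
            return item  # passed on to later pipelines (e.g. storage)
        raise DropItem('no keyword found in post')
```

To use it, the class would be listed in the project's `ITEM_PIPELINES` setting with a lower number than the pipeline that writes to the database, so that dropped items never reach storage.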