1
cam 2012-12-09 18:07:14 +08:00
Ruby is a bit quicker to work with: if you only need to parse HTML, use Nokogiri;
if you need session handling, use Mechanize. Trust me, you can't go wrong! https://github.com/sparklemotion/mechanize https://github.com/sparklemotion/nokogiri
2
wingoo 2012-12-09 18:26:50 +08:00
python scrapy
3
fwee 2012-12-09 18:38:27 +08:00
Not much difference between them. I'd also recommend EventMachine, Ruby's async I/O library.
5
liuxurong 2012-12-09 19:51:43 +08:00
requests
6
muxi 2012-12-10 00:15:58 +08:00
Those of you using Scrapy: have you never hit a point where you had to bolt things on by hand, or rework it because it didn't fit your needs?
That stretch was an absolute nightmare for me. The thing is genuinely complex; unless you're doing general-purpose crawling, I'd advise against it. For targeted scraping, something like requests is probably a better fit.
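A minimal sketch of that targeted approach: fetch a page (with requests, or the stdlib's urllib as shown here) and pull out only the links you care about with a regex. The URL pattern below is purely illustrative, not taken from any real site's markup:

```python
import re
from urllib.request import urlopen  # with requests it would be: requests.get(url).text

def article_links(html):
    # Targeted scrape: grab only article links matching a known URL shape.
    return re.findall(r'href="(/article-\d+-\d+\.html)"', html)

# Against a live site: html = urlopen(url).read().decode('utf-8')
sample = '<a href="/article-12-1.html">a</a> <a href="/portal.php">b</a>'
print(article_links(sample))  # -> ['/article-12-1.html']
```

The point of the targeted style is that you write one small regex or selector per page type, rather than configuring a general-purpose crawling framework.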
7
zuroc 2012-12-10 00:23:51 +08:00
8
zuroc 2012-12-10 00:25:06 +08:00
Example code:
#coding:utf-8
from spider.spider import route, Handler, spider
import _env
from os.path import abspath, dirname, join
from operator import itemgetter

PREFIX = join(dirname(abspath(__file__)))
HTTP = 'http://www.ecocn.org/%s'

@route('/portal\.php')
class portal(Handler):
    def get(self):
        for link in self.extract_all('<dt class="xs2"><a href="', '"'):
            spider.put(HTTP % link)

@route('/article-\d+-\d+.html')
class article(Handler):
    def get(self):
        link = self.extract('class="pn" href="', '" target=""> 中英对照')
        spider.put(HTTP % link)

@route('/forum\.php')
class forum(Handler):
    from mako.lookup import Template
    template = Template(filename=join(PREFIX, 'template/rss.xml'))
    page = []

    def get(self):
        name = self.extract('id="thread_subject">', '</a>')
        if not name:
            return
        name = name.split(']', 1)[-1].strip()
        html = self.extract('<div class="t_fsz">', '<div id="comment_')
        html = html[:html.rfind('</div>')]
        tid = int(self.get_argument('tid'))
        print tid, name
        self.page.append((tid, self.request.url, name, html))

    @classmethod
    def write(cls):
        page = cls.page
        page.sort(key=itemgetter(0), reverse=True)
        with open(join(PREFIX, 'ecocn_org.xml'), 'w') as rss:
            rss.write(
                cls.template.render(
                    rss_title='经济学人 . 中文网',
                    rss_link='http://www.ecocn.org',
                    li=[
                        dict(link=link, title=title, txt=txt)
                        for id, link, title, txt in cls.page
                    ]
                )
            )

if __name__ == '__main__':
    spider.put('http://www.ecocn.org/portal.php?mod=list&catid=1')
    # 10 concurrent fetch threads; page read timeout of 30 seconds
    spider.run(10, 30)
    forum.write()
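The spider.spider module above isn't shown, but its extract / extract_all helpers look like plain between-two-markers string slicing. A stand-alone guess at how they might work (the names come from the post; the implementation below is an assumption, not the actual library):

```python
def extract(html, prefix, suffix):
    # Return the text between the first occurrence of prefix and the
    # next occurrence of suffix, or None if either marker is missing.
    start = html.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = html.find(suffix, start)
    if end == -1:
        return None
    return html[start:end]

def extract_all(html, prefix, suffix):
    # Collect every non-overlapping prefix...suffix span, left to right.
    results = []
    pos = 0
    while True:
        start = html.find(prefix, pos)
        if start == -1:
            break
        start += len(prefix)
        end = html.find(suffix, start)
        if end == -1:
            break
        results.append(html[start:end])
        pos = end + len(suffix)
    return results

print(extract('<b>hi</b>', '<b>', '</b>'))      # -> hi
print(extract_all('[a][b]', '[', ']'))          # -> ['a', 'b']
```

Marker-slicing like this is fragile against markup changes, but for a fixed target site it avoids pulling in a full HTML parser.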
9
kenlen 2012-12-10 00:29:20 +08:00
Just saw an example of using Python with Scrapy to scrape photos: http://bbs.chinaunix.net/thread-4057457-1-1.html
10
oa414 2012-12-10 01:04:26 +08:00
11
HowardMei 2012-12-10 10:48:22 +08:00
When it comes to crawlers, Python is miles ahead of every other language:
https://scraperwiki.com/docs/python/python_libraries/ Ruby's offerings are less complete than Python's; many of its libraries were ported over from Python. This scraping platform itself is written in Python: https://bitbucket.org/ScraperWiki/scraperwiki/src After all, the python is the king of crawlers :) (爬虫 means both "reptile" and "web crawler")
12
HowardMei 2012-12-10 11:48:36 +08:00
There's also a semantic-analysis framework that's quite interesting: http://www.clips.ua.ac.be/pages/pattern
13
messense 2012-12-10 12:15:18 +08:00
Python's pattern + pyquery libraries