I wrote this several months ago, and it's fairly crude.
It doesn't crawl a blog's entire contents: it was originally written for a website, and crawling everything would make users wait too long.
# -*- coding=utf-8 -*-
from threading import Thread
import Queue
import requests
import re
import os
import sys
import time
api_url='http://%s.tumblr.com/api/read?&num=50&start='
UQueue=Queue.Queue()
def getpost(uid, queue):
    url = 'http://%s.tumblr.com/api/read?&num=50' % uid
    page = requests.get(url).content
    total = re.findall('<posts start="0" total="(.*?)">', page)[0]
    total = int(total)
    # each API page returns 50 posts; enqueue one URL per offset
    offsets = [i * 50 for i in range(1000) if i * 50 < total]
    ul = api_url % uid
    for i in offsets:
        queue.put(ul + str(i))
extractpicre = re.compile(r'(?<=<photo-url max-width="1280">).+?(?=</photo-url>)',flags=re.S) #search for url of maxium size of a picture, which starts with '<photo-url max-width="1280">' and ends with '</photo-url>'
extractvideore=re.compile('/tumblr_(.*?)" type="video/mp4"')
video_links = []
pic_links = []
vhead = 'https://vt.tumblr.com/tumblr_%s.mp4'
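A quick sanity check of what the two regular expressions capture. The XML fragment below is a made-up sample shaped like an `/api/read` response, not real API output; note the video capture keeps the trailing `/480` quality suffix, which is why `write()` strips it later.

```python
import re

extractpicre = re.compile(r'(?<=<photo-url max-width="1280">).+?(?=</photo-url>)', flags=re.S)
extractvideore = re.compile('/tumblr_(.*?)" type="video/mp4"')
vhead = 'https://vt.tumblr.com/tumblr_%s.mp4'

# fabricated sample in the shape of the /api/read XML
sample = (
    '<photo-url max-width="1280">https://media.tumblr.com/abc_1280.jpg</photo-url>'
    '<video src="https://vt.tumblr.com/tumblr_abc123/480" type="video/mp4">'
)

pics = extractpicre.findall(sample)
# the capture is 'abc123/480'; stripping '/480' yields the direct .mp4 link
vids = [(vhead % v).replace('/480', '') for v in extractvideore.findall(sample)]
print(pics)  # ['https://media.tumblr.com/abc_1280.jpg']
print(vids)  # ['https://vt.tumblr.com/tumblr_abc123.mp4']
```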
class Consumer(Thread):
    def __init__(self, l_queue):
        super(Consumer, self).__init__()
        self.queue = l_queue

    def run(self):
        session = requests.Session()
        while 1:
            # grab the next page URL; stop when the queue runs dry
            # (checking queue.empty() after a blocking get() could leave
            # a thread stuck forever, so use get_nowait instead)
            try:
                link = self.queue.get_nowait()
            except Queue.Empty:
                break
            print 'start parse post: ' + link
            try:
                content = session.get(link).content
                videos = extractvideore.findall(content)
                video_links.extend([vhead % v for v in videos])
                pic_links.extend(extractpicre.findall(content))
            except Exception as e:
                print 'url: %s parse failed (%s)\n' % (link, e)
def main():
    task = []
    for i in range(min(10, UQueue.qsize())):
        t = Consumer(UQueue)
        task.append(t)
    for t in task:
        t.start()
    for t in task:
        t.join()  # the original 't.join' (no parentheses) never actually waited
def write():
    # strip the '/480' quality suffix and deduplicate
    # (the dedup was missing in the first version)
    videos = list(set(i.replace('/480', '') for i in video_links))
    pictures = list(set(pic_links))
    with open('pictures.txt', 'w') as f:
        for i in pictures:
            f.write('%s\n' % i)
    with open('videos.txt', 'w') as f:
        for i in videos:
            f.write('%s\n' % i)
if __name__ == '__main__':
    #name = sys.argv[1]
    #name = name.strip()
    name = 'mzcyx2011'
    getpost(name, UQueue)
    main()
    write()
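The script only writes link lists; actually fetching the files is left to the reader (comment #46 below does it with urllib). A minimal downloader sketch under the same assumptions; the function name and layout here are mine, not part of the original script:

```python
import os
try:
    from urllib import urlretrieve            # Python 2, matching the script
except ImportError:
    from urllib.request import urlretrieve    # Python 3

def download_listed(listfile, outdir):
    # fetch every URL listed one-per-line in listfile into outdir,
    # skipping files that are already present
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    with open(listfile) as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            # name the local file after the URL's last path segment
            target = os.path.join(outdir, url.rsplit('/', 1)[-1])
            if not os.path.exists(target):
                urlretrieve(url, target)
```

Naming local files after the URL's last path segment also makes repeat runs resumable: already-downloaded files are skipped.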
1
miketeam 2016-10-28 23:59:59 +08:00 via iPhone
Mark
2
tumbzzc OP Forgot to deduplicate! In the write function:
videos = list(set(videos))
pictures = list(set(pictures))
3
sammiriam 2016-10-29 01:08:48 +08:00
mark, will take another look tomorrow morning
4
wjm2038 2016-10-29 01:15:12 +08:00 via Android
mark
5
weipang 2016-10-29 06:22:08 +08:00 via iPhone
But I don't know how to use it
6
cszhiyue 2016-10-29 07:58:37 +08:00
8
aksoft 2016-10-29 09:17:48 +08:00
A real necessity! Selling Nutri-Express!
13
programdog 2016-10-29 09:53:59 +08:00
Thanks, OP
15
freaks 2016-10-29 10:06:58 +08:00 via Android
There are so many online parsers that do this (⊙o⊙)! 😂
16
0915240 2016-10-29 10:13:11 +08:00
olddrivertaketakeme
17
Nicksxs 2016-10-29 10:23:13 +08:00
Isn't it blocked? Do you download it on a VPS?
19
exoticknight 2016-10-29 10:51:17 +08:00
This is good, but I think scraping the names of the tumblrs that provide the resources is more important
21
tumbzzc OP @exoticknight Nothing I can do about the names
22
guokeke 2016-10-29 11:54:29 +08:00 via Android
Mark
23
cevincheung 2016-10-29 11:58:33 +08:00
And then you can just wget them?
24
exalex 2016-10-29 12:09:26 +08:00
Could you briefly describe what the crawler actually does...
25
guonning 2016-10-29 16:51:33 +08:00 via Android
Bookmarked
26
LeoEatle 2016-10-29 20:34:31 +08:00
What should I change name to? Could you share a list? :)
27
yangonee 2016-10-29 21:12:00 +08:00
Asking for a name_list
28
lycos 2016-10-29 23:48:36 +08:00 via iPad
mark
29
leetom 2016-10-30 00:07:26 +08:00
@cszhiyue
Halfway through the download I get this:
Traceback (most recent call last):
  File "turmla.py", line 150, in <module>
    for square in tqdm(pool.imap_unordered(download_base_dir, urls), total=len(urls)):
  File "/home/leetom/.pyenv/versions/2.7.10/lib/python2.7/site-packages/tqdm/_tqdm.py", line 713, in __iter__
    for obj in iterable:
  File "/home/leetom/.pyenv/versions/2.7.10/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
Exception: Unexpected response.
30
thinks 2016-10-30 10:22:00 +08:00 via Android
Mark. Ah, these old drivers start the bus at the drop of a hat.
31
sangmong 2016-10-30 21:59:24 +08:00
mark
32
errorlife 2016-10-31 01:58:11 +08:00
Does nobody know about www.tumblrget.com?
35
Nutlee 2016-10-31 09:52:35 +08:00
Strategic Mark
36
iewgnaw 2016-10-31 11:12:14 +08:00
Isn't there a ready-made API?
38
znoodl 2016-10-31 12:19:27 +08:00 via iPhone
I crawled it with golang too... stopped after it got blocked.
39
Layne 2016-10-31 13:01:29 +08:00
Quietly upvoting :)
40
itqls 2016-10-31 14:57:08 +08:00
Up to something all day long
41
weaming 2016-10-31 17:37:37 +08:00
Stirring things up, stirring things up
43
GreatMartial 2016-11-01 14:55:39 +08:00 via Android
@tumbzzc OP, I want to visit your site, I want to be your fan 😄
44
tumbzzc OP @GreatMartial Not suitable for minors, hahaha
45
firefox12 2016-11-01 15:26:01 +08:00
46
Doggy 2016-11-05 10:54:13 +08:00
import urllib
i = 0
with open('pictures.txt', 'r') as fobj:
    for eachline in fobj:
        pngurl = eachline.strip()
        filename = './/getpic//test-{0}.jpg'.format(i)
        print '[-]parsing:{0}'.format(filename)
        urllib.urlretrieve(pngurl, filename)
        i += 1
47
dickeny 2016-11-06 22:05:57 +08:00
for i in range(0, total, 50): queue.put(ul+str(i))
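The suggestion above checks out: the list comprehension in getpost and `range(0, total, 50)` produce identical page offsets for any total under the original 1000-page cap.

```python
# the offset list built in getpost() versus the range() one-liner
for total in (1, 49, 50, 123, 4999):
    a = [i * 50 for i in range(1000) if i * 50 - total < 0]
    assert a == list(range(0, total, 50))
```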
48
hard2reg 2016-11-27 23:58:38 +08:00
Having read this, I feel like I learned Python for nothing...
Other people's crawlers have multithreading, queues, and classes. Mine are all... while if for ....
50
dr3am 2017-10-31 17:43:43 +08:00
Asking for OP's website
52
giveupAK47 2018-09-22 18:22:54 +08:00
May I ask what your blog is? I'd like to study crawlers in more depth.