How do I deduplicate with MongoDB in Scrapy, and how should the download pipeline be managed?

  Ewig · 2018-10-09 15:04:57 +08:00 · 2773 views

    from scrapy.pipelines.files import FilesPipeline
    from scrapy import Request
    from scrapy.conf import settings
    import pymongo


    class XiaoMiQuanPipeLines(object):
        def __init__(self):
            host = settings["MONGODB_HOST"]
            port = settings["MONGODB_PORT"]
            dbname = settings["MONGODB_DBNAME"]
            sheetname = settings["MONGODB_SHEETNAME"]

            client = pymongo.MongoClient(host=host, port=port)
            mydb = client[dbname]
            self.post = mydb[sheetname]

        def process_item(self, item):
            url = item['file_url']
            name = item['name']

            result = self.post.aggregate(
                [
                    {"$group": {"_id": {"url": url, "name": name}}}
                ]
            )
            if result:
                pass
            else:
                self.post.insert({"url": url, "name": name})
            return item


    class DownLoadPipelines(FilesPipeline):

        def file_path(self, request, response=None, info=None):
            return request.meta.get('filename', '')

        def get_media_requests(self, item, info):
            file_url = item['file_url']
            meta = {'filename': item['name']}
            yield Request(url=file_url, meta=meta)

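    I have written two pipelines here. The first one checks for duplicates: if the item is a duplicate, it should not be downloaded; if it is not, it gets written to the database and then downloaded. I am using aggregate to deduplicate on a composite key (url and name together).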
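One possible way to get the behaviour described above is sketched below. It is not the original poster's code: it swaps `aggregate` for `find_one`, because `aggregate` returns a cursor object that, as far as I can tell, is truthy even when nothing matches, so the `else` branch would never run. The class name `DedupPipeline`, the stored field names, and the module path in the settings comment are assumptions made for illustration. Raising `DropItem` in an earlier pipeline stops the item from reaching later ones, so the download pipeline only sees new items.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch can be read without Scrapy installed
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class DedupPipeline:
    """Drop items whose (url, name) pair is already stored (hypothetical name)."""

    def __init__(self, collection):
        # Any object with pymongo-style find_one / insert_one methods works.
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        # Reads the same settings keys as the original post, but through
        # crawler.settings instead of the removed scrapy.conf module.
        import pymongo
        s = crawler.settings
        client = pymongo.MongoClient(s["MONGODB_HOST"], s.getint("MONGODB_PORT"))
        return cls(client[s["MONGODB_DBNAME"]][s["MONGODB_SHEETNAME"]])

    def process_item(self, item, spider):
        key = {"url": item["file_url"], "name": item["name"]}
        # find_one returns None when no stored document matches both fields.
        if self.collection.find_one(key) is not None:
            raise DropItem("duplicate: %s" % item["file_url"])
        self.collection.insert_one(key)
        return item


# settings.py: lower numbers run first, so the dedup check precedes the download.
# The module path "myproject.pipelines" is hypothetical.
# ITEM_PIPELINES = {
#     "myproject.pipelines.DedupPipeline": 100,
#     "myproject.pipelines.DownLoadPipelines": 200,
# }
```

Note that a check followed by an insert is not atomic; if several processes write to the same collection, a unique index is the safer guard.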
    9 replies · 2019-01-24 18:33:58 +08:00
    watsy0007 · #1 · 2018-10-09 16:22:08 +08:00
    ```python
    from pymongo import MongoClient

    # `config` (which provides MONGO_URL) and `generator_summary` are not
    # defined in this snippet.


    class MongoCache:
        db = None

        def __init__(self):
            # Connect once and share the database handle across instances.
            if MongoCache.db is None:
                MongoCache.create_instance()

        @staticmethod
        def create_instance():
            client = MongoClient(config.MONGO_URL)
            MongoCache.db = client['spider']

        def create(self, table, unique_key, origin_data):
            # Insert only when no document with this unique_key exists yet.
            if self.exists(table, unique_key):
                return None

            summaries = {k: generator_summary(v) for (k, v) in origin_data.items()}

            # insert_one replaces the deprecated Collection.insert.
            return self.db[table].insert_one({
                'unique_key': unique_key,
                'data': origin_data,
                'summaries': summaries
            }).inserted_id

        def get(self, table, unique_key):
            data = self.db[table].find_one({'unique_key': unique_key})
            if data is None:
                return None
            return data['data']

        def exists(self, table, unique_key):
            data = self.db[table].find_one({'unique_key': unique_key})
            return data is not None

        def is_changed(self, table, unique_key, origin_data):
            # True when the record is new or any field's summary differs.
            if not self.exists(table, unique_key):
                return True

            last_summaries = self.db[table].find_one({'unique_key': unique_key})['summaries']
            for (k, v) in origin_data.items():
                summary = generator_summary(v)
                last_summary = last_summaries.get(k, None)
                if last_summary is None or last_summary != summary:
                    return True
            return False

        def change_fields(self, table, unique_key, origin_data):
            # Return only the fields whose summary differs from the stored one.
            if not self.exists(table, unique_key):
                return origin_data
            changes = {}
            last_summaries = self.db[table].find_one({'unique_key': unique_key})['summaries']
            for (k, v) in origin_data.items():
                last_summary = last_summaries.get(k, None)
                if last_summary is None or last_summary != generator_summary(v):
                    changes[k] = v
            return changes

        def update(self, table, unique_key, origin_data):
            if not self.exists(table, unique_key):
                return origin_data
            new_summaries = {k: generator_summary(v) for (k, v) in origin_data.items()}
            self.db[table].update_one({'unique_key': unique_key},
                                      {'$set': {'data': origin_data, 'summaries': new_summaries}})
            return origin_data
    ```
    watsy0007 · #2 · 2018-10-09 16:24:07 +08:00
    V2EX doesn't support Markdown...

    https://gist.github.com/watsy0007/779c27fb0ceab283cc434b5eec10b7c4

    It wraps the common methods I use for handling the data.
    picone · #3 · 2018-10-09 20:47:42 +08:00
    I just add a unique index in MongoDB and catch the duplicate-key exception.
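A minimal sketch of the unique-index approach follows, assuming the goal is to treat a document as a duplicate only when both fields match. The helper names `ensure_unique_index` and `insert_if_new` and the field names `file_url` and `name` are illustrative assumptions, not code from this thread; `create_index`, `insert_one`, and `DuplicateKeyError` are pymongo's own.

```python
try:
    from pymongo import ASCENDING
    from pymongo.errors import DuplicateKeyError
except ImportError:  # fallback so the sketch can be read without pymongo installed
    ASCENDING = 1

    class DuplicateKeyError(Exception):
        """Stand-in for pymongo.errors.DuplicateKeyError."""


def ensure_unique_index(collection):
    # Compound unique index: MongoDB rejects a second document only when
    # file_url AND name are both equal to an existing document's values.
    collection.create_index(
        [("file_url", ASCENDING), ("name", ASCENDING)], unique=True
    )


def insert_if_new(collection, file_url, name):
    """Return True if the document was inserted, False if it already existed."""
    try:
        collection.insert_one({"file_url": file_url, "name": name})
    except DuplicateKeyError:
        return False
    return True
```

Because the database enforces the constraint, this stays correct even when several crawler processes insert at the same time, which a separate lookup followed by an insert cannot guarantee.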
    Ewig (OP) · #4 · 2018-10-12 12:29:00 +08:00
    @picone Is yours a composite key? I mean url and name together.
    picone · #5 · 2018-10-12 13:33:01 +08:00
    Ewig (OP) · #6 · 2018-10-12 17:31:35 +08:00
    @picone db.XiaoMiQuan.find()
    { "_id" : ObjectId("5bbf14dbc96b5b3f5627d11d"), "file_url" : "https://baogaocos.seedsufe.com/2018/07/19/doc_1532004923556.pdf", "name" : "AMCHAM-中国的“一带一路”:对美国企业的影响(英文)-2018.6-8 页.pdf" }
    This is how I am writing it now. Is this right?
    pyfrog · #7 · 2019-01-24 17:37:46 +08:00
    @Ewig Want me to send you all the PDFs from that site?
    Ewig (OP) · #8 · 2019-01-24 18:04:30 +08:00
    @pyfrog Doesn't their site keep getting updated, though?
    pyfrog · #9 · 2019-01-24 18:33:58 +08:00
    @Ewig Yes, it does. I'll just give you the server directly.