https://github.com/TeamHG-Memex/scrapy-crawl-once

Scrapy middleware that allows crawling only new content

Host: GitHub
URL: https://github.com/TeamHG-Memex/scrapy-crawl-once
Owner: TeamHG-Memex
License: mit
Created: 2017-03-02T23:07:01.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2022-10-31T19:28:53.000Z (about 3 years ago)
Last Synced: 2024-09-20T06:13:17.129Z (about 1 year ago)
Topics: scrapy
Language: Python
Size: 14.6 KB
Stars: 79
Watchers: 8
Forks: 23
Open Issues: 6
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
- License: LICENSE.txt

Awesome Lists containing this project

awesome - scrapy-crawl-once - Scrapy middleware which allows to crawl only new content (Scrapy Middleware)
awesome-scrapy - scrapy-crawl-once - crawling pages which were already downloaded in previous crawls. (Apps / Other Useful Extensions)

README

scrapy-crawl-once
=================

.. image:: https://img.shields.io/pypi/v/scrapy-crawl-once.svg
   :target: https://pypi.python.org/pypi/scrapy-crawl-once
   :alt: PyPI Version

.. image:: https://travis-ci.org/TeamHG-Memex/scrapy-crawl-once.svg?branch=master
   :target: http://travis-ci.org/TeamHG-Memex/scrapy-crawl-once
   :alt: Build Status

.. image:: http://codecov.io/github/TeamHG-Memex/scrapy-crawl-once/coverage.svg?branch=master
   :target: http://codecov.io/github/TeamHG-Memex/scrapy-crawl-once?branch=master
   :alt: Code Coverage

This package provides a Scrapy_ middleware that avoids re-crawling
pages which were already downloaded in previous crawls.

.. _Scrapy: https://scrapy.org/

License is MIT.

Installation
------------

::

    pip install scrapy-crawl-once

Usage
-----

To enable it, modify your settings.py::

    SPIDER_MIDDLEWARES = {
        # ...
        'scrapy_crawl_once.CrawlOnceMiddleware': 100,
        # ...
    }

    DOWNLOADER_MIDDLEWARES = {
        # ...
        'scrapy_crawl_once.CrawlOnceMiddleware': 50,
        # ...
    }

By default it does nothing. To avoid crawling a particular page
multiple times, set ``request.meta['crawl_once'] = True``. When a response
is received and the callback succeeds, the fingerprint of that request
is stored in a database. When the spider schedules a new request, the
middleware first checks whether its fingerprint is in the database, and
drops the request if it is.
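
The store-then-check flow described above can be sketched with the standard
library alone. This is only an illustration of the idea, not the middleware's
actual code: the ``fingerprint`` function, the ``SeenRequests`` class and the
table layout are all hypothetical names made up for this example.

```python
import hashlib
import sqlite3
import time


def fingerprint(method, url, body=b""):
    """Hypothetical stand-in for a request fingerprint (not Scrapy's own)."""
    h = hashlib.sha1()
    h.update(method.upper().encode("utf-8"))
    h.update(url.encode("utf-8"))
    h.update(body)
    return h.hexdigest()


class SeenRequests:
    """Toy fingerprint store backed by SQLite (illustration only)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY, value REAL)"
        )

    def should_drop(self, key):
        # A request is dropped when its key was stored by an earlier crawl.
        row = self.db.execute(
            "SELECT 1 FROM seen WHERE key = ?", (key,)
        ).fetchone()
        return row is not None

    def mark_crawled(self, key, value=None):
        # Called after a successful callback; a timestamp is the default value.
        self.db.execute(
            "INSERT OR REPLACE INTO seen (key, value) VALUES (?, ?)",
            (key, time.time() if value is None else value),
        )
        self.db.commit()


store = SeenRequests()
key = fingerprint("GET", "https://example.com/page/1")
first_visit_dropped = store.should_drop(key)   # nothing stored yet
store.mark_crawled(key)                        # the callback succeeded
second_visit_dropped = store.should_drop(key)  # fingerprint is now known
```

The first check lets the request through, and the second one reports that it
should be dropped, which mirrors the behaviour described in the paragraph
above.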

Other ``request.meta`` keys:

* ``crawl_once_value`` - a value to store in the DB. By default, a timestamp
  is stored.
* ``crawl_once_key`` - unique id of the request; by default,
  request_fingerprint is used.
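
As a rough sketch of how the three meta keys fit together, the helper below
builds a ``meta`` dict for one request. The function name ``crawl_once_meta``
and its ``item_id`` and ``revision`` arguments are hypothetical and exist only
for this example; the dict keys are the ones documented above.

```python
def crawl_once_meta(item_id, revision):
    """Build request.meta entries for the middleware.

    ``item_id`` and ``revision`` are hypothetical application values.
    """
    return {
        "crawl_once": True,                     # opt this request in
        "crawl_once_key": "item-%s" % item_id,  # custom id instead of the fingerprint
        "crawl_once_value": revision,           # stored instead of a timestamp
    }


meta = crawl_once_meta(42, "2017-03-02")
```

In a spider, such a dict would be passed as the ``meta`` argument of
``scrapy.Request``. A custom key may help when several URLs lead to the same
logical page.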

Settings
--------

* ``CRAWL_ONCE_ENABLED`` - set it to False to disable the middleware.
  Default is True.
* ``CRAWL_ONCE_PATH`` - a path to a folder with the database of crawled
  requests. By default, the ``.scrapy/crawl_once/`` path inside the project
  dir is used; this folder contains ``.sqlite`` files with databases of
  seen requests.
* ``CRAWL_ONCE_DEFAULT`` - default value for the ``crawl_once`` meta key
  (False by default). When True, all requests are handled by
  this middleware unless disabled explicitly using
  ``request.meta['crawl_once'] = False``.
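
A possible ``settings.py`` fragment combining these options is shown below.
The path is a made-up example, and ``opt_out_meta`` is a hypothetical name for
the meta dict of a request that should always be crawled again.

```python
# settings.py (sketch; the values are examples, not recommendations)
CRAWL_ONCE_ENABLED = True                       # the default, shown for clarity
CRAWL_ONCE_PATH = "/tmp/my_project/crawl_once"  # hypothetical folder
CRAWL_ONCE_DEFAULT = True                       # handle every request by default

# In a spider: meta for a request that opts out explicitly,
# e.g. a listing page that has to be re-fetched on every crawl.
opt_out_meta = {"crawl_once": False}
```

With ``CRAWL_ONCE_DEFAULT = True``, only requests carrying the opt-out flag
bypass the middleware.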

Alternatives
------------

https://github.com/scrapy-plugins/scrapy-deltafetch is a similar package; it
does almost the same thing. Differences:

* scrapy-deltafetch chooses whether to discard a request or not based on
  yielded items; scrapy-crawl-once uses an explicit
  ``request.meta['crawl_once']`` flag.
* scrapy-deltafetch uses bsddb3, scrapy-crawl-once uses sqlite.

Another alternative is the built-in `Scrapy HTTP cache`_. Differences:

* scrapy cache stores all pages on disk, scrapy-crawl-once only keeps request
  fingerprints;
* scrapy cache allows a more fine-grained invalidation consistent with how
  browsers work;
* with scrapy cache all pages are still processed (though not all pages are
  downloaded).
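
For comparison, enabling the HTTP cache is also done in ``settings.py``. The
fragment below is a sketch that assumes the RFC 2616 policy is the
browser-like invalidation meant above; see the Scrapy documentation for the
authoritative list of options.

```python
# settings.py (sketch of the HTTP cache alternative)
HTTPCACHE_ENABLED = True
# Assumption: this policy follows HTTP caching headers, as browsers do.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_DIR = "httpcache"  # full responses are stored here, not just fingerprints
```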

.. _Scrapy HTTP cache: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpcache

Contributing
------------

* source code: https://github.com/TeamHG-Memex/scrapy-crawl-once
* bug tracker: https://github.com/TeamHG-Memex/scrapy-crawl-once/issues

To run tests, install tox_ and run ``tox`` from the source checkout.

.. _tox: https://tox.readthedocs.io/en/latest/

----

.. image:: https://hyperiongray.s3.amazonaws.com/define-hg.svg
   :target: https://www.hyperiongray.com/?pk_campaign=github&pk_kwd=scrapy-crawl-once
   :alt: define hyperiongray
