Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/scrapy-plugins/scrapy-deltafetch
Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls
https://github.com/scrapy-plugins/scrapy-deltafetch
hacktoberfest hacktoberfest2021
Last synced: 2 days ago
JSON representation
Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls
- Host: GitHub
- URL: https://github.com/scrapy-plugins/scrapy-deltafetch
- Owner: scrapy-plugins
- Created: 2016-06-16T18:21:12.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2021-10-17T17:36:33.000Z (over 3 years ago)
- Last Synced: 2024-09-19T12:08:37.225Z (4 months ago)
- Topics: hacktoberfest, hacktoberfest2021
- Language: Python
- Homepage:
- Size: 52.7 KB
- Stars: 264
- Watchers: 15
- Forks: 48
- Open Issues: 18
-
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
Awesome Lists containing this project
- awesome - scrapy-deltafetch - Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls (Scrapy Middleware)
- awesome-scrapy - scrapy-deltafetch
README
=================
scrapy-deltafetch
=================
.. image:: https://github.com/scrapy-plugins/scrapy-deltafetch/workflows/CI/badge.svg
:target: https://github.com/scrapy-plugins/scrapy-deltafetch/actions.. image:: https://img.shields.io/pypi/pyversions/scrapy-deltafetch.svg
:target: https://pypi.python.org/pypi/scrapy-deltafetch.. image:: https://img.shields.io/pypi/v/scrapy-deltafetch.svg
:target: https://pypi.python.org/pypi/scrapy-deltafetch.. image:: https://img.shields.io/pypi/l/scrapy-deltafetch.svg
:target: https://pypi.python.org/pypi/scrapy-deltafetch.. image:: https://img.shields.io/pypi/dm/scrapy-deltafetch.svg
:target: https://pypistats.org/packages/scrapy-deltafetch
:alt: Downloads countThis is a Scrapy spider middleware to ignore requests
to pages seen in previous crawls of the same spider,
thus producing a "delta crawl" containing only new requests.This also speeds up the crawl, by reducing the number of requests that need
to be crawled, and processed (typically, item requests are the most CPU
intensive).DeltaFetch middleware uses Python's dbm_ package to store requests fingerprints.
.. _dbm: https://docs.python.org/3/library/dbm.html
Installation
============Install scrapy-deltafetch using ``pip``::
$ pip install scrapy-deltafetch
Configuration
=============1. Add DeltaFetch middleware by including it in ``SPIDER_MIDDLEWARES``
in your ``settings.py`` file::SPIDER_MIDDLEWARES = {
'scrapy_deltafetch.DeltaFetch': 100,
}Here, priority ``100`` is just an example.
Set its value depending on other middlewares you may have enabled already.2. Enable the middleware using ``DELTAFETCH_ENABLED`` in your ``settings.py``::
DELTAFETCH_ENABLED = True
Usage
=====Following are the different options to control DeltaFetch middleware
behavior.Supported Scrapy settings
-------------------------* ``DELTAFETCH_ENABLED`` — to enable (or disable) this extension
* ``DELTAFETCH_DIR`` — directory where to store state
* ``DELTAFETCH_RESET`` — reset the state, clearing out all seen requestsThese usually go in your Scrapy project's ``settings.py``.
Supported Scrapy spider arguments
---------------------------------* ``deltafetch_reset`` — same effect as DELTAFETCH_RESET setting
Example::
$ scrapy crawl example -a deltafetch_reset=1
Supported Scrapy request meta keys
----------------------------------* ``deltafetch_key`` — used to define the lookup key for that request. by
default it's Scrapy's default Request fingerprint function,
but it can be changed to contain an item id, for example.
This requires support from the spider, but makes the extension
more efficient for sites that many URLs for the same item.* ``deltafetch_enabled`` - if set to False it will disable deltafetch for some
specific request