Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/rmax/scrapy-redis
Redis-based components for Scrapy.
- Host: GitHub
- URL: https://github.com/rmax/scrapy-redis
- Owner: rmax
- License: mit
- Created: 2011-08-29T04:06:23.000Z (over 13 years ago)
- Default Branch: master
- Last Pushed: 2024-07-06T21:54:35.000Z (6 months ago)
- Last Synced: 2024-10-29T14:55:49.113Z (3 months ago)
- Topics: crawler, distributed, redis, scrapy
- Language: Python
- Homepage: http://scrapy-redis.readthedocs.io
- Size: 228 KB
- Stars: 5,523
- Watchers: 273
- Forks: 1,591
- Open Issues: 31
Metadata Files:
- Readme: README.rst
- Changelog: HISTORY.rst
- Contributing: CONTRIBUTING.rst
- License: LICENSE
- Authors: AUTHORS.rst
Awesome Lists containing this project
- awesome-scrapy - scrapy-redis - Redis-based components for Scrapy. (Apps / Distributed Spider)
README
============
Scrapy-Redis
============

.. image:: https://readthedocs.org/projects/scrapy-redis/badge/?version=latest
   :alt: Documentation Status
   :target: https://readthedocs.org/projects/scrapy-redis/?badge=latest

.. image:: https://img.shields.io/pypi/v/scrapy-redis.svg
   :target: https://pypi.python.org/pypi/scrapy-redis

.. image:: https://img.shields.io/pypi/pyversions/scrapy-redis.svg
   :target: https://pypi.python.org/pypi/scrapy-redis

.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/builds.yml/badge.svg
   :target: https://github.com/rmax/scrapy-redis/actions/workflows/builds.yml

.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/checks.yml/badge.svg
   :target: https://github.com/rmax/scrapy-redis/actions/workflows/checks.yml

.. image:: https://github.com/rmax/scrapy-redis/actions/workflows/tests.yml/badge.svg
   :target: https://github.com/rmax/scrapy-redis/actions/workflows/tests.yml

.. image:: https://codecov.io/github/rmax/scrapy-redis/coverage.svg?branch=master
   :alt: Coverage Status
   :target: https://codecov.io/github/rmax/scrapy-redis

.. image:: https://img.shields.io/badge/security-bandit-green.svg
   :alt: Security Status
   :target: https://github.com/rmax/scrapy-redis
Redis-based components for Scrapy.

* Usage: https://github.com/rmax/scrapy-redis/wiki/Usage
* Documentation: https://github.com/rmax/scrapy-redis/wiki
* Release: https://github.com/rmax/scrapy-redis/wiki/History
* Contribution: https://github.com/rmax/scrapy-redis/wiki/Getting-Started
* LICENSE: MIT license

Features
--------

* Distributed crawling/scraping

  You can start multiple spider instances that share a single Redis queue.
  Best suited for broad, multi-domain crawls. See the settings sketch after
  this list.

* Distributed post-processing

  Scraped items get pushed into a Redis queue, meaning that you can start as
  many post-processing processes as needed, all sharing the items queue. A
  standalone worker sketch appears at the end of the Installation section.

* Scrapy plug-and-play components

  Scheduler + Duplication Filter, Item Pipeline, Base Spiders.
* In this forked version: added support for JSON-formatted data in Redis

  Each entry contains ``url``, ``meta`` and other optional parameters, where
  ``meta`` is a nested JSON object carrying sub-data. This feature extracts
  that data and sends another ``FormRequest`` with the ``url``, ``meta`` and
  additional ``formdata``. For example:

  .. code-block:: json

     { "url": "https://example.com", "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"}, "url_cookie_key": "fertxsas" }

  This data can be accessed in the Scrapy spider through the response, e.g.
  ``request.url``, ``request.meta``, ``request.cookies``. A spider sketch
  follows this list.
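The plug-and-play components above are wired in through the project settings.
Below is a minimal sketch using the setting names documented by scrapy-redis;
the ``REDIS_URL`` value is an assumption for a local Redis instance:

.. code-block:: python

   # settings.py -- a minimal sketch enabling the shared scheduler,
   # duplication filter and item pipeline; REDIS_URL assumes a local Redis.

   # Route all requests through the Redis-backed scheduler so every
   # spider instance shares a single queue.
   SCHEDULER = "scrapy_redis.scheduler.Scheduler"

   # Deduplicate requests across all spider instances via Redis.
   DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

   # Keep the queue between runs instead of clearing it on close.
   SCHEDULER_PERSIST = True

   # Push scraped items to the '<spider>:items' Redis list so separate
   # post-processing workers can consume them.
   ITEM_PIPELINES = {
       "scrapy_redis.pipelines.RedisPipeline": 300,
   }

   REDIS_URL = "redis://localhost:6379"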
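For the JSON queue entries described in the last bullet,
``RedisSpider.make_request_from_data`` is the hook that turns a popped Redis
entry into a request. The following is a hedged sketch of the idea, not the
fork's exact code; the spider name, key and field handling are illustrative:

.. code-block:: python

   # A sketch of consuming JSON queue entries; spider name, redis_key and
   # field names are placeholders, not the fork's exact implementation.
   import json

   from scrapy import FormRequest
   from scrapy_redis.spiders import RedisSpider


   class JsonQueueSpider(RedisSpider):
       name = "myspider"
       redis_key = "myspider:start_urls"

       def make_request_from_data(self, data):
           # Each Redis entry is a JSON document instead of a bare URL.
           payload = json.loads(data)
           return FormRequest(
               payload["url"],
               meta=payload.get("meta", {}),
               formdata=payload.get("formdata"),
           )

       def parse(self, response):
           # The queued metadata travels with the request.
           yield {"url": response.request.url, "meta": response.request.meta}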
.. note:: These features cover the basic case of distributing the workload
   across multiple workers. If you need more features like URL expiration,
   advanced URL prioritization, etc., we suggest you take a look at the
   Frontera_ project.

Requirements
------------

* Python 3.7+
* Redis >= 5.0
* ``Scrapy`` >= 2.0
* ``redis-py`` >= 4.0

Installation
------------

From pip:
.. code-block:: bash

   pip install scrapy-redis
From GitHub:

.. code-block:: bash

   git clone https://github.com/darkrho/scrapy-redis.git
   cd scrapy-redis
   python setup.py install

.. note:: To use the JSON-formatted data feature, make sure you have not
   installed scrapy-redis through pip. If you already did, uninstall it
   first:

.. code-block:: bash

   pip uninstall scrapy-redis
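Once installed and configured, a queue-fed spider only needs a ``redis_key``:
instances block on that Redis list and pop entries as they arrive. A minimal
sketch follows (the spider name, key and parsing logic are placeholders):

.. code-block:: python

   # myspider.py -- a minimal queue-fed spider sketch; the name, key and
   # parsing logic are placeholders.
   from scrapy_redis.spiders import RedisSpider


   class MySpider(RedisSpider):
       name = "myspider"
       # Instances block on this Redis list and pop URLs as they arrive.
       redis_key = "myspider:start_urls"

       def parse(self, response):
           yield {
               "url": response.url,
               "title": response.css("title::text").get(),
           }

Start as many instances as you like with ``scrapy crawl myspider``, then feed
the queue, e.g. ``redis-cli lpush myspider:start_urls https://example.com``.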
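On the post-processing side, items serialized by ``RedisPipeline`` land in the
``<spider>:items`` list and can be consumed by any number of standalone
workers. A sketch, assuming the default key and a local Redis:

.. code-block:: python

   # process_items.py -- a standalone post-processing worker sketch,
   # assuming RedisPipeline's default '<spider>:items' key and a local Redis.
   import json

   import redis

   r = redis.Redis.from_url("redis://localhost:6379")

   while True:
       # BLPOP blocks until an item is available, so idle workers just wait.
       _key, raw = r.blpop("myspider:items")
       item = json.loads(raw)
       print(item)  # replace with real post-processing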
Alternative Choice
------------------

Frontera_ is a web crawling framework consisting of a `crawl frontier`_ and
distribution/scaling primitives, allowing you to build large-scale online web
crawlers.
.. _Frontera: https://github.com/scrapinghub/frontera
.. _crawl frontier: http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html