https://github.com/scrapy-plugins/scrapy-hcf
Scrapy spider middleware to use Scrapinghub's Hub Crawl Frontier as a backend for URLs
- Host: GitHub
- URL: https://github.com/scrapy-plugins/scrapy-hcf
- Owner: scrapy-plugins
- License: bsd-3-clause
- Created: 2016-07-18T13:24:39.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2018-08-28T22:00:34.000Z (almost 8 years ago)
- Last Synced: 2025-05-18T01:18:57.843Z (about 1 year ago)
- Language: Python
- Size: 10.7 KB
- Stars: 4
- Watchers: 5
- Forks: 6
- Open Issues: 2
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
- License: LICENSE
==========
scrapy-hcf
==========
.. image:: https://travis-ci.org/scrapy-plugins/scrapy-hcf.svg?branch=master
   :target: https://travis-ci.org/scrapy-plugins/scrapy-hcf

.. image:: https://codecov.io/gh/scrapy-plugins/scrapy-hcf/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/scrapy-plugins/scrapy-hcf
This Scrapy spider middleware uses the Hub Crawl Frontier (HCF) backend from
Scrapinghub's Scrapy Cloud service to retrieve new URLs to crawl
and to store the extracted links back in the frontier.
Installation
============
Install scrapy-hcf using ``pip``::

    $ pip install scrapy-hcf
Configuration
=============
To activate this middleware, add it to the ``SPIDER_MIDDLEWARES``
dict, for example::

    SPIDER_MIDDLEWARES = {
        'scrapy_hcf.HcfMiddleware': 543,
    }
The following settings must also be defined:

``HS_AUTH``
    Scrapy Cloud API key.

``HS_PROJECTID``
    Scrapy Cloud project ID (not needed if the spider is run on dash).

``HS_FRONTIER``
    Frontier name.

``HS_CONSUME_FROM_SLOT``
    Slot from which the spider reads new URLs.

Note that ``HS_FRONTIER`` and ``HS_CONSUME_FROM_SLOT`` can be overridden
from inside a spider using the spider attributes ``hs_frontier``
and ``hs_consume_from_slot`` respectively.
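As an illustration, the required settings might be collected in a project's ``settings.py`` roughly as follows. This is a minimal sketch: the environment variable name ``HCF_API_KEY`` and all values are hypothetical placeholders, not names or defaults used by scrapy-hcf.

```python
import os

# Enable the middleware (same entry as in the Configuration section).
SPIDER_MIDDLEWARES = {
    "scrapy_hcf.HcfMiddleware": 543,
}

# Read the API key from the environment instead of hard-coding it.
# HCF_API_KEY is a hypothetical variable name chosen for this sketch.
HS_AUTH = os.environ.get("HCF_API_KEY", "")

# Placeholder values; replace them with your own project, frontier and slot.
HS_PROJECTID = "12345"
HS_FRONTIER = "example-frontier"
HS_CONSUME_FROM_SLOT = "0"
```

A spider that needs a different frontier or slot can set ``hs_frontier`` or ``hs_consume_from_slot`` as attributes instead of changing these project-wide values.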
The following optional Scrapy settings can be defined:

``HS_ENDPOINT``
    URL of the API endpoint, e.g. ``http://localhost:8003``.
    The default value is provided by the python-hubstorage package.

``HS_MAX_LINKS``
    Number of links to read from the HCF. The default is 1000.

``HS_START_JOB_ENABLED``
    Whether to start a new job when the spider finishes.
    The default is ``False``.

``HS_START_JOB_ON_REASON``
    A list of closing reasons. If the spider ends with any of these reasons,
    a new job is started for the same slot. The default is ``['finished']``.

``HS_NUMBER_OF_SLOTS``
    Number of slots that the middleware uses to store the new links.
    The default is 8.
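The interaction between ``HS_START_JOB_ENABLED`` and ``HS_START_JOB_ON_REASON`` can be sketched as below. The helper ``should_start_new_job`` and the ``OPTIONAL_DEFAULTS`` dict are hypothetical names written for this illustration; they restate the rule and defaults documented above and are not taken from the middleware's source.

```python
# Documented defaults for the optional settings (restated here by hand).
OPTIONAL_DEFAULTS = {
    "HS_MAX_LINKS": 1000,
    "HS_START_JOB_ENABLED": False,
    "HS_START_JOB_ON_REASON": ["finished"],
    "HS_NUMBER_OF_SLOTS": 8,
}


def should_start_new_job(settings, close_reason):
    """Return True when the documented rule says a new job would be started.

    Hypothetical helper: `settings` is a plain dict of overrides, and
    `close_reason` is the reason the spider was closed with.
    """
    merged = {**OPTIONAL_DEFAULTS, **settings}
    return bool(merged["HS_START_JOB_ENABLED"]) and (
        close_reason in merged["HS_START_JOB_ON_REASON"]
    )
```

With the defaults, no new job is started; once ``HS_START_JOB_ENABLED`` is ``True``, only the listed closing reasons trigger one.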
Usage
=====
The following keys can be defined in a Scrapy ``Request`` meta to control the behavior
of the HCF middleware:

``'use_hcf'``
    If set to ``True``, the request is stored in the HCF.

``'hcf_params'``
    Dictionary of parameters to be stored in the HCF with the request fingerprint:

    ``'qdata'``
        Data to be stored along with the fingerprint in the request queue.

    ``'fdata'``
        Data to be stored along with the fingerprint in the fingerprint set.

    ``'p'``
        Priority. Lower priority numbers are returned first. The default is 0.

The value of the ``'qdata'`` parameter can be retrieved later using
``response.meta['hcf_params']['qdata']``.
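The meta keys above can be assembled with a small helper. The function ``hcf_meta`` below is a hypothetical convenience written for this sketch, not part of scrapy-hcf; it only builds the dictionary layout described in this section.

```python
def hcf_meta(qdata=None, fdata=None, priority=0):
    """Build a Request meta dict that asks the middleware to store the request.

    Hypothetical helper: only the keys documented above are used.
    """
    params = {"p": priority}  # lower numbers are returned first
    if qdata is not None:
        params["qdata"] = qdata  # stored with the fingerprint in the request queue
    if fdata is not None:
        params["fdata"] = fdata  # stored with the fingerprint in the fingerprint set
    return {"use_hcf": True, "hcf_params": params}


# Example meta for a request that carries its page number through the frontier.
example_meta = hcf_meta(qdata={"page": 2}, priority=1)
```

Inside a spider callback the result would typically be passed as ``scrapy.Request(url, meta=hcf_meta(...))``, and the stored ``qdata`` read back from ``response.meta['hcf_params']['qdata']``.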
The spider can override the default slot assignment function by setting the
spider ``slot_callback`` attribute to a function with the following signature::

    def slot_callback(request):
        ...
        return slot
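One possible ``slot_callback`` keeps every URL of a host in the same slot by hashing the hostname. This is only a sketch under stated assumptions: ``NUMBER_OF_SLOTS`` is a hand-set constant meant to match ``HS_NUMBER_OF_SLOTS``, the slot is returned as a string on the assumption that slot names are strings, and the hashing scheme is not claimed to match the middleware's default.

```python
import hashlib
from types import SimpleNamespace
from urllib.parse import urlparse

NUMBER_OF_SLOTS = 8  # assumed to match the HS_NUMBER_OF_SLOTS setting


def slot_callback(request):
    """Map a request to a slot based on the hostname of its URL."""
    host = urlparse(request.url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    # Reduce the hash to one of NUMBER_OF_SLOTS slot names ("0" .. "7").
    return str(int(digest, 16) % NUMBER_OF_SLOTS)


# Stand-in objects with a `url` attribute, used here instead of real Requests.
slot_a = slot_callback(SimpleNamespace(url="https://example.com/a"))
slot_b = slot_callback(SimpleNamespace(url="https://example.com/b"))
```

Because the slot depends only on the hostname, ``slot_a`` and ``slot_b`` are equal, which keeps links from one site together in the frontier.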