https://github.com/scrapy-plugins/scrapy-hcf

Scrapy spider middleware to use Scrapinghub's Hub Crawl Frontier as a backend for URLs
https://github.com/scrapy-plugins/scrapy-hcf

Last synced: 12 months ago
JSON representation

Scrapy spider middleware to use Scrapinghub's Hub Crawl Frontier as a backend for URLs

Host: GitHub
URL: https://github.com/scrapy-plugins/scrapy-hcf
Owner: scrapy-plugins
License: bsd-3-clause
Created: 2016-07-18T13:24:39.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2018-08-28T22:00:34.000Z (almost 8 years ago)
Last Synced: 2025-05-18T01:18:57.843Z (about 1 year ago)
Language: Python
Size: 10.7 KB
Stars: 4
Watchers: 5
Forks: 6
Open Issues: 2
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.rst
- License: LICENSE

Awesome Lists containing this project

README

          ==========

scrapy-hcf

==========

.. image:: https://travis-ci.org/scrapy-plugins/scrapy-hcf.svg?branch=master

    :target: https://travis-ci.org/scrapy-plugins/scrapy-hcf

.. image:: https://codecov.io/gh/scrapy-plugins/scrapy-hcf/branch/master/graph/badge.svg

  :target: https://codecov.io/gh/scrapy-plugins/scrapy-hcf

This Scrapy spider middleware uses the HCF backend from Scrapinghub's

Scrapy Cloud service to retrieve the new urls to crawl

and store back the links extracted.

Installation

============

Install scrapy-hcf using ``pip``::

    $ pip install scrapy-hcf

Configuration

=============

To activate this middleware it needs to be added to the ``SPIDER_MIDDLEWARES``

dict, i.e::

    SPIDER_MIDDLEWARES = {

        'scrapy_hcf.HcfMiddleware': 543,

    }

And the following settings need to be defined:

``HS_AUTH``

    Scrapy Cloud API key

``HS_PROJECTID``

    Scrapy Cloud project ID (not needed if the spider is ran on dash)

``HS_FRONTIER``

    Frontier name.

``HS_CONSUME_FROM_SLOT``

    Slot from where the spider will read new URLs.

Note that ``HS_FRONTIER`` and ``HS_CONSUME_FROM_SLOT`` can be overriden

from inside a spider using the spider attributes ``hs_frontier``

and ``hs_consume_from_slot`` respectively.

The following optional Scrapy settings can be defined:

``HS_ENDPOINT``

    URL to the API endpoint, i.e: http://localhost:8003.

    The default value is provided by the python-hubstorage package.

``HS_MAX_LINKS``

    Number of links to be read from the HCF, the default is 1000.

``HS_START_JOB_ENABLED``

    Enable whether to start a new job when the spider finishes.

    The default is ``False``

``HS_START_JOB_ON_REASON``

    This is a list of closing reasons,

    if the spider ends with any of these reasons a new job will be started

    for the same slot. The default is ``['finished']``

``HS_NUMBER_OF_SLOTS``

    This is the number of slots that the middleware will use to store the new links.

    The default is 8.

Usage

=====

The following keys can be defined in a Scrapy Request meta in order to control the behavior

of the HCF middleware:

``'use_hcf'``

    If set to ``True`` the request will be stored in the HCF.

``'hcf_params'``

    Dictionary of parameters to be stored in the HCF with the request fingerprint

    ``'qdata'``

        data to be stored along with the fingerprint in the request queue

    ``'fdata'``

        data to be stored along with the fingerprint in the fingerprint set

    ``'p'``

        Priority - lower priority numbers are returned first. The default is 0

The value of ``'qdata'`` parameter could be retrieved later using

``response.meta['hcf_params']['qdata']``.

The spider can override the default slot assignation function by setting the

spider ``slot_callback`` method to a function with the following signature::

       def slot_callback(request):

           ...

           return slot

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/scrapy-plugins/scrapy-hcf

Awesome Lists containing this project

README