{"id":19899990,"url":"https://github.com/scrapy-plugins/scrapy-hcf","last_synced_at":"2025-06-23T17:33:12.419Z","repository":{"id":57464700,"uuid":"63605005","full_name":"scrapy-plugins/scrapy-hcf","owner":"scrapy-plugins","description":"Scrapy spider middleware to use Scrapinghub's Hub Crawl Frontier as a backend for URLs","archived":false,"fork":false,"pushed_at":"2018-08-28T22:00:34.000Z","size":11,"stargazers_count":4,"open_issues_count":2,"forks_count":6,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-05-18T01:18:57.843Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapy-plugins.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGES.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-07-18T13:24:39.000Z","updated_at":"2019-11-08T10:21:39.000Z","dependencies_parsed_at":"2022-08-31T03:10:53.255Z","dependency_job_id":null,"html_url":"https://github.com/scrapy-plugins/scrapy-hcf","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/scrapy-plugins/scrapy-hcf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-hcf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-hcf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-hcf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-hcf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapy-plugins","download_url
":"https://codeload.github.com/scrapy-plugins/scrapy-hcf/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-hcf/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260575657,"owners_count":23030557,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T20:10:44.965Z","updated_at":"2025-06-23T17:33:12.371Z","avatar_url":"https://github.com/scrapy-plugins.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"==========\nscrapy-hcf\n==========\n\n.. image:: https://travis-ci.org/scrapy-plugins/scrapy-hcf.svg?branch=master\n    :target: https://travis-ci.org/scrapy-plugins/scrapy-hcf\n\n.. 
image:: https://codecov.io/gh/scrapy-plugins/scrapy-hcf/branch/master/graph/badge.svg\n  :target: https://codecov.io/gh/scrapy-plugins/scrapy-hcf\n\n\nThis Scrapy spider middleware uses the HCF backend from Scrapinghub's\nScrapy Cloud service to retrieve the new URLs to crawl\nand store back the links extracted.\n\n\nInstallation\n============\n\nInstall scrapy-hcf using ``pip``::\n\n    $ pip install scrapy-hcf\n\n\nConfiguration\n=============\n\nTo activate this middleware it needs to be added to the ``SPIDER_MIDDLEWARES``\ndict, e.g.::\n\n    SPIDER_MIDDLEWARES = {\n        'scrapy_hcf.HcfMiddleware': 543,\n    }\n\nAnd the following settings need to be defined:\n\n``HS_AUTH``\n    Scrapy Cloud API key\n\n``HS_PROJECTID``\n    Scrapy Cloud project ID (not needed if the spider is run on dash)\n\n``HS_FRONTIER``\n    Frontier name.\n\n``HS_CONSUME_FROM_SLOT``\n    Slot from which the spider will read new URLs.\n\nNote that ``HS_FRONTIER`` and ``HS_CONSUME_FROM_SLOT`` can be overridden\nfrom inside a spider using the spider attributes ``hs_frontier``\nand ``hs_consume_from_slot`` respectively.\n\nThe following optional Scrapy settings can be defined:\n\n``HS_ENDPOINT``\n    URL to the API endpoint, e.g. http://localhost:8003.\n    The default value is provided by the python-hubstorage package.\n\n``HS_MAX_LINKS``\n    Number of links to be read from the HCF. The default is 1000.\n\n``HS_START_JOB_ENABLED``\n    Whether to start a new job when the spider finishes.\n    The default is ``False``\n\n``HS_START_JOB_ON_REASON``\n    A list of closing reasons;\n    if the spider ends with any of these reasons, a new job will be started\n    for the same slot. 
The default is ``['finished']``\n\n``HS_NUMBER_OF_SLOTS``\n    The number of slots that the middleware will use to store the new links.\n    The default is 8.\n\n\nUsage\n=====\n\nThe following keys can be defined in a Scrapy Request's meta in order to control the behavior\nof the HCF middleware:\n\n``'use_hcf'``\n    If set to ``True`` the request will be stored in the HCF.\n\n``'hcf_params'``\n    Dictionary of parameters to be stored in the HCF with the request fingerprint\n\n    ``'qdata'``\n        data to be stored along with the fingerprint in the request queue\n\n    ``'fdata'``\n        data to be stored along with the fingerprint in the fingerprint set\n\n    ``'p'``\n        Priority - lower priority numbers are returned first. The default is 0\n\nThe value of the ``'qdata'`` parameter can be retrieved later using\n``response.meta['hcf_params']['qdata']``.\n\nThe spider can override the default slot assignment function by setting the\nspider ``slot_callback`` method to a function with the following signature::\n\n       def slot_callback(request):\n           ...\n           return slot\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy-plugins%2Fscrapy-hcf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapy-plugins%2Fscrapy-hcf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy-plugins%2Fscrapy-hcf/lists"}