{"id":19900020,"url":"https://github.com/scrapy-plugins/scrapy-incremental","last_synced_at":"2026-05-13T14:37:23.913Z","repository":{"id":229881056,"uuid":"628000646","full_name":"scrapy-plugins/scrapy-incremental","owner":"scrapy-plugins","description":null,"archived":false,"fork":false,"pushed_at":"2023-05-01T22:08:13.000Z","size":11,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-01-11T21:10:06.383Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapy-plugins.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-04-14T17:15:01.000Z","updated_at":"2024-06-06T09:54:05.000Z","dependencies_parsed_at":"2024-03-26T19:23:34.893Z","dependency_job_id":"0cbea2a7-e0c2-4e09-96ca-0b6dc62a1f48","html_url":"https://github.com/scrapy-plugins/scrapy-incremental","commit_stats":null,"previous_names":["scrapy-plugins/scrapy-incremental"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-incremental","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-incremental/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-incremental/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-incremental/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/scrapy-plugins","download_url":"https://codeload.github.com/scrapy-plugins/scrapy-incremental/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241329412,"owners_count":19944984,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T20:10:51.693Z","updated_at":"2026-05-13T14:37:23.865Z","avatar_url":"https://github.com/scrapy-plugins.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# scrapy-incremental\n\nscrapy-incremental is a package that uses Zyte's [Collections API](https://docs.zyte.com/scrapy-cloud/reference/http/collections.html) to keep a persistent state of previously scraped items between jobs, allowing spiders to run incrementally and return only new items.\n\n## Getting Started\n\n### Installation\n\nYou can install scrapy-incremental using pip:\n\n```bash\npip install scrapy-incremental\n```\n\n### Settings\n\n- `SCRAPYCLOUD_API_KEY` **must be set** in your `settings.py`, otherwise the plugin will be disabled on start.\n- `SCRAPYCLOUD_PROJECT_ID` is the project's ID assigned by Scrapy Cloud. If the code is running on Scrapy Cloud **the package will infer the project ID** from the environment variables. 
However, if running in other environments, it must be set in `settings.py`, otherwise the plugin will be disabled on start.\n\n`scrapy-incremental` stores a reference to each scraped item in a Collections store named after each individual spider and checks against that reference to determine whether the item being processed was already scraped in previous jobs.\n\nThe **reference used by default** is the field `url` inside the item. If your Items don't contain a `url` field you can change the reference by setting `INCREMENTAL_PIPELINE_ITEM_UNIQUE_FIELD` to the field name you want. The new field **must be a field that contains unique data for that item**, otherwise the pipeline won't behave as expected.\n\n## Usage\n### Pipeline\n\nEnabling the `ScrapyIncrementalItemsPipeline` in your project's settings `ITEM_PIPELINES` is the simplest and most flexible way to add the incremental features to your spiders.\n\n```python\nITEM_PIPELINES = {\n    'scrapy_incremental.pipelines.ScrapyIncrementalItemsPipeline': 100,\n    #...\n}\n```\n\nThe pipeline will compare the unique field of each item against the references stored in the collection, and if it is already present the item will be dropped. At the end of the crawling process the collection's store will be updated with the newly scraped items.\n\n**The pipeline alone won't prevent requests to the pages of items scraped before.** To avoid unnecessary requests you will need to use the `ScrapyIncrementalItemsMixin`.\n\n### ScrapyIncrementalItemsMixin\n\nThe `ScrapyIncrementalItemsMixin` will enable both the `ScrapyIncrementalRequestFilterMiddleware` and the `ScrapyIncrementalItemsPipeline`. The purpose of `ScrapyIncrementalRequestFilterMiddleware` is to filter out requests to URLs that were scraped in previous jobs and were present in the Items. The use of the middleware is optional and only meant to avoid unnecessary requests.\n\nFor this to be effective the references kept in the collections **must be the URL of each item's page**. 
Therefore your items must either have a `url` field that contains the URL of the item's page or, if using a different field defined in `INCREMENTAL_PIPELINE_ITEM_UNIQUE_FIELD`, that field must meet the same criterion.\n\n```python\nfrom scrapy.spiders import Spider\nfrom scrapy_incremental import IncrementalItemsMixin\n\nclass MySpider(IncrementalItemsMixin, Spider):\n    name = 'myspider'\n    # ...\n```\n\n### Configuration\n\n#### Crawling previously seen Items / Temporarily disabling the incremental features.\n\nTo temporarily disable the incremental features of your spiders, pass the argument `full_crawl=True` when executing them.\n\n```bash\nscrapy crawl myspider -a full_crawl=True\n```\n\n#### INCREMENTAL_PIPELINE_BATCH_SIZE\n\nWhen the crawling process stops, the pipeline updates the Collection's store with the newly scraped items. This is done automatically in batches of 5000.\n\nIf you are facing issues in this process you may want to change the batch size, which can be done by setting `INCREMENTAL_PIPELINE_BATCH_SIZE` to an integer value.\n\n\n## License\n\nThis project is licensed under the BSD 3-Clause License. See the [LICENSE](LICENSE.txt) file for details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy-plugins%2Fscrapy-incremental","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapy-plugins%2Fscrapy-incremental","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy-plugins%2Fscrapy-incremental/lists"}