{"id":17296389,"url":"https://github.com/xc2/cacheable-crawlee","last_synced_at":"2025-08-31T15:16:23.340Z","repository":{"id":257811152,"uuid":"864181509","full_name":"xc2/cacheable-crawlee","owner":"xc2","description":"Add http cache support to crawlee's HttpCrawler based crawlers 🗂️✨","archived":false,"fork":false,"pushed_at":"2024-10-08T15:13:47.000Z","size":84,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-14T12:04:09.912Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://npm.im/cacheable-crawlee","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xc2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-27T16:39:10.000Z","updated_at":"2025-01-06T14:22:07.000Z","dependencies_parsed_at":"2024-10-05T06:54:57.020Z","dependency_job_id":"d1627b24-8a63-43f1-978e-7dd87af885e4","html_url":"https://github.com/xc2/cacheable-crawlee","commit_stats":null,"previous_names":["xc2/cacheable-crawlee"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xc2%2Fcacheable-crawlee","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xc2%2Fcacheable-crawlee/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xc2%2Fcacheable-crawlee/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xc2%2Fcacheable-crawlee/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/xc2","download_url":"https://codeload.github.com/xc2/cacheable-crawlee/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248877986,"owners_count":21176243,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T11:12:41.778Z","updated_at":"2025-04-14T12:04:16.146Z","avatar_url":"https://github.com/xc2.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cacheable-crawlee\n\n[![npm version](https://badge.fury.io/js/cacheable-crawlee.svg)](https://npmjs.com/package/cacheable-crawlee)\n\n## Why\n\nUsing crawlee's default crawling flow, which only saves whether a task is complete and the processed data, comes with some pain points:\n\n1. If you need more details from a response, you have to re-crawl the same task even if the content hasn’t changed.\n2. When you add new tasks based on the responses of other tasks, you also have to re-crawl those for the same reason.\n\nHTTP requests are the most expensive part of web crawling, while other processes are usually inexpensive.\n\nInstead of completely skipping a task after it's done, we can cache the response. When re-running the crawler, we will process all tasks but only send requests for those that aren't cached.\n\nFor more discussion, see [Why cacheable-crawlee?](https://tldr.ws/why-cacheable-crawlee)\n\n## What\n\n`cacheable-crawlee` is a Node.js package that provides caching capabilities for the [crawlee](https://crawlee.dev/)'s `HttpCrawler` based crawlers. 
It allows you to cache HTTP responses to improve the efficiency and speed of your web crawling tasks.\n\nThe cache policy follows [RFC 7234](https://tools.ietf.org/html/rfc7234) and [RFC 5861](https://tools.ietf.org/html/rfc5861) standards, and is implemented by [http-cache-semantics](https://www.npmjs.com/package/http-cache-semantics).\n\n## Usage\n\n### Installation\n\nSimply install the package using npm/yarn/pnpm:\n\n```sh\nnpm install cacheable-crawlee\n```\n\n### Basic Usage\n\nTo use `cacheable-crawlee`, you need to \"install\" it into your `HttpCrawler` instance. Here's an example:\n\n```typescript\nimport { HttpCrawler } from 'crawlee';\nimport { CacheableCrawlee } from 'cacheable-crawlee';\n\n// Your crawler. Can be HttpCrawler, CheerioCrawler or any other HttpCrawler based crawler\nconst crawler = new HttpCrawler({\n  // ...\n});\n\n// Add this line to enable caching\nCacheableCrawlee({/* CacheableOptions */}).install(crawler);\n\nawait crawler.run([\n  // ...\n]);\n```\n\nResponses will be cached into a `KeyValueStore` named `http-cache` by default.\n\n## Configuration\n\nPass options shaped as `CacheableOptions` to the `CacheableCrawlee` constructor to configure the default caching behavior.\n\n### Cache Policy Related Options\n\n- `cacheControl`: The cache control header to use. Default is `max-stale=3600`. This option greatly influences the caching strategy.\n- `publicOnly`: By default, `cacheable-crawlee` also caches responses with `Cache-Control: private`. Set this option to `true` to only cache public responses.\n- `cacheHeuristic`/`immutableMinTimeToLive`/`ignoreCargoCult`: These options are passed to the `http-cache-semantics` library. Refer to the [documentation](https://www.npmjs.com/package/http-cache-semantics) for more information.\n\n### Storage Related Options\n\n- `storeName`: The name of crawlee's `KeyValueStore` to use. 
Default is `http-cache`.\n- `cache`: If you want to use a custom cache store instead of the `KeyValueStore`, you can provide a store adapter instance of [keyv](https://www.npmjs.com/package/keyv) here.\n\n## Advanced Usage\n\n\n### Cache as much as possible\n\nTo cache as much as possible, you can use the following options:\n\n```typescript\nCacheableCrawlee({ cacheControl: 'max-stale' }).install(crawler);\n```\n\n### Request specific cache control\n\nYou can also override the default cache options on a per-request basis:\n\n```typescript\nimport { makeCacheable } from 'cacheable-crawlee';\nimport { Request } from 'crawlee';\n\nconst task1 = new Request({ url: 'https://example.com/foo' });\nconst task2 = new Request({ url: 'https://example.com/bar' });\nmakeCacheable(task1, { cacheControl: 'no-store' }); // disable caching for task1\nmakeCacheable(task2, { storeName: 'hello' }); // use a different store for task2\n\nawait crawler.run([task1, task2]); \n```\n\n### Use redis as cache store\n\nYou can use `redis` as the cache store by providing a `@keyv/redis` instance:\n\n```typescript\nimport KeyvRedis from '@keyv/redis';\nimport { CacheableCrawlee } from 'cacheable-crawlee';\n\nconst redis = new KeyvRedis('127.0.0.1:6379');\n\nCacheableCrawlee({ cache: redis }).install(crawler);\n```\n\n## License\n\n[MIT ©️ xc2](https://tldr.ws/mitxc2)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxc2%2Fcacheable-crawlee","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxc2%2Fcacheable-crawlee","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxc2%2Fcacheable-crawlee/lists"}