https://github.com/xc2/cacheable-crawlee

Add http cache support to crawlee's HttpCrawler based crawlers 🗂️✨
https://github.com/xc2/cacheable-crawlee

Last synced: 12 days ago
JSON representation

Add http cache support to crawlee's HttpCrawler based crawlers 🗂️✨

Host: GitHub
URL: https://github.com/xc2/cacheable-crawlee
Owner: xc2
License: mit
Created: 2024-09-27T16:39:10.000Z (7 months ago)
Default Branch: main
Last Pushed: 2024-10-08T15:13:47.000Z (7 months ago)
Last Synced: 2025-04-14T12:04:09.912Z (12 days ago)
Language: TypeScript
Homepage: https://npm.im/cacheable-crawlee
Size: 82 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # cacheable-crawlee

[![npm version](https://badge.fury.io/js/cacheable-crawlee.svg)](https://npmjs.com/package/cacheable-crawlee)

## Why

Using crawlee's default crawling flow, which only saves whether a task is complete and the processed data, comes with some pain points:

1. If you need more details from a response, you have to re-crawl the same task even if the content hasn’t changed.

2. When you add new tasks based on the responses of other tasks, you also have to re-crawl those for the same reason.

HTTP requests are the most expensive part of web crawling, while other processes are usually inexpensive.

Instead of completely skipping a task after it's done, we can cache the response. When re-running the crawler, we will process all tasks but only send requests for those that aren't cached.

For more discussion, see [Why cacheable-crawlee?](https://tldr.ws/why-cacheable-crawlee)

## What

`cacheable-crawlee` is a Node.js package that provides caching capabilities for the [crawlee](https://crawlee.dev/)'s `HttpCrawler` based crawlers. It allows you to cache HTTP responses to improve the efficiency and speed of your web crawling tasks.

The cache policy follows [RFC 7234](https://tools.ietf.org/html/rfc7234) and [RFC 5861](https://tools.ietf.org/html/rfc5861) standards, and is implemented by [http-cache-semantics](https://www.npmjs.com/package/http-cache-semantics).

## Usage

### Installation

Simply install the package using npm/yarn/pnpm:

```sh

npm install cacheable-crawlee

```

### Basic Usage

To use `cacheable-crawlee`, you need to "install" it into your `HttpCrawler` instance. Here's an example:

```typescript

import { HttpCrawler } from 'crawlee';

import { CacheableCrawlee } from 'cacheable-crawlee';

// Your crawler. Can be HttpCrawler, CheerioCrawler or any other HttpCrawler based crawler

const crawler = new HttpCrawler({

  // ...

});

// Add this line to enable caching

CacheableCrawlee({/* CacheableOptions */}).install(crawler);

await crawler.run([

  // ...

]);

```

Responses will be cached into a `KeyValueStore` named `http-cahce` by default.

## Configuration

Pass options shaped as `CacheableOptions` to the `CacheableCrawlee` constructor to configure the default caching behavior.

### Cache Policy Related Options

- `cacheControl`: The cache control header to use. Default is `max-stale=3600`. This option greatly influences the caching strategy.

- `publicOnly`: By default, `cacheable-crawlee` also caches responses with `Cache-Control: private`. Set this option to `true` to only cache public responses.

- `cacheHeuristic`/`immutableMinTimeToLive`/`ignoreCargoCult`:  These options are passed to the `http-cache-semantics` library. Refer to the [documentation](https://www.npmjs.com/package/http-cache-semantics) for more information.

 

### Storage Related Options

- `storeName`: The name of crawlee's `KeyValueStore` to use. Default is `http-cache`.

- `cache`: If you want to use a custom cache store instead of the `KeyValueStore`, you can provide a store adapter instance of [keyv](https://www.npmjs.com/package/keyv) here.

## Advanced Usage

### Cache as much as possible

To cache as much as possible, you can use the following options:

```typescript

CacheableCrawlee({ cacheControl: 'max-stale' }).install(crawler);

```

### Request specific cache control

You can also override the default cache options on a per-request basis:

```typescript

import { makeCacheable } from 'cacheable-crawlee';

import { Request } from 'crawlee';

const task1 = new Request({ url: 'https://example.com/foo' });

const task2 = new Request({ url: 'https://example.com/bar' });

makeCacheable(task1, { cacheControl: 'no-store' }); // disable caching for task1

makeCacheable(task2, { storeName: 'hello' }); // use a different store for task2

await crawler.run([task1, task2]); 

```

### Use redis as cache store

You can use `redis` as the cache store by providing a `@keyv/redis` instance:

```typescript

import KeyvRedis from '@keyv/redis';

import { CacheableCrawlee } from 'cacheable-crawlee';

const redis = new KeyvRedis('127.0.0.1:6379');

CacheableCrawlee({ cache: redis }).install(crawler);

```

## License

[MIT ©️ xc2](https://tldr.ws/mitxc2)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/xc2/cacheable-crawlee

Awesome Lists containing this project

README