Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cloudflare/queues-web-crawler
A web crawler built with Cloudflare Queues, Browser Rendering, and Workers KV.
- Host: GitHub
- URL: https://github.com/cloudflare/queues-web-crawler
- Owner: cloudflare
- License: other
- Created: 2023-06-14T15:18:27.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-10-01T14:32:29.000Z (about 2 months ago)
- Last Synced: 2024-10-03T21:41:18.592Z (about 1 month ago)
- Language: TypeScript
- Size: 133 KB
- Stars: 96
- Watchers: 5
- Forks: 5
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Queues Web Crawler Example
An example use case for [Queues](https://developers.cloudflare.com/queues/): a web crawler built on [Browser Rendering](https://developers.cloudflare.com/browser-rendering/) and Puppeteer. For each site it crawls, the crawler counts the links to Cloudflare.com and archives a screenshot to Workers KV.
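To make the link-counting step concrete, here is a minimal sketch of how links pointing at a given domain could be counted once the `href` values have been scraped from a page. The function name `countLinksTo` and the choice to count subdomains such as `developers.cloudflare.com` are assumptions made for illustration, not necessarily what this repository does.

```typescript
// Hypothetical helper: count how many hrefs point at `domain` or one of its
// subdomains. Relative links are resolved against `base`, the page's own URL.
function countLinksTo(
  hrefs: readonly string[],
  domain: string,
  base: string,
): number {
  let count = 0;
  for (const href of hrefs) {
    let host: string;
    try {
      host = new URL(href, base).hostname.toLowerCase();
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    // Match the domain itself or any subdomain, but not look-alike hosts
    // such as "notcloudflare.com".
    if (host === domain || host.endsWith(`.${domain}`)) {
      count++;
    }
  }
  return count;
}
```

In a Puppeteer-based crawler, the `hrefs` array would typically come from evaluating a selector such as `a[href]` in the loaded page.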
For this project, Queues helps batch sites to be crawled, which limits the overhead of opening and closing new Puppeteer instances. Because loading pages and scraping links takes some time, Queues makes it possible to respond to inbound crawl requests instantly while providing peace of mind that the long-running crawl will be triggered. Queues also helps handle bursty traffic and reliability issues!
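The batching idea above can be sketched as a consumer that opens one browser per batch rather than one per URL. This is an illustration under stated assumptions: the interfaces are minimal stand-ins for the Queues and Puppeteer types, the `{ url }` message shape is assumed, and `handleBatch`, `launch`, and `crawl` are hypothetical names rather than the repository's actual code.

```typescript
// Minimal stand-ins so the sketch is self-contained. Real Workers code would
// use the platform's own Queues types and a Puppeteer browser instead.
interface CrawlMessage { url: string }
interface QueueMessage<T> { body: T; ack(): void; retry(): void }
interface MessageBatch<T> { messages: readonly QueueMessage<T>[] }
interface BrowserLike { close(): Promise<void> }

// Hypothetical helper: process a whole batch with a single shared browser.
async function handleBatch<B extends BrowserLike>(
  batch: MessageBatch<CrawlMessage>,
  launch: () => Promise<B>,
  crawl: (browser: B, url: string) => Promise<void>,
): Promise<void> {
  const browser = await launch(); // opened once per batch, not once per URL
  try {
    for (const message of batch.messages) {
      try {
        await crawl(browser, message.body.url);
        message.ack(); // this URL is done
      } catch {
        message.retry(); // ask the queue to redeliver only this URL
      }
    }
  } finally {
    await browser.close(); // always release the browser, even after failures
  }
}
```

Acknowledging or retrying each message individually means one failing site does not force the rest of the batch to be crawled again.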
![Products used: Pages Functions, Queues, and Browser Rendering](./img/Products-Diagram.png)
## Development
This assumes you have access to the Browser Rendering feature - you can join the waitlist [here](https://www.cloudflare.com/lp/workers-browser-rendering-api).
First, fork this project. Install [Node.js](https://nodejs.org/en/download) and [Wrangler](https://developers.cloudflare.com/workers/wrangler/install-and-update/), and run `npm install`.
Then, to configure your project and deploy on Cloudflare Workers:
1. Go to the [Dash](https://dash.cloudflare.com) and click on Workers & Pages > Queues > Create queue. Enter a Queue name.
2. In the `pages` directory, run `wrangler pages deploy .` and enter a project name (`PROJECT_NAME`).
3. Go to the [Dash](https://dash.cloudflare.com) and click on Workers & Pages > Overview > `PROJECT_NAME` > Settings > Functions > Queue Producers bindings > Add binding.
4. Set the variable name to `CRAWLER_QUEUE` and select your queue as the Queue name. Click "Save".
5. In the Dash, click on Workers & Pages > KV > Create a namespace. Create one namespace called `crawler_screenshots` and one called `crawler_links`.
6. Create two KV namespace bindings. For the first, set the variable name to `CRAWLER_LINKS_KV` and select `crawler_links` as the KV namespace. For the second, set the variable name to `CRAWLER_SCREENSHOTS_KV` and select `crawler_screenshots` as the KV namespace.
7. In the `consumer` directory, update the `wrangler.toml` file with your new KV namespace IDs. Also update the `[[queues.consumers]]` name to the Queue you created.
8. In the `consumer` directory, run `wrangler deploy`.

Your Queues-powered web crawler will be live!
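To show what the `CRAWLER_QUEUE` producer binding from step 4 is for, here is a minimal sketch of the producer side: validate an inbound URL, enqueue it, and reply right away while the crawl happens later in the consumer. The `submitCrawl` function, the `Reply` shape, and the `{ url }` message body are assumptions for illustration; only the binding name comes from the steps above.

```typescript
// Minimal stand-in for the queue binding; a Queues producer binding exposes
// a `send` method for publishing a single message.
interface QueueLike<T> { send(body: T): Promise<void> }
interface Env { CRAWLER_QUEUE: QueueLike<{ url: string }> }
interface Reply { status: number; body: string }

// Hypothetical helper: enqueue a crawl request and answer immediately.
// A Pages Function would wrap the returned status and body in a Response.
async function submitCrawl(env: Env, rawUrl: string): Promise<Reply> {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return { status: 400, body: "Invalid URL" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { status: 400, body: "Only http and https URLs can be crawled" };
  }
  // Enqueueing is fast; the slow page load runs later in the consumer.
  await env.CRAWLER_QUEUE.send({ url: url.href });
  return { status: 202, body: `Queued ${url.href}` };
}
```

Returning `202 Accepted` signals that the request was received but the crawl itself has not finished yet.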