https://github.com/webrecorder/browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

Host: GitHub
URL: https://github.com/webrecorder/browsertrix-crawler
Owner: webrecorder
License: agpl-3.0
Created: 2020-11-02T04:37:14.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2026-02-05T20:45:29.000Z (6 months ago)
Last Synced: 2026-02-06T05:17:40.012Z (6 months ago)
Topics: crawler, crawling, wacz, warc, web-archiving, web-crawler, webrecorder
Language: TypeScript
Homepage: https://crawler.docs.browsertrix.com
Size: 54 MB
Stars: 967
Watchers: 22
Forks: 127
Open Issues: 127
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- License: LICENSE
- Notice: NOTICE

Awesome Lists containing this project

awesome-datahoarding - Browsertrix Crawler - A Chromium based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container (Download utilities / Web Archiving)
webarchiving-awesome-graph - Browsertrix Crawler - A Chromium based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. 💽 ⭐ 1053 👀 23 (Tools & Software / Acquisition)
awesome-web-archiving - Browsertrix Crawler - A Chromium based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. *(Stable)* (Tools & Software / Acquisition)
awesome-digital-preservation - Browsertrix Crawler - High-fidelity browser-based crawler in a Docker container. (Web Archiving / Crawlers & Capture)
awesome-starred - webrecorder/browsertrix-crawler - Run a high-fidelity browser-based web archiving crawler in a single Docker container (TypeScript)
awesome-rainmana - webrecorder/browsertrix-crawler - Run a high-fidelity browser-based web archiving crawler in a single Docker container (TypeScript)

README

# Browsertrix Crawler 1.x

Browsertrix Crawler is a standalone, high-fidelity, browser-based crawling system, designed to run a complex, customizable crawl in a single Docker container. Browsertrix Crawler uses [Puppeteer](https://github.com/puppeteer/puppeteer) to control one or more [Brave Browser](https://brave.com/) browser windows in parallel. Data is captured through the [Chrome DevTools Protocol (CDP)](https://chromedevtools.github.io/devtools-protocol/) in the browser.
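
To make the CDP capture idea concrete, here is a minimal sketch. It is not Browsertrix Crawler's actual implementation. It assumes a session object with `send` and `on` methods, similar in shape to the CDP sessions Puppeteer exposes. The names `CdpSessionLike`, `ResponseRecord`, and `recordResponses` are hypothetical and introduced only for illustration. `Network.enable` and `Network.responseReceived` are the CDP command and event for observing responses.

```typescript
// Hypothetical structural type: only the two methods this sketch needs.
interface CdpSessionLike {
  send(method: string, params?: object): Promise<unknown>;
  on(event: string, handler: (params: any) => void): unknown;
}

// Subset of the fields carried by the CDP `Network.responseReceived` event.
interface ResponseRecord {
  requestId: string;
  url: string;
  status: number;
  mimeType: string;
}

// Hypothetical helper: registers a listener that appends response metadata
// to `sink`, then enables the CDP Network domain so events start flowing.
function recordResponses(
  session: CdpSessionLike,
  sink: ResponseRecord[],
): Promise<unknown> {
  session.on("Network.responseReceived", (event) => {
    sink.push({
      requestId: event.requestId,
      url: event.response.url,
      status: event.response.status,
      mimeType: event.response.mimeType,
    });
  });
  return session.send("Network.enable");
}
```

With Puppeteer, such a session would typically be created per page, so each parallel browser window would get its own recorder. A real archiving crawler also has to capture response bodies and request data, which this sketch leaves out.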

For information on how to use and develop Browsertrix Crawler, see the hosted [Browsertrix Crawler documentation](https://crawler.docs.browsertrix.com).

For information on how to build the docs locally, see the [docs page](docs/docs/develop/docs.md).

## Support

Initial support for the 0.x version of Browsertrix Crawler was provided by [Kiwix](https://kiwix.org/). The initial functionality for Browsertrix Crawler was developed to support the [zimit](https://github.com/openzim/zimit) project in a collaboration between Webrecorder and Kiwix, and this project has been split off from Zimit into a core component of Webrecorder.

Additional support for Browsertrix Crawler, including for the development of the 0.4.x version, has been provided by [Portico](https://www.portico.org/).

## License

[AGPLv3](https://www.gnu.org/licenses/agpl-3.0) or later. See [LICENSE](LICENSE) for more details.
