Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/webrecorder/browsertrix-crawler
Run a high-fidelity browser-based crawler in a single Docker container
crawler crawling wacz warc web-archiving web-crawler webrecorder
- Host: GitHub
- URL: https://github.com/webrecorder/browsertrix-crawler
- Owner: webrecorder
- License: agpl-3.0
- Created: 2020-11-02T04:37:14.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2024-05-22T17:25:44.000Z (8 months ago)
- Last Synced: 2024-05-22T17:32:55.119Z (8 months ago)
- Topics: crawler, crawling, wacz, warc, web-archiving, web-crawler, webrecorder
- Language: TypeScript
- Homepage: https://crawler.docs.browsertrix.com
- Size: 52.4 MB
- Stars: 554
- Watchers: 24
- Forks: 72
- Open Issues: 101
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- License: LICENSE
Awesome Lists containing this project
- awesome-digital-preservation - Browsertrix Crawler - run a high-fidelity browser-based crawler in a single Docker container (Web archiving / Crawlers)
- awesome-datahoarding - Browsertrix Crawler - a standalone browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container (Download utilities / Web Archiving)
- awesome-starred - webrecorder/browsertrix-crawler - Run a high-fidelity browser-based web archiving crawler in a single Docker container (TypeScript)
- awesome-rainmana - webrecorder/browsertrix-crawler - Run a high-fidelity browser-based web archiving crawler in a single Docker container (TypeScript)
README
# Browsertrix Crawler 1.x
Browsertrix Crawler is a standalone browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses [Puppeteer](https://github.com/puppeteer/puppeteer) to control one or more [Brave Browser](https://brave.com/) browser windows in parallel. Data is captured through the [Chrome DevTools Protocol (CDP)](https://chromedevtools.github.io/devtools-protocol/) in the browser.
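To illustrate the idea of several browser windows working in parallel, here is a minimal TypeScript sketch of a worker pool draining a shared URL queue. It is not taken from the Browsertrix Crawler source: `visitPage`, `crawlInParallel`, and `PageResult` are hypothetical names, and the page load is replaced by a short timer so the example runs without a browser. In the real crawler, each worker would correspond to a Puppeteer-controlled browser window, with capture happening over CDP.

```typescript
// Hypothetical sketch: each "worker" stands in for one browser window.
type PageResult = { url: string; worker: number };

// Stand-in for loading a page; a real crawler would drive a browser here.
async function visitPage(url: string, worker: number): Promise<PageResult> {
  await new Promise((resolve) => setTimeout(resolve, 10));
  return { url, worker };
}

// Run `numWorkers` workers concurrently until the shared queue is empty.
async function crawlInParallel(
  urls: string[],
  numWorkers: number,
): Promise<PageResult[]> {
  const queue = [...urls];
  const results: PageResult[] = [];

  async function runWorker(id: number): Promise<void> {
    while (true) {
      // Take the next URL; stop this worker once nothing is left.
      const url = queue.shift();
      if (url === undefined) return;
      results.push(await visitPage(url, id));
    }
  }

  await Promise.all(
    Array.from({ length: numWorkers }, (_, id) => runWorker(id)),
  );
  return results;
}
```

Calling `crawlInParallel(seedUrls, 2)` would visit every seed URL exactly once while keeping at most two page loads in flight, which is the same trade-off a window count makes in a browser-based crawl.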
For information on how to use and develop Browsertrix Crawler, see the hosted [Browsertrix Crawler documentation](https://crawler.docs.browsertrix.com).
For information on how to build the docs locally, see the [docs page](docs/docs/develop/docs.md).
## Support
Initial support for the 0.x version of Browsertrix Crawler was provided by [Kiwix](https://kiwix.org/). The initial functionality for Browsertrix Crawler was developed to support the [zimit](https://github.com/openzim/zimit) project in a collaboration between Webrecorder and Kiwix, and this project has been split off from Zimit into a core component of Webrecorder.
Additional support for Browsertrix Crawler, including for the development of the 0.4.x version, has been provided by [Portico](https://www.portico.org/).
## License
[AGPLv3](https://www.gnu.org/licenses/agpl-3.0) or later, see [LICENSE](LICENSE) for more details.