https://github.com/q-m/scrapy-webarchive
A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.
scrapy wacz warc webarchive webarchive-data-scraping
- Host: GitHub
- URL: https://github.com/q-m/scrapy-webarchive
- Owner: q-m
- Created: 2024-10-02T09:21:08.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-28T13:55:31.000Z (8 months ago)
- Last Synced: 2025-04-24T06:48:49.504Z (6 months ago)
- Topics: scrapy, wacz, warc, webarchive, webarchive-data-scraping
- Language: Python
- Homepage: http://developers.thequestionmark.org/scrapy-webarchive/
- Size: 10.8 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
README
# Scrapy Webarchive
Scrapy Webarchive is a plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.
## Features
* Save web crawls in WACZ format (multiple storage backends supported, both local and cloud).
* Crawl against WACZ format archives.
* Integrate seamlessly with Scrapy’s spider request and response cycle.
## Compatibility
* Python 3.7, 3.8, 3.9, 3.10, 3.11 and 3.12
## Documentation
Documentation is available online at [developers.thequestionmark.org/scrapy-webarchive/](https://developers.thequestionmark.org/scrapy-webarchive/)
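## Example: inspecting a WACZ archive
The WACZ format mentioned under Features is a ZIP container that bundles WARC data with an index, a page list, and a `datapackage.json` manifest. The sketch below uses only the Python standard library (not the scrapy-webarchive API) to list the members of such a container. The helper names `build_placeholder_wacz` and `list_wacz_members` are hypothetical, and the archive it builds holds empty placeholder files rather than real crawl data; for the plugin's actual settings and classes, see the documentation linked above.

```python
import io
import json
import zipfile
from typing import List


def build_placeholder_wacz() -> bytes:
    """Build an in-memory ZIP that mimics the typical WACZ directory layout.

    Assumption: the member contents are empty placeholders for illustration,
    so this is not a valid, replayable web archive.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        # Manifest describing the package contents (kept minimal here).
        zf.writestr(
            "datapackage.json",
            json.dumps({"profile": "data-package", "resources": []}),
        )
        # WARC records normally live under archive/.
        zf.writestr("archive/data.warc.gz", b"")
        # The index that maps URLs to offsets in the WARC files.
        zf.writestr("indexes/index.cdx", b"")
        # One JSON object per line describing the captured pages.
        zf.writestr("pages/pages.jsonl", "")
    return buffer.getvalue()


def list_wacz_members(wacz_bytes: bytes) -> List[str]:
    """Return the sorted member names of a WACZ (ZIP) archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(wacz_bytes)) as zf:
        return sorted(zf.namelist())


if __name__ == "__main__":
    for name in list_wacz_members(build_placeholder_wacz()):
        print(name)
```

To inspect a WACZ file produced by a real crawl, read it from disk in binary mode and pass the bytes to `list_wacz_members`.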