https://github.com/q-m/scrapy-webarchive

# Scrapy Webarchive

[![Docs](https://github.com/q-m/scrapy-webarchive/actions/workflows/docs.yml/badge.svg)](https://github.com/q-m/scrapy-webarchive/actions/workflows/docs.yml)

Scrapy Webarchive is a plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.

## Features

* Save web crawls in WACZ format, with multiple storage backends supported (local and cloud).
* Crawl against existing archives in WACZ format.
* Integrate seamlessly with Scrapy’s spider request and response cycle.
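
To make the WACZ format mentioned above more concrete: a WACZ file is a ZIP package that bundles WARC data with index and metadata files. The following is a minimal sketch for inspecting one, assuming the conventional layout (WARC files under `archive/`, metadata in `datapackage.json`). `summarize_wacz` is a hypothetical helper written for illustration; it is not part of Scrapy Webarchive and uses only the Python standard library.

```python
import json
import zipfile


def summarize_wacz(path):
    """Return the WARC members and package metadata found in a WACZ file."""
    with zipfile.ZipFile(path) as wacz:
        names = wacz.namelist()
        # Assumption: WARC data lives under archive/, as in the usual WACZ layout.
        warcs = [
            name
            for name in names
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz"))
        ]
        # Treat datapackage.json as optional so partial archives still load.
        metadata = {}
        if "datapackage.json" in names:
            metadata = json.loads(wacz.read("datapackage.json"))
    return {"warcs": sorted(warcs), "metadata": metadata}
```

Calling `summarize_wacz("crawl.wacz")` (the file name is a placeholder) returns a dictionary listing the WARC members and the parsed package metadata, which can be a quick sanity check on an exported crawl.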

## Compatibility

* Python 3.7, 3.8, 3.9, 3.10, 3.11 and 3.12

## Documentation

Documentation is available online at [developers.thequestionmark.org/scrapy-webarchive/](https://developers.thequestionmark.org/scrapy-webarchive/).