https://github.com/q-m/scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.

Host: GitHub
URL: https://github.com/q-m/scrapy-webarchive
Owner: q-m
Created: 2024-10-02T09:21:08.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2025-02-28T13:55:31.000Z (over 1 year ago)
Last Synced: 2025-04-24T06:48:49.504Z (about 1 year ago)
Topics: scrapy, wacz, warc, webarchive, webarchive-data-scraping
Language: Python
Homepage: http://developers.thequestionmark.org/scrapy-webarchive/
Size: 10.8 MB
Stars: 5
Watchers: 1
Forks: 0
Open Issues: 5
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md

README

# Scrapy Webarchive

[![Docs](https://github.com/q-m/scrapy-webarchive/actions/workflows/docs.yml/badge.svg)](https://github.com/q-m/scrapy-webarchive/actions/workflows/docs.yml)

Scrapy Webarchive is a plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.

## Features

* Save web crawls in WACZ format, with support for multiple storage backends (local and cloud).
* Crawl against existing WACZ archives.
* Integrate seamlessly with Scrapy’s spider request and response cycle.
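
To make the WACZ format mentioned above more concrete, here is a minimal sketch that inspects such an archive with Python's standard library only. It relies on the general WACZ layout (a ZIP package holding WARC files under `archive/` and metadata in `datapackage.json`) and does not use this plugin's API. The function name `summarize_wacz` and the in-memory sample archive are hypothetical and for illustration only.

```python
import io
import json
import zipfile


def summarize_wacz(source):
    """Return a small summary of a WACZ archive (a path or file-like object)."""
    with zipfile.ZipFile(source) as zf:
        names = zf.namelist()
        # In the WACZ layout, WARC payloads are stored under archive/.
        warc_files = [
            name
            for name in names
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz"))
        ]
        # datapackage.json carries the archive metadata when it is present.
        metadata = {}
        if "datapackage.json" in names:
            metadata = json.loads(zf.read("datapackage.json"))
    return {"warc_files": warc_files, "metadata": metadata}


# Hypothetical stand-in archive built in memory, so the sketch runs
# without a real crawl output on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("datapackage.json", json.dumps({"profile": "data-package"}))
    zf.writestr("archive/example.warc.gz", b"")
buffer.seek(0)

summary = summarize_wacz(buffer)
print(summary)
```

The same function should also accept a filesystem path to a `.wacz` file produced by a crawl, since `zipfile.ZipFile` takes either a path or a file-like object.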

## Compatibility

* Python 3.7, 3.8, 3.9, 3.10, 3.11 and 3.12

## Documentation

Documentation is available online at [developers.thequestionmark.org/scrapy-webarchive/](https://developers.thequestionmark.org/scrapy-webarchive/).
