https://github.com/pjsier/scrapy-wayback-middleware

Scrapy middleware for submitting URLs to the Internet Archive Wayback Machine

Host: GitHub
URL: https://github.com/pjsier/scrapy-wayback-middleware
Owner: pjsier
License: mit
Created: 2019-02-25T16:05:54.000Z (over 7 years ago)
Default Branch: main
Last Pushed: 2022-08-01T07:35:47.000Z (almost 4 years ago)
Last Synced: 2024-11-02T08:51:43.339Z (over 1 year ago)
Topics: archiving, hacktoberfest, python, scrapy, wayback-machine
Language: Python
Homepage: https://pypi.org/project/scrapy-wayback-middleware/
Size: 23.4 KB
Stars: 10
Watchers: 4
Forks: 3
Open Issues: 2
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

# Scrapy Wayback Middleware

[![Build status](https://github.com/pjsier/scrapy-wayback-middleware/workflows/CI/badge.svg)](https://github.com/pjsier/scrapy-wayback-middleware/actions)

Middleware for submitting all scraped response URLs to the [Internet Archive Wayback Machine](https://archive.org/web/) for archival.

## Installation

```bash
pip install scrapy-wayback-middleware
```

## Setup

Add `scrapy_wayback_middleware.WaybackMiddleware` to your project's `SPIDER_MIDDLEWARES` setting. By default, the middleware makes `GET` requests to `web.archive.org/save/{URL}`. If the `WAYBACK_MIDDLEWARE_POST` setting is `True`, it makes `POST` requests to [`pragma.archivelab.org`](https://archive.readme.io/docs/creating-a-snapshot) instead.
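As a minimal sketch, a project's `settings.py` might enable the middleware as shown here. The priority value `543` is an arbitrary assumption for illustration, not a value recommended by this project; pick one that fits your existing middleware order.

```python
# settings.py (sketch)

# Register the middleware. The dict value is the middleware's order;
# 543 is an assumed placeholder, not a documented recommendation.
SPIDER_MIDDLEWARES = {
    "scrapy_wayback_middleware.WaybackMiddleware": 543,
}

# Optional: switch from GET requests against web.archive.org/save/{URL}
# to POST requests against pragma.archivelab.org.
WAYBACK_MIDDLEWARE_POST = False
```

Leaving `WAYBACK_MIDDLEWARE_POST` unset should behave the same as `False`, since `GET` is described as the default.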

## Configuration

To customize behavior, subclass `WaybackMiddleware` and override `get_item_urls` to pull additional links to archive from individual items, or override `handle_wayback` to change how responses from the Wayback Machine are handled. Set `WAYBACK_MIDDLEWARE_POST` to `True` to send `POST` requests instead of `GET` requests.
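A subclass that overrides `get_item_urls` could look roughly like the sketch below. The signature `get_item_urls(self, item)` returning a list of URL strings is an assumption, as are the class name `DocumentLinksMiddleware` and the `links` item field; check the package source for the actual contract before relying on it.

```python
try:
    from scrapy_wayback_middleware import WaybackMiddleware
except ImportError:
    # Stand-in so the sketch can be run without the package installed.
    WaybackMiddleware = object


class DocumentLinksMiddleware(WaybackMiddleware):
    """Hypothetical subclass that also archives links stored on each item."""

    def get_item_urls(self, item):
        # Assumption: items are dict-like and keep extra links under "links".
        links = item.get("links", [])
        # Keep only absolute HTTP(S) URLs, dropping empty or relative values.
        return [
            url
            for url in links
            if isinstance(url, str) and url.startswith(("http://", "https://"))
        ]
```

To use a subclass like this, reference its import path in `SPIDER_MIDDLEWARES` in place of `scrapy_wayback_middleware.WaybackMiddleware`.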

### Duplicate Filtering

To avoid sending duplicate requests when `WAYBACK_MIDDLEWARE_POST` is `False`, either include `web.archive.org` in your spider's `allowed_domains` property (if specified) or disable `scrapy.spidermiddlewares.offsite.OffsiteMiddleware` in your settings.
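The second option can be sketched in `settings.py` as follows. Setting a middleware's value to `None` is Scrapy's standard way to disable it. The priority `543` is an assumed placeholder, and the `OffsiteMiddleware` path is the one named above, which may differ in newer Scrapy releases.

```python
# settings.py (sketch)

SPIDER_MIDDLEWARES = {
    # Assumed placeholder priority for the Wayback middleware.
    "scrapy_wayback_middleware.WaybackMiddleware": 543,
    # A value of None disables the offsite filter for the whole project.
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
}

# Alternative, per spider: keep the offsite filter and instead list the
# archive host alongside the site being scraped, for example
#   allowed_domains = ["example.com", "web.archive.org"]
# where "example.com" is a hypothetical target domain.
```

Disabling the offsite filter affects every request the spider makes, so adding `web.archive.org` to `allowed_domains` is the narrower change when a spider already defines that property.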

### Rate Limits

Neither endpoint returns headers indicating specific rate limits, but the `GET` endpoint at `web.archive.org/save` has a rate limit of 25 requests/minute, resetting each minute. To handle this, the middleware waits 60 seconds whenever it receives a 429 status code.
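For a sense of scale, the 25 requests/minute limit works out to an average of one request every 2.4 seconds. The sketch below shows that arithmetic and one possible way to apply it through Scrapy's `DOWNLOAD_DELAY` setting. Whether this is appropriate for a given project is an assumption: `DOWNLOAD_DELAY` applies to each domain the spider contacts, so it also slows requests to the site being scraped.

```python
# settings.py (sketch)

# Limit for web.archive.org/save, as stated in this README.
REQUESTS_PER_MINUTE = 25

# Average spacing between requests needed to stay under the limit.
MIN_SECONDS_BETWEEN_REQUESTS = 60 / REQUESTS_PER_MINUTE  # 2.4 seconds

# Scrapy's built-in delay setting, shown as one optional way to apply it.
DOWNLOAD_DELAY = MIN_SECONDS_BETWEEN_REQUESTS
```

Scrapy randomizes the delay by default (`RANDOMIZE_DOWNLOAD_DELAY`), so short bursts can still exceed the limit; the middleware's 60-second wait on a 429 response remains the fallback.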
