https://github.com/pjsier/scrapy-wayback-middleware
Scrapy middleware for submitting URLs to the Internet Archive Wayback Machine
- Host: GitHub
- URL: https://github.com/pjsier/scrapy-wayback-middleware
- Owner: pjsier
- License: mit
- Created: 2019-02-25T16:05:54.000Z (over 7 years ago)
- Default Branch: main
- Last Pushed: 2022-08-01T07:35:47.000Z (almost 4 years ago)
- Last Synced: 2024-11-02T08:51:43.339Z (over 1 year ago)
- Topics: archiving, hacktoberfest, python, scrapy, wayback-machine
- Language: Python
- Homepage: https://pypi.org/project/scrapy-wayback-middleware/
- Size: 23.4 KB
- Stars: 10
- Watchers: 4
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# Scrapy Wayback Middleware
Middleware for submitting all scraped response URLs to the [Internet Archive Wayback Machine](https://archive.org/web/) for archival.
## Installation
```bash
pip install scrapy-wayback-middleware
```
## Setup
Add `scrapy_wayback_middleware.WaybackMiddleware` to your project's `SPIDER_MIDDLEWARES` setting. By default, the middleware makes `GET` requests to `web.archive.org/save/{URL}`. If the `WAYBACK_MIDDLEWARE_POST` setting is `True`, it makes `POST` requests to [`pragma.archivelab.org`](https://archive.readme.io/docs/creating-a-snapshot) instead.
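As a minimal sketch of the setup described above, a project's `settings.py` might contain something like the following. The priority value `543` is an arbitrary assumption, not a value taken from this project, so pick whatever order suits your other spider middlewares.

```python
# settings.py (sketch)

# Register the middleware. The priority 543 is an assumed, arbitrary value.
SPIDER_MIDDLEWARES = {
    "scrapy_wayback_middleware.WaybackMiddleware": 543,
}

# False keeps the default GET requests to web.archive.org/save/{URL};
# True switches to POST requests to pragma.archivelab.org.
WAYBACK_MIDDLEWARE_POST = False
```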
## Configuration
To customize the middleware, subclass `WaybackMiddleware` and override `get_item_urls` to pull additional links to archive from individual items, or override `handle_wayback` to change how responses from the Wayback Machine are handled. Set `WAYBACK_MIDDLEWARE_POST` to `True` to switch from `GET` to `POST` requests.
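One way to approach a `get_item_urls` override is sketched below as a standalone mixin, so the example runs without the package installed. It assumes the method receives a single dict-like item and returns a list of URL strings, which is an assumption about the signature rather than something stated here. The class name `ItemLinksMixin` and the item fields `source` and `links` are hypothetical.

```python
class ItemLinksMixin:
    """Collect extra URLs to archive from a scraped item (illustrative only)."""

    # Hypothetical item fields that may hold URLs.
    link_fields = ("source", "links")

    def get_item_urls(self, item):
        # Assumed signature: one dict-like item in, list of URL strings out.
        urls = []
        for field in self.link_fields:
            value = item.get(field)
            if isinstance(value, str):
                urls.append(value)
            elif isinstance(value, (list, tuple)):
                for entry in value:
                    # Entries may be plain strings or dicts with an "href" key.
                    href = entry.get("href") if isinstance(entry, dict) else entry
                    if isinstance(href, str):
                        urls.append(href)
        # Keep only http(s) URLs and drop duplicates, preserving order.
        web_urls = (u for u in urls if u.startswith(("http://", "https://")))
        return list(dict.fromkeys(web_urls))
```

To use it, you could define `class ArchiveLinksMiddleware(ItemLinksMixin, WaybackMiddleware): pass` in your project and register that class in `SPIDER_MIDDLEWARES` in place of `WaybackMiddleware`.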
### Duplicate Filtering
In order to avoid sending duplicate requests with `WAYBACK_MIDDLEWARE_POST` set to `False`, you'll need to either include `web.archive.org` in your spider's `allowed_domains` property (if specified) or disable `scrapy.spidermiddlewares.offsite.OffsiteMiddleware` in your settings.
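The two options above might look roughly like this. The domain `example.com` and the priority `543` are placeholders, and the `OffsiteMiddleware` path is the one named above. Setting a middleware's value to `None` is Scrapy's standard way of disabling it. If your Scrapy version keeps the offsite middleware elsewhere, adjust the path accordingly.

```python
# Option 1: on the spider class, allow the Wayback Machine domain
# alongside the site being scraped ("example.com" is a placeholder).
allowed_domains = ["example.com", "web.archive.org"]

# Option 2: in settings.py, disable the offsite middleware entirely.
SPIDER_MIDDLEWARES = {
    "scrapy_wayback_middleware.WaybackMiddleware": 543,  # assumed priority
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
}
```

Only one of the two options is needed. Option 1 keeps offsite filtering active for every other domain.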
### Rate Limits
Neither endpoint returns headers indicating specific rate limits, but the `GET` endpoint at `web.archive.org/save` is limited to 25 requests per minute, with the count resetting each minute. To handle this, the middleware waits 60 seconds whenever it receives a 429 status code.
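To get a feel for what the 25 requests/minute limit means for a crawl, here is a small back-of-the-envelope sketch. The function names are made up for illustration and are not part of the middleware. The results are lower bounds that ignore network latency and any 60-second waits triggered by 429 responses.

```python
import math

# Limit for the GET endpoint, as stated above.
WAYBACK_GET_LIMIT_PER_MINUTE = 25


def min_seconds_between_requests(limit_per_minute=WAYBACK_GET_LIMIT_PER_MINUTE):
    """Average spacing that keeps a steady stream of requests under the limit."""
    return 60 / limit_per_minute


def minute_windows_needed(url_count, limit_per_minute=WAYBACK_GET_LIMIT_PER_MINUTE):
    """Number of one-minute windows needed to submit url_count URLs."""
    return math.ceil(url_count / limit_per_minute)
```

For example, a crawl that yields 1,000 response URLs needs at least 40 one-minute windows to submit them all through the `GET` endpoint.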