https://github.com/internetarchive/scrapy-warcio
Support for writing WARC files with Scrapy
- Host: GitHub
- URL: https://github.com/internetarchive/scrapy-warcio
- Owner: internetarchive
- Created: 2019-12-11T02:45:48.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-12-21T00:49:02.000Z (over 5 years ago)
- Last Synced: 2025-07-10T13:05:11.335Z (6 days ago)
- Topics: python, scrapy, warc, web-archiving
- Language: Python
- Size: 31.3 KB
- Stars: 23
- Watchers: 19
- Forks: 6
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Scrapy Warcio
=============

A Web Archive
[WARC](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/)
I/O module for Scrapy

Install
-------

```shell
$ pip install scrapy-warcio
```

Usage
-----

1. Create a project and spider:
https://docs.scrapy.org/en/latest/intro/tutorial.html

```
$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> example.com
```

2. Copy the `settings.yml` distributed with `scrapy_warcio` and edit it
with your configuration settings:

```yaml
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000 # 10GB

collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~ # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
```

3. Export `SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml`
4. Add `WarcioDownloaderMiddleware` (distributed as `middlewares.py`)
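to your `<project>/<project>/middlewares.py`:

```python
import scrapy_warcio


class WarcioDownloaderMiddleware:

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # Store a WARC-Date value in the request metadata.
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        # Returning None lets Scrapy continue processing the request.
        return None

    def process_response(self, request, response, spider):
        # Write the response and its request to the WARC.
        self.warcio.write(response, request)
        # Pass the response on unchanged.
        return response
```

5. Enable `WarcioDownloaderMiddleware` in `<project>/<project>/settings.py`: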
```
DOWNLOADER_MIDDLEWARES = {
    '<project>.middlewares.WarcioDownloaderMiddleware': 543,
}
```

6. Validate your WARCs with `internetarchive/warctools`:
```shell
$ warcvalid WARC.warc.gz
```

7. Upload your WARC(s) to your favorite web archive!
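
Before uploading, a quick smoke test can catch an empty or truncated file. The sketch below is not part of `scrapy_warcio` and is no substitute for `warcvalid`. It uses only the Python standard library, and `looks_like_warc` is a hypothetical helper name. It relies on the WARC format rule that every record begins with a version line such as `WARC/1.0`, and it checks only the first record.

```python
import gzip
from pathlib import Path


def looks_like_warc(path):
    """Return True if the file's first line is a WARC version line.

    Accepts plain .warc files and gzip-compressed .warc.gz files.
    Only the first record is inspected; this is a smoke test only.
    """
    path = Path(path)

    # Gzip streams start with the magic bytes 0x1f 0x8b.
    with path.open("rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"

    # Pick the matching opener and read the first line of the payload.
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as stream:
        first_line = stream.readline()

    # A WARC record begins with a version line such as "WARC/1.0".
    return first_line.startswith(b"WARC/")
```

Calling `looks_like_warc("WARC.warc.gz")` on a file written by the middleware should return `True`. A `False` result suggests the crawl produced no records or the file is not a WARC at all.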
Help
----

```shell
$ pydoc scrapy_warcio
```

or
```python
>>> import scrapy_warcio
>>> help(scrapy_warcio)
```

TODO
----

Making this a Scrapy extension may make it more useful:
https://docs.scrapy.org/en/latest/topics/extensions.html

@internetarchive