https://github.com/internetarchive/scrapy-warcio

Support for writing WARC files with Scrapy

python scrapy warc web-archiving


Host: GitHub
URL: https://github.com/internetarchive/scrapy-warcio
Owner: internetarchive
Created: 2019-12-11T02:45:48.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2019-12-21T00:49:02.000Z (over 5 years ago)
Last Synced: 2025-07-10T13:05:11.335Z (6 days ago)
Topics: python, scrapy, warc, web-archiving
Language: Python
Size: 31.3 KB
Stars: 23
Watchers: 19
Forks: 6
Open Issues: 0
Metadata Files:
- Readme: README.md


Scrapy Warcio
=============

A Web Archive [WARC](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/) I/O module for Scrapy

[![travis-ci](https://travis-ci.com/internetarchive/scrapy-warcio.svg?branch=master)](https://travis-ci.com/internetarchive/scrapy-warcio)

Install
-------

```shell
$ pip install scrapy-warcio
```

Usage
-----

1. Create a project and spider:
   https://docs.scrapy.org/en/latest/intro/tutorial.html

```shell
$ scrapy startproject 
$ cd 
$ scrapy genspider  example.com
```

2. Copy the `settings.yml` distributed with `scrapy_warcio` and edit it
   with your configuration settings:

```yaml
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000  # 10GB
collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~  # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
```

3. Export `SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml`

4. Add `WarcioDownloaderMiddleware` (distributed as `middlewares.py`)
   to your `//middlewares.py`:

```python
import scrapy_warcio


class WarcioDownloaderMiddleware:

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # Stamp the request with a WARC-Date before it is downloaded.
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        return None

    def process_response(self, request, response, spider):
        # Write the response and its request to the WARC.
        self.warcio.write(response, request)
        return response
```
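
The `process_request` hook above stamps each request with a `WARC-Date` before Scrapy downloads it. WARC 1.0 defines `WARC-Date` as a UTC timestamp of the form `YYYY-MM-DDThh:mm:ssZ`. As a rough illustration of that format only, here is a standard-library sketch. The `example_warc_date` helper is hypothetical and makes no claim about how `scrapy_warcio.warc_date()` is implemented or what precision it returns.

```python
from datetime import datetime, timezone


def example_warc_date(now=None):
    """Return a UTC timestamp shaped like a WARC 1.0 WARC-Date value.

    Hypothetical helper for illustration; not scrapy_warcio.warc_date().
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalize to UTC, then format as YYYY-MM-DDThh:mm:ssZ.
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


fixed = datetime(2019, 12, 11, 2, 45, 48, tzinfo=timezone.utc)
print(example_warc_date(fixed))  # 2019-12-11T02:45:48Z
```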

5. Enable `WarcioDownloaderMiddleware` in `//settings.py`:

```python
DOWNLOADER_MIDDLEWARES = {
    '.middlewares.WarcioDownloaderMiddleware': 543,
}
```

6. Validate your WARCs with `internetarchive/warctools`:

```shell
$ warcvalid WARC.warc.gz
```

7. Upload your WARC(s) to your favorite web archive!
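
Before uploading, a quick sanity check can catch a file that is empty or not gzipped at all. The sketch below uses only the standard library and checks that the first decompressed line is a WARC version line such as `WARC/1.0`. The `looks_like_warc` name is hypothetical, and this check is far weaker than `warcvalid`; it is meant as a cheap first pass, not a replacement.

```python
import gzip


def looks_like_warc(path):
    """Return True if a gzipped file starts with a WARC version line.

    A cheap sanity check only; it does not validate the records.
    """
    try:
        with gzip.open(path, "rb") as stream:
            first_line = stream.readline()
    except (OSError, EOFError):
        # Missing file, unreadable file, or not valid gzip data.
        return False
    return first_line.startswith(b"WARC/")
```

For example, `looks_like_warc("WARC.warc.gz")` should return `True` for a file that `warcvalid` accepts, and `False` for a plain-text or missing file.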

Help
----

```shell
$ pydoc scrapy_warcio
```

or

```python
>>> help(scrapy_warcio)
```

TODO
----

Making this a Scrapy extension may make it more useful:
https://docs.scrapy.org/en/latest/topics/extensions.html
