Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/peterk/warcworker
A dockerized, queued high fidelity web archiver based on Squidwarc
https://github.com/peterk/warcworker
archiving high-fidelity-preservation preservation webarchives webarchiving
Last synced: about 2 months ago
JSON representation
A dockerized, queued high fidelity web archiver based on Squidwarc
- Host: GitHub
- URL: https://github.com/peterk/warcworker
- Owner: peterk
- License: gpl-3.0
- Created: 2018-07-21T08:31:18.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2020-07-19T20:33:15.000Z (almost 4 years ago)
- Last Synced: 2024-02-04T04:14:46.529Z (5 months ago)
- Topics: archiving, high-fidelity-preservation, preservation, webarchives, webarchiving
- Language: Python
- Homepage:
- Size: 175 KB
- Stars: 52
- Watchers: 6
- Forks: 9
- Open Issues: 7
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-web-archiving - Warcworker - An open source, dockerized, queued, high fidelity web archiver based on Squidwarc with a simple web GUI. *(Stable)* (Tools & Software / Acquisition)
- awesome-stars - peterk/warcworker - A dockerized, queued high fidelity web archiver based on Squidwarc (Python)
- awesome-web-archiving - Warcworker - An open source, dockerized, queued, high fidelity web archiver based on Squidwarc with a simple web GUI. (Stable) (Tools & Software / Acquisition)
README
# Warcworker
A dockerized queued high fidelity web archiver based on [Squidwarc](https://github.com/N0taN3rd/Squidwarc) (Chrome headless), RabbitMQ and a small web frontend. Using the scripting abilities of Squidwarc, you can add scripts that should be run for a specific job (e.g. src-set enrichment, comment expansion etc). Please note that Warcworker is not a crawler (it will not crawl a website automatically - you have to use other software to build lists of URL:s to send to Warcworker).## Installation
Copy .env_example to .env. Update information in .env.Start with `docker-compose up -d --scale worker=3` (wait a minute for everything to start up)
## Archiving and playback
Open web front end at http://0.0.0.0:5555 to enter URLs for archiving. You can prefill the text fields with the `url` and `description` request parameters. Play back the resulting WARC-files with [Webrecorder Player](https://github.com/webrecorder/webrecorderplayer-electron)## Using
### Bookmarklet
Add a bookmarklet to your browser with the following link:`javascript:window.open('http://0.0.0.0:5555?url='+encodeURIComponent(location.href) + '&description=' + encodeURIComponent(document.title));window.focus();`
Now you have two-click web archiving from your browser.
### Command line
To use from the command line with curl:`curl -d "scripts=srcset&scripts=scroll_everything&url=https://www.peterkrantz.com/" -X POST http://0.0.0.0:5555/process/`
### Archivenow handler
To use from [archivenow](https://github.com/oduwsdl/archivenow) add a handler file `handlers/ww_handler.py` like this:```python
import requests
import jsonclass WW_handler(object):
def __init__(self):
self.enabled = True
self.name = 'Warcworker'
self.api_required = Falsedef push(self, uri_org, p_args=[]):
msg = ''
try:
# add scripts in the order you want them to be run on the page
payload = {"url":uri_org, "scripts":["scroll_everything", "srcset"]}r = requests.post('http://0.0.0.0:5555/process/', timeout=120,
data=payload,
allow_redirects=True)r.raise_for_status()
return "%s added to queue" % uri_orgexcept Exception as e:
msg = "Error (" + self.name+ "): " + str(e)
return msg
```