{"id":13585588,"url":"https://github.com/peterk/warcworker","last_synced_at":"2026-03-14T20:35:40.578Z","repository":{"id":66470429,"uuid":"141795011","full_name":"peterk/warcworker","owner":"peterk","description":"A dockerized, queued high fidelity web archiver based on Squidwarc","archived":false,"fork":false,"pushed_at":"2024-07-09T07:58:44.000Z","size":204,"stargazers_count":58,"open_issues_count":6,"forks_count":9,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-26T06:11:17.916Z","etag":null,"topics":["archiving","high-fidelity-preservation","preservation","webarchives","webarchiving"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/peterk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-07-21T08:31:18.000Z","updated_at":"2025-02-26T19:08:25.000Z","dependencies_parsed_at":"2024-11-06T03:33:10.549Z","dependency_job_id":"bf5e4551-3bf8-4499-b5f3-129578bcd41c","html_url":"https://github.com/peterk/warcworker","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterk%2Fwarcworker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterk%2Fwarcworker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterk%2Fwarcworker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterk%2Fwarcworker/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/peterk","download_url":"https://codeload.github.com/peterk/warcworker/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248558134,"owners_count":21124224,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archiving","high-fidelity-preservation","preservation","webarchives","webarchiving"],"created_at":"2024-08-01T15:05:01.865Z","updated_at":"2026-03-14T20:35:35.556Z","avatar_url":"https://github.com/peterk.png","language":"Python","funding_links":[],"categories":["Tools \u0026 Software","Python"],"sub_categories":["Acquisition"],"readme":"# Warcworker\nA dockerized, queued, high-fidelity web archiver based on [Squidwarc](https://github.com/N0taN3rd/Squidwarc) (Chrome headless), RabbitMQ and a small web frontend. Using the scripting abilities of Squidwarc, you can add scripts that should be run for a specific job (e.g. srcset enrichment, comment expansion, etc.). Please note that Warcworker is not a crawler: it will not crawl a website automatically, so you have to use other software to build lists of URLs to send to Warcworker.\n\n\u003cimg src=\"https://user-images.githubusercontent.com/19284/49601413-151dab80-f986-11e8-90d6-5a46e4593fb2.png\" alt=\"screenshot of Warcworker\" width=\"50%\" /\u003e\n\n## Installation\nCopy `.env_example` to `.env` and update the settings in `.env`.\n\nStart with `docker-compose up -d --scale worker=3` (wait a minute for everything to start up).\n\n## Archiving and playback\nOpen the web front end at http://0.0.0.0:5555 to enter URLs for archiving. 
You can prefill the text fields with the `url` and `description` request parameters. Play back the resulting WARC files with [Webrecorder Player](https://github.com/webrecorder/webrecorderplayer-electron).\n\n## Using\n### Bookmarklet\nAdd a bookmarklet to your browser with the following link:\n\n`javascript:window.open('http://0.0.0.0:5555?url='+encodeURIComponent(location.href) + '\u0026description=' + encodeURIComponent(document.title));window.focus();`\n\nNow you have two-click web archiving from your browser.\n\n### Command line\nTo use from the command line with curl:\n\n`curl -d \"scripts=srcset\u0026scripts=scroll_everything\u0026url=https://www.peterkrantz.com/\" -X POST http://0.0.0.0:5555/process/`\n\n### Archivenow handler\nTo use from [archivenow](https://github.com/oduwsdl/archivenow), add a handler file `handlers/ww_handler.py` like this:\n\n```python\nimport requests\nimport json\n\nclass WW_handler(object):\n\n    def __init__(self):\n        self.enabled = True\n        self.name = 'Warcworker'\n        self.api_required = False\n\n    def push(self, uri_org, p_args=[]):\n        msg = ''\n        try:\n            # Add scripts in the order you want them to be run on the page\n            payload = {\"url\":uri_org, \"scripts\":[\"scroll_everything\", \"srcset\"]}\n\n            r = requests.post('http://0.0.0.0:5555/process/', timeout=120,\n                    data=payload,\n                    allow_redirects=True)\n\n            r.raise_for_status()\n            return \"%s added to queue\" % uri_org\n\n        except Exception as e:\n            msg = \"Error (\" + self.name + \"): \" + str(e)\n        return msg\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeterk%2Fwarcworker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpeterk%2Fwarcworker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeterk%2Fwarcworker/lists"}