https://github.com/atomotic/heritrix-docker
docker container with heritrix crawler
- Host: GitHub
- URL: https://github.com/atomotic/heritrix-docker
- Owner: atomotic
- Created: 2013-07-28T14:23:16.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2014-01-15T14:09:23.000Z (over 11 years ago)
- Last Synced: 2025-01-05T02:24:58.999Z (4 months ago)
- Size: 137 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
heritrix-docker
===============

docker container with #webarchiving crawler **heritrix-3.2.0**

### run
- build

        docker build -t atomotic/heritrix github.com/atomotic/heritrix-docker

- create a persistent job directory on the host

        mkdir jobs

- run the container on host:8443

        docker run -d -p 8443:8443 -v $(pwd)/jobs:/opt/heritrix-3.2.0/jobs atomotic/heritrix heritrix-start.sh

- access the web interface

        open https://localhost:8443

    * user **heritrix**
    * password **heritrix**

#### TODO
- improve the basic job crawler-beans.cxml (specify external seeds.txt file)
- add utilities to deal with warc files
    * https://github.com/internetarchive/warctools
    * https://github.com/internetarchive/CDX-Writer
    * https://github.com/alard/warc-proxy
    * wget-lua from archiveteam
    * https://github.com/alard/warctozip
    * https://github.com/ukwa/warc-explorer
    * https://github.com/martinsbalodis/warc-content
    * https://github.com/ianmilligan1/Historian-WARC-1
    * https://github.com/alard/megawarc
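
Until one of those utilities is added, a rough way to see what a crawl has written is to scan the `jobs` directory mounted from the host. The sketch below is not part of this repository: `list_warcs` is a hypothetical helper name, and it assumes the crawler's output ends up as `*.warc` or `*.warc.gz` files somewhere below that directory. The record count is only an estimate, because it counts lines that begin with a WARC version marker and a payload could contain such a line too.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with heritrix-docker.
# Prints "<approx. record count><TAB><path>" for each WARC file below a directory.
list_warcs() {
    jobs_dir="${1:-jobs}"    # defaults to the ./jobs directory created in the run steps
    find "$jobs_dir" -type f \( -name '*.warc' -o -name '*.warc.gz' \) | sort |
    while IFS= read -r f; do
        case "$f" in
            # gzip -dc also handles files made of several concatenated gzip members
            *.gz) n=$(gzip -dc "$f" | grep -a -c '^WARC/1\.[01]' || true) ;;
            *)    n=$(grep -a -c '^WARC/1\.[01]' "$f" || true) ;;
        esac
        printf '%s\t%s\n' "$n" "$f"
    done
}
```

Run `list_warcs jobs` on the host from the directory where `jobs` was created; file names with embedded newlines are not handled.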