https://github.com/exurd/txt2warc
A text file to WARC pipeline for grab-site-docker.
7-zip archivebot archiveteam docker grab-site python text text-file url urls warc
- Host: GitHub
- URL: https://github.com/exurd/txt2warc
- Owner: exurd
- License: unlicense
- Created: 2024-12-25T22:05:14.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-02-03T12:18:06.000Z (4 months ago)
- Last Synced: 2025-02-03T13:27:04.073Z (4 months ago)
- Topics: 7-zip, archivebot, archiveteam, docker, grab-site, python, text, text-file, url, urls, warc
- Language: Python
- Homepage:
- Size: 20.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Support: SUPPORTED.md
README
Basically, it's a text file to WARC pipeline for grab-site (and technically [ArchiveBot](https://wiki.archiveteam.org/index.php/ArchiveBot)).
The prototype was coded on Windows and requires Python, 7-Zip, and Docker. It is untested on other platforms.
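As a quick sanity check before starting, the three requirements can be looked up on `PATH`. This is only a sketch: the executable names below (`python`/`python3`, `7z`/`7za`/`7zz`, `docker`) are assumptions about a typical install, and how txt2warc itself locates 7-Zip is not covered here. `find_tools` and `missing_tools` are hypothetical helper names, not part of this repo.

```python
import shutil

# Assumed executable names; adjust for your system. 7-Zip in particular
# may be installed under a different name or not be on PATH at all.
REQUIRED_TOOLS = {
    "Python": ("python", "python3"),
    "7-Zip": ("7z", "7za", "7zz"),
    "Docker": ("docker",),
}


def find_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Map each requirement to the first matching executable path, or None."""
    found = {}
    for label, candidates in tools.items():
        # Try each candidate name in order and keep the first one that resolves.
        found[label] = next((p for p in map(which, candidates) if p), None)
    return found


def missing_tools(found):
    """Return the labels of requirements that were not found."""
    return [label for label, path in found.items() if path is None]


if __name__ == "__main__":
    found = find_tools()
    for label, path in found.items():
        print(f"{label}: {path or 'not found on PATH'}")
    if missing_tools(found):
        print("Missing:", ", ".join(missing_tools(found)))
```

A tool reported as missing may still be installed somewhere outside `PATH`, so treat the result as a hint rather than a verdict.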
# Instructions
1. Download and install [Docker](https://www.docker.com).
2. Grab the Dockerfile from [Nold360/docker-grab-site](https://github.com/Nold360/docker-grab-site) and place it in a dedicated folder (e.g. `D:\grab-site-data`, `/home/user/grab-site-data/`).
   1. This will become the data folder for the Docker containers, where the WARCs will be saved. It's recommended to use a path with no spaces.
3. Build the image with `docker build -t grab-site .` (the Docker image is around 500 MB).
   1. If you are on an ARM system (including Apple Silicon), it is recommended to add `--platform=linux/amd64` to all of the Docker commands you run, to avoid [issues with wget's WARC creation](https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior#Can_I_run_the_Warrior_on_ARM_or_some_other_unusual_architecture?).
4. Spin the container up with `docker run -d --rm -p29000:29000 -v DATA_FOLDER:/data --name grab-site-container grab-site`
   1. Set `DATA_FOLDER` to the path of the data folder from step 2.
5. Create a text file containing the IDs you want the script to archive.
   1. To see what this program supports, see [SUPPORTED.md](./SUPPORTED.md).
6. Open a terminal in this repo directory.
7. Run `python . DATA_FOLDER TEXTFILE ITEM_TYPE`
   1. `DATA_FOLDER` is the data folder from step 2, `TEXTFILE` is the text file from step 5, and `ITEM_TYPE` is the type of the items listed in that text file.
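The commands from steps 3, 4, and 7 can be assembled programmatically, which makes it harder to forget the `--platform=linux/amd64` flag on ARM or to mistype the volume mapping. The following is a minimal sketch, not part of txt2warc: the function names and the `amd64` parameter are hypothetical, and `ITEM_TYPE` is left as a placeholder because the valid values are listed in SUPPORTED.md.

```python
import shlex


def docker_build_cmd(tag="grab-site", amd64=False):
    """Step 3: build the image from the Dockerfile in the current directory."""
    cmd = ["docker", "build"]
    if amd64:
        # Recommended on ARM / Apple Silicon (see step 3.1).
        cmd.append("--platform=linux/amd64")
    return cmd + ["-t", tag, "."]


def docker_run_cmd(data_folder, tag="grab-site",
                   name="grab-site-container", amd64=False):
    """Step 4: start the container with the data folder mounted at /data."""
    cmd = ["docker", "run", "-d", "--rm"]
    if amd64:
        cmd.append("--platform=linux/amd64")
    return cmd + ["-p29000:29000", "-v", f"{data_folder}:/data",
                  "--name", name, tag]


def txt2warc_cmd(data_folder, textfile, item_type, python="python"):
    """Step 7: invoke the script from the repo directory."""
    return [python, ".", str(data_folder), str(textfile), item_type]


if __name__ == "__main__":
    data_folder = "/home/user/grab-site-data"  # example path from step 2
    for cmd in (
        docker_build_cmd(),
        docker_run_cmd(data_folder),
        txt2warc_cmd(data_folder, "ids.txt", "ITEM_TYPE"),  # placeholders
    ):
        # shlex.join quotes for POSIX shells; quoting on Windows may differ.
        print(shlex.join(cmd))
```

Each function returns an argument list, so it can be passed directly to `subprocess.run(cmd, check=True)` without shell quoting concerns. Run the build command from the folder containing the Dockerfile and the last command from the repo directory.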