https://github.com/exurd/txt2warc

A text file to WARC pipeline for grab-site-docker.
7-zip archivebot archiveteam docker grab-site python text text-file url urls warc

Host: GitHub
URL: https://github.com/exurd/txt2warc
Owner: exurd
License: unlicense
Created: 2024-12-25T22:05:14.000Z (5 months ago)
Default Branch: main
Last Pushed: 2025-02-03T12:18:06.000Z (4 months ago)
Last Synced: 2025-02-03T13:27:04.073Z (4 months ago)
Topics: 7-zip, archivebot, archiveteam, docker, grab-site, python, text, text-file, url, urls, warc
Language: Python
Homepage:
Size: 20.5 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Support: SUPPORTED.md

README

Basically, it's a text file to WARC pipeline for grab-site (and technically [ArchiveBot](https://wiki.archiveteam.org/index.php/ArchiveBot)).

Prototype was coded on Windows and requires Python, 7-Zip & Docker. Untested on other platforms.

# Instructions

1. Download and install [Docker](https://www.docker.com).
2. Grab the Dockerfile from [Nold360/docker-grab-site](https://github.com/Nold360/docker-grab-site) and place it in a folder (e.g. `D:\grab-site-data`, `/home/user/grab-site-data/`).
   1. This will become the data folder for the Docker containers, where the WARCs will be saved. It's recommended to use a root directory with no spaces.
3. Build the image with `docker build -t grab-site .` (the Docker image is around 500 MB).
   1. If you are on an ARM system (including Apple Silicon), it is recommended to add `--platform=linux/amd64` to all of the docker commands you run, to avoid [issues with wget's WARC creation](https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior#Can_I_run_the_Warrior_on_ARM_or_some_other_unusual_architecture?).
4. Spin the container up with `docker run -d --rm -p29000:29000 -v DATA_FOLDER:/data --name grab-site-container grab-site`
   1. Set `DATA_FOLDER` to the path of the above directory.
5. Create a text file listing the IDs you want the script to archive.
   1. To see what this program supports, see [SUPPORTED.md](./SUPPORTED.md).
6. Open a terminal in this repo directory.
7. Run `python . DATA_FOLDER TEXTFILE ITEM_TYPE`
   1. `DATA_FOLDER` is the directory above, `TEXTFILE` is the text file and `ITEM_TYPE` is the type of the items in the text file.
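
The Docker commands from steps 3 and 4 can also be assembled in Python, which makes it harder to forget the `--platform=linux/amd64` flag on ARM machines. This is only a rough sketch and is not part of txt2warc: the `docker_commands` helper and its `force_amd64` parameter are hypothetical names, and the ARM check (looking for `arm64` or `aarch64` in `platform.machine()`) is an assumption that may not cover every system.

```python
import platform
import shlex


def docker_commands(data_folder, force_amd64=None):
    """Return the build and run commands from steps 3 and 4 as argument lists.

    Hypothetical helper for illustration; txt2warc does not ship this function.
    """
    if force_amd64 is None:
        # Assumption: these two names are how ARM machines usually report themselves.
        force_amd64 = platform.machine().lower() in {"arm64", "aarch64"}

    platform_flag = ["--platform=linux/amd64"] if force_amd64 else []

    build = ["docker", "build", *platform_flag, "-t", "grab-site", "."]
    run = [
        "docker", "run", *platform_flag,
        "-d", "--rm",
        "-p29000:29000",
        "-v", f"{data_folder}:/data",
        "--name", "grab-site-container",
        "grab-site",
    ]
    return build, run


if __name__ == "__main__":
    # Example data folder taken from step 2; replace it with your own path.
    for command in docker_commands("/home/user/grab-site-data"):
        print(shlex.join(command))
```

The script only prints the commands. To actually run them, each list could be passed to `subprocess.run(command, check=True)` from the folder that contains the Dockerfile.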
