https://github.com/rybesh/capture-urls

Archive a list of URLs using the Wayback Machine
save-page-now wayback-machine web-archiving

Last synced: over 1 year ago

Host: GitHub
URL: https://github.com/rybesh/capture-urls
Owner: rybesh
License: unlicense
Created: 2021-05-15T01:08:11.000Z (about 5 years ago)
Default Branch: main
Last Pushed: 2024-12-06T16:27:48.000Z (over 1 year ago)
Last Synced: 2025-03-20T13:11:47.114Z (over 1 year ago)
Topics: save-page-now, wayback-machine, web-archiving
Language: Python
Homepage:
Size: 39.1 KB
Stars: 5
Watchers: 2
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Archive a list of URLs using the Wayback Machine

**You need Python 3.10 or later to run this script.**

This script uses the [Save Page Now 2 Public API](https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit).
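
For orientation, a capture request against this API might be built roughly as follows. This is a minimal sketch based on the linked API document, not the code in `capture-urls.py`: the function name `build_capture_request` is hypothetical, and the endpoint and the `LOW` authorization scheme are assumptions that should be checked against that document.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint; verify against the Save Page Now 2 API document.
SAVE_ENDPOINT = "https://web.archive.org/save"


def build_capture_request(url: str, access_key: str, secret_key: str) -> Request:
    """Build (but do not send) a POST request asking for a capture of `url`."""
    # The URL to capture travels as a form-encoded body field.
    body = urlencode({"url": url}).encode("ascii")
    headers = {
        "Accept": "application/json",
        # Assumed scheme: the S3-like keys joined by a colon after "LOW".
        "Authorization": f"LOW {access_key}:{secret_key}",
    }
    return Request(SAVE_ENDPOINT, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    request = build_capture_request(
        "https://example.com/", "your access key", "your secret key"
    )
    print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` would actually submit the capture. The sketch stops short of that so it can be run without credentials or network access.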

To use it:

1. Clone or [download](https://github.com/rybesh/capture-urls/archive/refs/heads/main.zip "download repository as a zip file") and unzip this repository.

1. Install the required Python libraries. Assuming you cloned or
unzipped this repository to the directory `path/to/capture-urls/`:

```
cd path/to/capture-urls/
make
```

1. Go to https://archive.org/account/s3.php and get your S3-like API keys.

1. In `path/to/capture-urls/`, create a file called `secret.py` with
the following contents:

```python
ACCESS_KEY = 'your access key'
SECRET_KEY = 'your secret key'
```

(Use the actual values of your access key and secret key, not `your
access key` and `your secret key`.)

1. *Optionally* edit `config.py` to your liking.

1. Archive your URLs:
```
cat urls.txt | ./capture-urls.py > archived-urls.txt
```
`urls.txt` should contain a list of URLs to be archived, one on each line.

1. Archiving URLs can take a long time. You can interrupt the process
with `Ctrl-C`. This will create a file called `progress.json` with
the state of the archiving process so far. If you start the process
again, it will pick up where it left off. You can add new URLs to
`urls.txt` before you restart the process.

1. When it finishes running, you should have a list of the archived
URLs in `archived-urls.txt`.
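
The interrupt-and-resume behaviour described in the steps above can be sketched as follows. This is an illustration of the general idea only, not the script's actual implementation: the function names, the `capture` callback, and the layout of `progress.json` (a plain mapping from each input URL to its archived URL) are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def load_progress(path: Path) -> dict[str, str]:
    """Return the saved url -> archived-url mapping, or an empty one."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def pending_urls(urls: list[str], progress: dict[str, str]) -> list[str]:
    """Keep input order; drop blanks, duplicates, and already-archived URLs."""
    seen = set(progress)
    todo = []
    for url in (u.strip() for u in urls):
        if url and url not in seen:
            seen.add(url)
            todo.append(url)
    return todo


def run(
    urls: list[str], progress_path: Path, capture: Callable[[str], str]
) -> dict[str, str]:
    """Archive every pending URL, saving state if interrupted with Ctrl-C."""
    progress = load_progress(progress_path)
    try:
        for url in pending_urls(urls, progress):
            progress[url] = capture(url)  # hypothetical: returns the archived URL
    except KeyboardInterrupt:
        # Persist what has been done so far so a later run can skip it.
        progress_path.write_text(json.dumps(progress, indent=2))
    return progress
```

Because pending URLs are recomputed from the input on every run, URLs appended to the input between runs are picked up automatically, while those already recorded in the progress file are skipped.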
