https://github.com/n0tan3rd/memgatorbulkdownload
- Host: GitHub
- URL: https://github.com/n0tan3rd/memgatorbulkdownload
- Owner: N0taN3rd
- License: mit
- Created: 2018-04-16T04:23:29.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-04-18T02:16:26.000Z (about 7 years ago)
- Last Synced: 2025-01-23T05:28:31.298Z (5 months ago)
- Topics: memento, memento-protocol, memento-rfc, memgator, timemap, timemaps, web-archiving
- Language: Python
- Size: 13.7 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Memgator Bulk TimeMap Downloader
Have you ever needed to download 100, or even 1 million, TimeMaps using [oduwsdl/memgator](https://github.com/oduwsdl/memgator), with the caveat that it must be done in a timely manner?
If so, you are in luck, because this project has you covered.
# Requirements
**Requires Python 3.** Be sure to install the dependencies first:
- ```[sudo] pip install -r requirements.txt```
You also need a running instance of [oduwsdl/MemGator](https://github.com/oduwsdl/memgator).
If you do not have one, you can get one at [oduwsdl/MemGator/releases](https://github.com/oduwsdl/MemGator/releases).
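
Before starting a bulk run, it can help to confirm that the MemGator instance is answering requests. The following is a minimal sketch and is not part of this project: `timemap_url` and `memgator_is_up` are hypothetical helper names, and the `/timemap/{format}/{url}` path layout is an assumption based on the default `--memurl` value (`http://localhost:1208/timemap/json`) and MemGator's own documentation.

```python
import urllib.error
import urllib.request

# Assumption: MemGator is listening on its default local address.
DEFAULT_BASE = "http://localhost:1208"


def timemap_url(url, fmt="json", base=DEFAULT_BASE):
    """Build the TimeMap request URL for `url` in the given format."""
    if fmt not in ("link", "json", "cdxj"):
        raise ValueError("format must be link, json, or cdxj")
    return "{}/timemap/{}/{}".format(base.rstrip("/"), fmt, url)


def memgator_is_up(base=DEFAULT_BASE, timeout=5):
    """Return True if something answers HTTP at `base`, False otherwise."""
    try:
        urllib.request.urlopen(base, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False
```

For example, `timemap_url("http://example.com", fmt="link")` gives the address one would request for a link-format TimeMap, and `memgator_is_up()` can be called once before queuing any work.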
# Usage
#### Basic usage
```
$ python download.py -m {MGURL} {FORMAT2} -d {DUMDIR} -u {LIST}
# MGURL => http://localhost:1208
# FORMAT => link|json|cdxj
# FORMAT2 => (-l, --link)|(-j, --json)|(-c, --cdxj)
# DUMDIR => Path to directory where timemaps will be dumped
# LIST => Path to URL list
```
#### Full Usage
```
$ python download.py --help
usage: download [-h] [-m MEMURL] [-w WORKERS] [-r REQUESTS] [-d DUMP] -u URLS
                [-k KEY] [-j | -l | -c]

Bulk download TimeMaps using a local memgator instance

optional arguments:
  -h, --help            show this help message and exit
  -m MEMURL, --memurl MEMURL
                        URL for running memgator instance. Defaults to
                        http://localhost:1208/timemap/json
  -w WORKERS, --workers WORKERS
                        Max number of worker processes spawned. Defaults to 5
  -r REQUESTS, --requests REQUESTS
                        How many requests should be queued per chunk. Defaults
                        to 10
  -d DUMP, --dump DUMP  Directory to dump the TimeMaps in. Defaults to
                        /timemaps
  -u URLS, --urls URLS  Path to file (.txt, .csv, .json) containing list of
                        URLs. File type detected by considering extension. If
                        .csv must supply -k so we know where to get the
                        url
  -k KEY, --key KEY     The csv key for the urls
  -j, --json            Download TimeMaps in json format. Default format
  -l, --link            Download TimeMaps in link format
  -c, --cdxj            Download TimeMaps in cdxj format
```
#### URL List Format
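
The `-u` help text says the file type is detected from the extension. A minimal sketch of that kind of dispatch follows; it is not taken from `download.py`. `load_urls` is a hypothetical name, the CSV file is assumed to have a header row that contains the key, and the JSON file is assumed to hold a top-level array of URL strings.

```python
import csv
import json
import os


def load_urls(path, key=None):
    """Return the list of URLs in `path`, choosing a parser by extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".txt":
        # One URL per line; blank lines are skipped.
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    if ext == ".csv":
        if key is None:
            raise ValueError("a .csv URL list needs a column key (-k)")
        # Assumption: the first row is a header naming the columns.
        with open(path, newline="", encoding="utf-8") as f:
            return [row[key] for row in csv.DictReader(f)]
    if ext == ".json":
        # Assumption: the file is a JSON array of URL strings.
        with open(path, encoding="utf-8") as f:
            return list(json.load(f))
    raise ValueError("unsupported URL list type: " + ext)
```

Under those assumptions, `load_urls("urls.csv", key="url")` would return the values of the `url` column, which mirrors passing `-u urls.csv -k url` on the command line.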
- **.txt**: 1 URL per line
- **.csv**: Requires -k or --key {KEY} argument. _KEY_ is the csv column containing the URL
- **.json**: List of URLs
# License
MIT