https://github.com/rob-sve/iadownloader
#+TITLE: iadownloader
#+AUTHOR: rsvensson
#+EMAIL: [email protected]
#+DESCRIPTION: Auto-download files from Internet Archive
#+KEYWORDS: python, internet archive, download

** Summary
/iadownloader/ is a tool to automatically download files from the [[https://archive.org/][Internet Archive]]. It downloads all the files in an Internet Archive upload URL - either individually or as a compressed archive - to a configurable download location (defaults to the current working directory). It can also download complete collections, all uploads by a creator, and similar sets by parsing JSON or CSV files generated by Internet Archive's [[https://archive.org/advancedsearch.php][advanced search]] tool.

** Usage
#+BEGIN_SRC shell
iadownloader.py [-h] [-c] [-o OUTPUT_DIR] [-t THREADS] [-T] url

positional arguments:
  url                   URL or path to json/csv file

optional arguments:
  -h, --help            show this help message and exit
  -c, --compressed      Get the compressed archive download instead of the
                        individual files
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to output directory
  -t THREADS, --threads THREADS
                        Number of simultaneous downloads (maximum of 10)
  -T, --torrent         Only download the torrent file if available
#+END_SRC

The basic usage is to invoke iadownloader with a download URL.
#+BEGIN_SRC shell
python iadownloader.py https://archive.org/download/
#+END_SRC
This downloads all the files at that URL into the directory the script was invoked from.
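Under the hood, an upload's download URL is just an HTML directory listing, so the file list can be recovered by collecting the links on that page. A minimal sketch of that step, using only the Python standard library (iadownloader itself uses /requests/ and /lxml/, and the helper name here is hypothetical):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkParser(HTMLParser):
    """Collect href attributes from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_files(page_html, base_url):
    """Hypothetical helper: return absolute URLs for the files linked
    from a download-page listing, skipping the parent-directory link."""
    parser = _LinkParser()
    parser.feed(page_html)
    return [urljoin(base_url, h) for h in parser.hrefs
            if not h.startswith("..")]
```

For example, feeding it a listing that links =file1.txt= and =file2.pdf= yields the two absolute download URLs for those files.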

Optionally specify the download location:
#+BEGIN_SRC shell
python iadownloader.py -o /download/path https://archive.org/download/
#+END_SRC

To download the compressed archive of the upload instead, add the =-c= flag:
#+BEGIN_SRC shell
python iadownloader.py -c -o /download/path https://archive.org/download/
#+END_SRC

You can also specify the number of threads (up to 10):
#+BEGIN_SRC shell
python iadownloader.py -t 8 -o /download/path https://archive.org/download/
#+END_SRC
If not specified, it defaults to 4 threads.
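The thread handling can be pictured as a simple worker pool that is capped at 10. A sketch with =concurrent.futures= (the function names are illustrative, not iadownloader's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 10  # hard cap from the usage text above


def download_all(urls, fetch_one, threads=4):
    """Illustrative sketch: run fetch_one over urls with at most
    min(threads, MAX_THREADS) simultaneous downloads, defaulting to 4.
    Results come back in the same order as the input URLs."""
    workers = max(1, min(threads, MAX_THREADS))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))
```

=pool.map= preserves input order, so progress reporting (e.g. with /tqdm/) can simply wrap the URL list.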

*Don't confuse the "download URL" with individual file URLs.* Individual files are trivially downloaded through your web browser. This tool simplifies downloading all the files included in an upload on the Internet Archive, though even that can be done fairly easily through the web UI. Where iadownloader shines is its ability to download full collections automatically.

To download a whole collection, all files from a certain author, etc., go to Internet Archive's [[https://archive.org/advancedsearch.php][advanced search]] tool and follow these steps:
1. Scroll down to "Advanced Search returning JSON, XML, and more". In the "Query" field enter /collection:/ for collections, /creator:/ for creators, etc. In "Field to return" select "identifier" if it is not already selected. Choose an appropriate "Number of results" for the size of the collection.
2. Choose either JSON or CSV format. CSV is a bit more convenient since it prompts you to download the file immediately, whereas JSON opens a page with the data embedded in JavaScript. Save the .csv file somewhere. If you choose JSON, save the page and make sure to give it a .json extension rather than the suggested .js one.
3. Run iadownloader.py like this:
#+BEGIN_SRC sh
python iadownloader.py -o /download/path /path/to/csv-or-json-file
#+END_SRC
iadownloader will go through all the items in the collection and download them into the download path.
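Conceptually, this step just turns each =identifier= from the search export into a download URL. A sketch of that mapping for a CSV export (it assumes the "identifier" field chosen in step 1; the helper name is illustrative):

```python
import csv
import io


def urls_from_csv(csv_text):
    """Illustrative sketch: read an advanced-search CSV export whose
    'identifier' column holds item names, and build the corresponding
    download URLs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["https://archive.org/download/" + row["identifier"]
            for row in reader if row.get("identifier")]
```

Each resulting URL is then an ordinary upload URL that can be processed exactly like the single-upload case above.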

** Requirements
iadownloader uses /requests/, /lxml/, and /tqdm/ to do its magic. To install them, use the included requirements.txt:
#+BEGIN_SRC sh
pip install -r requirements.txt
#+END_SRC
Of course, you need Python and pip as well.