https://github.com/sul-dlss/wasapi-downloader
Java application to download WARCs from WASAPI
- Host: GitHub
- URL: https://github.com/sul-dlss/wasapi-downloader
- Owner: sul-dlss
- License: other
- Created: 2017-04-28T21:15:37.000Z (about 7 years ago)
- Default Branch: main
- Last Pushed: 2024-04-08T15:52:07.000Z (3 months ago)
- Last Synced: 2024-04-09T10:53:54.687Z (3 months ago)
- Topics: application, infrastructure, java
- Language: Java
- Homepage:
- Size: 587 KB
- Stars: 6
- Watchers: 22
- Forks: 4
- Open Issues: 25
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-web-archiving - wasapi-downloader - Java command line application to download crawls from WASAPI. *(Stable)* (Tools & Software / Utilities)
README
[![Build Status](https://travis-ci.com/sul-dlss/wasapi-downloader.svg?branch=main)](https://travis-ci.com/sul-dlss/wasapi-downloader)
[![Coverage Status](https://coveralls.io/repos/github/sul-dlss/wasapi-downloader/badge.svg?branch=main)](https://coveralls.io/github/sul-dlss/wasapi-downloader?branch=main)
[![GitHub version](https://badge.fury.io/gh/sul-dlss%2Fwasapi-downloader.svg)](https://badge.fury.io/gh/sul-dlss%2Fwasapi-downloader)
# wasapi-downloader
Java command line application to download crawls from WASAPI.
## Local Setup
You'll need the following prerequisites installed on your local computer:
- Java (7)
- Ruby (we use Capistrano for deployment)

The minimal sequence of steps to verify that you can work with the code is:
1. `git clone https://github.com/sul-dlss/wasapi-downloader.git`
2. `cd wasapi-downloader`
3. `./gradlew installDist` (compile and test the code and create a script to execute it)
4. `./build/install/wasapi-downloader/bin/wasapi-downloader --help` (explains usage)

An example invocation of the downloader:
```
./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 123 --crawlStartAfter 2014-03-14
```
### Configuration
This repository contains an example `config/settings.properties` file with dummy values for the required configuration settings. In order to successfully execute the Java application, you will need to override these default settings.
### Usage
#### Building
wasapi-downloader uses the [Gradle wrapper](https://docs.gradle.org/3.3/userguide/gradle_wrapper.html), so users don't have to install Gradle themselves. The first `./gradlew [task]` run downloads the Gradle version the project expects. If Gradle is also installed separately on your system, `gradle [task]` works as well.
wasapi-downloader is built using [Gradle](https://gradle.org/docs). To create a runnable installation with all needed jars and a shell script (cleaning out old builds first):
`./gradlew clean installDist`
List all available build tasks:
`./gradlew tasks`
#### Running
To run:
`./build/install/wasapi-downloader/bin/wasapi-downloader --help` (explains usage)
An example invocation of the downloader:
```
./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 123 --crawlStartAfter 2014-03-14
```
See more examples below in the [Production section](#stanford-production-use).
## Deployment
Capistrano is used for deployment to Stanford VMs.
1. On your laptop, run
`bundle`
to install the Ruby capistrano gems and other dependencies for deployment.
2. Deploy code to remote VM:
`cap <environment> deploy`
`<environment>` is either `dev`, `stage` or `prod`, as specified in `config/deploy/`.
This will also get our (Stanford's) latest configuration settings.
## (Stanford) Production Use
The deployment command shown above creates an executable Java application. After logging onto the production server you may run wasapi-downloader by following these steps:
```
cd wasapi-downloader/current/
./build/install/wasapi-downloader/bin/wasapi-downloader
```
The `--help` option will display a message listing all of the arguments:
`./build/install/wasapi-downloader/bin/wasapi-downloader --help`
Some of the available command line arguments have a default value set in `config/settings.properties`. `--help` will display the current configuration as taken from the `settings.properties` file. Command line arguments will override values set from `config/settings.properties`.
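The override behaviour described above can be pictured with a minimal Java sketch. This is not the wasapi-downloader implementation: the class `SettingsOverrideSketch`, the method `resolve`, and the property keys are hypothetical, and the sketch assumes that each property key has the same name as its command line flag and that every flag is followed by a value.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical illustration only; names and keys are not taken from the project source.
final class SettingsOverrideSketch {

    // Loads defaults from properties-formatted text, then lets "--name value"
    // pairs from the command line replace any default with the same name.
    static Properties resolve(String propertiesText, String[] args) throws IOException {
        Properties settings = new Properties();
        settings.load(new StringReader(propertiesText));
        // Assumes arguments come in flag/value pairs (so a bare flag such as --help is not handled).
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (args[i].startsWith("--")) {
                settings.setProperty(args[i].substring(2), args[i + 1]);
            }
        }
        return settings;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of config/settings.properties (dummy values).
        String defaults = "outputBaseDir=/hypothetical/default/dir\ncollectionId=123\n";
        Properties resolved = resolve(defaults, new String[] {"--outputBaseDir", "/tmp/"});
        System.out.println(resolved.getProperty("outputBaseDir")); // overridden by the argument
        System.out.println(resolved.getProperty("collectionId"));  // default kept
    }
}
```

In this sketch a value given on the command line always wins, and any setting not mentioned on the command line keeps the value from the properties text.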
### Common Usage Examples
For many users of the production instance of wasapi-downloader, the following examples will be relevant/helpful:
#### Download all crawl files available across all collections available to your account (less likely)
`./build/install/wasapi-downloader/bin/wasapi-downloader`
#### Download all crawl files available for a certain collection (more likely)
`./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 8001`
#### Download all crawl files for a certain collection (ex. 8001) after a certain date (ex: 2014)
`./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 8001 --crawlStartAfter 2014-01-01`
#### Download all crawl files for a certain collection (ex. 8001) created before a certain date (ex: 2012) into a particular output directory (ex. `/tmp/`, which overrides the default value from `config/settings.properties`):
`./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 8001 --crawlStartBefore 2012-01-01 --outputBaseDir /tmp/`
#### Download all crawl files for a certain collection (ex. 8001) created after a certain date (ex: 2012) and before a certain date (ex: 2014)
`./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 8001 --crawlStartAfter 2012-01-01 --crawlStartBefore 2014-01-01`
#### Download a single file:
`./build/install/wasapi-downloader/bin/wasapi-downloader --filename ARCHIVEIT-5425-MONTHLY-JOB302671-20170526114117181-00049.warc.gz`
**Note:** When a `--filename` argument is present, all other request parameters (crawl start/end, collection ID, job ID) are ignored.
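The precedence rule in this note can be illustrated with a small Java sketch. It is an assumption-laden illustration, not the project's code: `RequestParamsSketch` and `effectiveParams` are hypothetical names, the map keys simply reuse the command line flag names shown in this README (not the actual WASAPI query parameter names), and only the filters that appear in the examples above are modelled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not taken from the wasapi-downloader source.
final class RequestParamsSketch {

    // Returns the filters that would take effect. A non-empty filename
    // short-circuits everything else, mirroring the note above.
    static Map<String, String> effectiveParams(String filename, String collectionId,
                                               String crawlStartAfter, String crawlStartBefore) {
        Map<String, String> params = new LinkedHashMap<>();
        if (filename != null && !filename.isEmpty()) {
            params.put("filename", filename);
            return params; // all other filters are ignored
        }
        if (collectionId != null) {
            params.put("collectionId", collectionId);
        }
        if (crawlStartAfter != null) {
            params.put("crawlStartAfter", crawlStartAfter);
        }
        if (crawlStartBefore != null) {
            params.put("crawlStartBefore", crawlStartBefore);
        }
        return params;
    }

    public static void main(String[] args) {
        // The filename is given, so the collection and date filters are dropped.
        System.out.println(effectiveParams("example.warc.gz", "8001", "2014-01-01", null));
        // No filename, so the collection and date filters are kept.
        System.out.println(effectiveParams(null, "8001", "2014-01-01", null));
    }
}
```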