https://github.com/lblod/harvest-collector-service
Microservice that generates harvest collections by following navigational properties in downloaded file addresses
- Host: GitHub
- URL: https://github.com/lblod/harvest-collector-service
- Owner: lblod
- License: mit
- Created: 2019-06-07T09:17:17.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-05-08T09:58:31.000Z (over 1 year ago)
- Last Synced: 2024-10-30T09:37:50.039Z (3 months ago)
- Topics: mu-service
- Language: JavaScript
- Homepage:
- Size: 135 KB
- Stars: 0
- Watchers: 18
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# harvest-collector-service
Microservice that creates harvest collections by parsing downloaded HTML files and triggering downloads of additional file addresses by following navigational properties.

The following navigational properties currently trigger a new download:
* http://lblod.data.gift/vocabularies/besluit/linkToPublication

## Usage
### Docker-compose
Add the following snippet in your `docker-compose.yml`:
```
harvest:
  image: lblod/harvest-collector-service
  volumes:
    - ./data/files:/share
```

The `/share` volume contains the downloaded files (as downloaded by the `lblod/download-url-service`).
### Delta configuration
```
{
  match: {
    predicate: {
      type: 'uri',
      value: 'http://www.w3.org/ns/adms#status'
    },
    object: {
      type: 'uri',
      value: 'http://lblod.data.gift/file-download-statuses/success'
    }
  },
  callback: {
    method: 'POST',
    url: 'http://harvest-collector/delta',
  },
  options: {
    resourceFormat: 'v0.0.1',
    gracePeriod: 1000,
    ignoreFromSelf: true
  }
}
```

## Model
The service harvests collections, each containing a set of remote data objects that are related through navigational properties. E.g.:
```
@prefix mu: .
@prefix harvesting: .
@prefix dct: .

a harvesting:HarvestingCollection ;
  mu:uuid "326ce8f6-9567-4e1d-ab3d-cda23d143701" ;
  dct:hasPart ;
  dct:hasPart ;
  dct:hasPart .
```

## API
### POST /harvest
Trigger a new harvest round. Each harvest round consists of:
1. Creating a new harvest collection for newly downloaded file addresses
2. Inspecting navigational properties in newly downloaded files and triggering additional downloads attached to the same harvest collection. These additional downloads will be harvested in a following round (after the download has finished successfully)
3. Updating the state of harvest collections for which all files have been harvested

### Cron job trigger
In case you need to harvest tasks that you created manually, for example via migrations, a cron job trigger is available. This can be useful when you need to harvest a large number of URLs that cannot be reached via the `linkToPublication` property.
To trigger it, you can use the following environment variables:
```
ALLOW_CRON_JOB (default 'false'): true if we should run the cron jobs, false otherwise
CRON_FREQUENCY (default '*/5 * * * *'): cron job frequency
SCHEDULED_TASK_CREATOR (default 'http://lblod.data.gift/services/migrations'): URI of the creator of the scheduled collecting tasks
```

## Restrictions
The service expects HTML files containing at least a `body` tag.
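
To illustrate this restriction together with the link-following step, here is a minimal sketch in plain Node.js. It is not the service's actual implementation. It assumes that `linkToPublication` appears as an RDFa `property` attribute with the full URI on an `<a>` element, and the function names (`hasBodyTag`, `readAttribute`, `extractPublicationLinks`) are hypothetical. A real RDFa parser would also resolve prefixed property names, which this sketch does not.

```javascript
// Hypothetical sketch: names and parsing strategy are assumptions,
// not taken from the service's source code.
const LINK_TO_PUBLICATION =
  'http://lblod.data.gift/vocabularies/besluit/linkToPublication';

// Returns true when the document contains an opening <body> tag.
function hasBodyTag(html) {
  return /<body[\s>]/i.test(html);
}

// Reads a double-quoted attribute value from a single opening tag.
function readAttribute(tag, name) {
  const match = tag.match(new RegExp(`\\s${name}\\s*=\\s*"([^"]*)"`, 'i'));
  return match ? match[1] : null;
}

// Collects the href of every <a> element whose `property` attribute
// equals the full linkToPublication URI.
function extractPublicationLinks(html) {
  if (!hasBodyTag(html)) {
    throw new Error('Expected an HTML document with a <body> tag');
  }
  const links = [];
  const anchorTags = html.match(/<a\s[^>]*>/gi) || [];
  for (const tag of anchorTags) {
    const property = readAttribute(tag, 'property');
    const href = readAttribute(tag, 'href');
    if (property === LINK_TO_PUBLICATION && href) {
      links.push(href);
    }
  }
  return links;
}
```

Each returned address would correspond to an additional download attached to the same harvest collection. Input without a `body` tag is rejected up front.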