https://github.com/eth-library/dataset-dj

file aggregation and compression

# DataDJ

DataDJ is a value-adding service for collections and archives, initially conceived at ETH Library Lab and currently in development at ETH Library. It provides more convenient and efficient access to batches of digitised records and files. The service works in conjunction with collections' existing websites and search portals: the collection's website forwards the user's request for a list of files to the DataDJ, which gathers and compresses the files and then emails the user a download link.

The sample DataDJ application is available at https://dj-api-ucooq6lz5a-oa.a.run.app/. The requests shown throughout this README are written for the Visual Studio Code [REST Client](https://marketplace.visualstudio.com/items?itemName=humao.rest-client), but they can easily be adapted for other API clients or `curl`.

If you are planning to work on this project, contact us to ask for the detailed internal documentation.

## Quickstart Guide

### 1. Request an archive from a list of files

Edit the request below to include your `email` and the list of `files` that you want to download (note that each entry includes the file path). Additional `meta` information can be passed in the field of the same name. The endpoint can also be called using `curl`. Once the files have been gathered and compressed, you should receive an email with the download link. This endpoint is meant to be called by a data collection, which forwards the files requested by a user and specifies the user's email address. Please note that `archiveID` remains empty in the current iteration of the service.

Example:
```http
POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key

{
"email": "[email protected]",
"archiveID": "",
"content": [
{
"sourceID": "0ff529e3",
"files": ["/test/dir/file1", "/test/dir/file2"]
},
{
"sourceID": "eba48cdb",
"files": ["/test/dir/file3", "/test/dir/file4"]
}],
"meta": "{meta: information}"
}
```
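
The request body above can also be assembled programmatically. The following Python sketch builds the same JSON payload and prepares the POST request using only the standard library. The helper names `build_archive_request` and `prepare_post` and the example address are illustrative assumptions, not part of the DataDJ code base, and `service_key` stands in for a real service key. The actual network call is left commented out.

```python
import json
import urllib.request

# Sample deployment named in this README.
API_BASE = "https://dj-api-ucooq6lz5a-oa.a.run.app"


def build_archive_request(email, content, archive_id="", meta=""):
    """Build the JSON body for POST /archive.

    `content` maps a sourceID to the list of file paths requested from it.
    """
    return {
        "email": email,
        "archiveID": archive_id,  # left empty in the current iteration
        "content": [
            {"sourceID": source_id, "files": list(files)}
            for source_id, files in content.items()
        ],
        "meta": meta,
    }


def prepare_post(path, body, key):
    """Return a urllib Request carrying the JSON body and the bearer key."""
    return urllib.request.Request(
        API_BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    body = build_archive_request(
        "user@example.org",  # hypothetical address
        {
            "0ff529e3": ["/test/dir/file1", "/test/dir/file2"],
            "eba48cdb": ["/test/dir/file3", "/test/dir/file4"],
        },
        meta="{meta: information}",
    )
    request = prepare_post("/archive", body, "service_key")
    # Sending requires a valid key; uncomment to perform the call.
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
    print(json.dumps(body, indent=2))
```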
---
## API Endpoints

### Check if DataDJ service is live (Public)

```http
GET https://dj-api-ucooq6lz5a-oa.a.run.app/ping
```

### Register Services, Taskhandler and Sources
#### 1. Register new Service (Admin)

An admin can task the DJ to generate a new service token/key and to send an email with a redeem link to the specified email address. The service key is required by collections to interact with the DJ for anything related to creating and altering archives.

```http
POST https://dj-api-ucooq6lz5a-oa.a.run.app/admin/createKeyLink
Content-Type: application/json
Authorization: Bearer admin_key

{
"email": "[email protected]"
}
```

#### 2. Register new Taskhandler (Admin)

A taskhandler is the part of the DataDJ responsible for gathering and compressing the requested files, as well as sending an email containing a download link to the user who requested the files. In order to interact with the API part of the DataDJ, the taskhandler requires a handler token/key similar to a service key. This key can be generated by an admin via the following request and, for now, has to be handed manually to the operator of the taskhandler in question.

```http
POST https://dj-api-ucooq6lz5a-oa.a.run.app/admin/registerHandler
Content-Type: application/json
Authorization: Bearer admin_key
```

#### 3. Register new Source (Service)

A source is a representation of a collection holding files to be downloaded. Its purpose is to identify which files have to be gathered from where, and to keep track of the origin of every file so that an overview of each source's contribution to the final archive can be provided. The registration request returns a source ID, which subsequently has to be used to uniquely identify the source when interacting with the DataDJ.

```http
POST https://dj-api-ucooq6lz5a-oa.a.run.app/source
Content-Type: application/json
Authorization: Bearer service_key

{
"name": "Test-Source-One",
"Organisation": "ETHZ"
}
```

### Creating, modifying or downloading archives (Service)
https://dj-api-ucooq6lz5a-oa.a.run.app/archive

This endpoint expects a request that contains four fields:

```json
{
"email":"",
"archiveID":"",
"files":[],
"meta": ""
}
```
`email`, `archiveID` and `meta` are strings, whereas `files` is a list of strings containing the names of the files. Note that the example requests below pass the file lists inside a `content` array, grouped by `sourceID`.
Depending on which fields are left empty, the API triggers different operations. For now, only option 4 is used in tests; the other options are kept for future use.
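
The mapping from empty fields to operations, as laid out in the four subsections of this part of the README, can be summarised in a few lines of Python. This is only a client-side illustration of the documented rules, not the API's actual implementation; the function name `classify_archive_request` is hypothetical, and how the service treats combinations not listed in this README is not documented here.

```python
def classify_archive_request(email, archive_id, files):
    """Return the documented operation (options 1 to 4) a request would trigger."""
    has_email = bool(email)
    has_archive = bool(archive_id)
    has_files = bool(files)

    if not has_email and not has_archive and has_files:
        return "1: create archive"
    if not has_email and has_archive and has_files:
        return "2: add files to archive"
    if has_email and has_archive and not has_files:
        return "3: download archive"
    if has_email and not has_archive and has_files:
        return "4: directly download files"
    # Any other combination is not described in this README.
    return "undocumented combination"


if __name__ == "__main__":
    files = ["/test/dir/file1", "/test/dir/file2"]
    print(classify_archive_request("", "", files))
    print(classify_archive_request("", "e01fd941", files))
    print(classify_archive_request("user@example.org", "e01fd941", []))
    print(classify_archive_request("user@example.org", "", files))
```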

#### 1. Create an archive from a list of files

Both `email` and `archiveID` are left empty, whereas `files` contains the names of the files the archive should be initialised with.

Example:
```http
POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key

{
"email": "",
"archiveID": "",
"content": [
{
"sourceID": "0ff529e3",
"files": ["/test/dir/file1", "/test/dir/file2"]
},
{
"sourceID": "eba48cdb",
"files": ["/test/dir/file3", "/test/dir/file4"]
}],
"meta": "{meta: information}"
}
```

#### 2. Add a list of files to an archive

`email` is left empty. `archiveID` contains the identifier of a previously created archive and `files` the list of files you want to add to the archive.

Example:
```http
POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key

{
"email": "",
"archiveID": "e01fd941",
"content": [
{
"sourceID": "0ff529e3",
"files": ["/test/dir/file1", "/test/dir/file2"]
},
{
"sourceID": "eba48cdb",
"files": ["/test/dir/file3", "/test/dir/file4"]
}],
"meta": "{meta: information}"
}
```

#### 3. Download an archive

`email` contains the email address the download link is sent to, `archiveID` specifies the archive you want to download and `files` is left empty. The DataDJ will send you a download link that allows you to download the archive as a .zip file.

Example:
```http
POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key

{
"email": "[email protected]",
"archiveID": "e01fd941",
"content": [],
"meta": ""
}
```

#### 4. Directly download a list of files as archive

`email` contains the email address the download link is sent to, `archiveID` is left empty and `files` contains the names of the files you want to download.
The DJ creates an archive of the files in the request and returns its identifier in the response, in case that archive needs to be accessed or modified later on. However, it is not necessary to separately trigger the notification containing the download link, as this happens automatically.

Example:
```http
POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key

{
"email": "[email protected]",
"archiveID": "",
"content": [
{
"sourceID": "0ff529e3",
"files": ["/test/dir/file1", "/test/dir/file2"]
},
{
"sourceID": "eba48cdb",
"files": ["/test/dir/file3", "/test/dir/file4"]
}],
"meta": "{meta: information}"
}
```

Currently, the `/archive` endpoint returns an object describing the order that was created for the archive in question. Orders are objects that tell the taskhandlers which archives should be downloaded.
```json
{
"orderID": "a5777ffb",
"archiveID": "4afc3f67",
"email": "[email protected]",
"date": "2022-12-14 16:27:28.967665178 +0000 UTC m=+67114.216955617",
"status": "opened",
"sources": [
"0ff529e3"
]
}
```
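
A client may want to turn this response into a typed object. The sketch below does so in Python. It assumes that the `date` field follows the default string format of Go's `time.Time`, including the trailing monotonic clock reading (`m=+...`), which is what the sample value suggests. The `Order` class, the function names and the example address are illustrative and not part of the DataDJ code base.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Order:
    order_id: str
    archive_id: str
    email: str
    date: datetime
    status: str
    sources: list


def parse_go_timestamp(text):
    """Parse a timestamp such as '2022-12-14 16:27:28.967665178 +0000 UTC m=+67114.2'."""
    # Drop the monotonic clock reading (" m=+...") if present.
    text = text.split(" m=")[0]
    # Keep date, time and numeric offset; the zone abbreviation is ignored.
    date_part, time_part, offset = text.split(" ")[:3]
    if "." in time_part:
        # datetime supports microseconds only, so cut nanoseconds to 6 digits.
        whole, fraction = time_part.split(".")
        time_part = f"{whole}.{fraction[:6]}"
        fmt = "%Y-%m-%d %H:%M:%S.%f %z"
    else:
        fmt = "%Y-%m-%d %H:%M:%S %z"
    return datetime.strptime(f"{date_part} {time_part} {offset}", fmt)


def parse_order(raw):
    """Convert the JSON text returned by /archive into an Order."""
    data = json.loads(raw)
    return Order(
        order_id=data["orderID"],
        archive_id=data["archiveID"],
        email=data["email"],
        date=parse_go_timestamp(data["date"]),
        status=data["status"],
        sources=list(data["sources"]),
    )


if __name__ == "__main__":
    sample = json.dumps({
        "orderID": "a5777ffb",
        "archiveID": "4afc3f67",
        "email": "user@example.org",  # hypothetical address
        "date": "2022-12-14 16:27:28.967665178 +0000 UTC m=+67114.216955617",
        "status": "opened",
        "sources": ["0ff529e3"],
    })
    print(parse_order(sample))
```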

### Inspecting an archive (Service)

https://dj-api-ucooq6lz5a-oa.a.run.app/archive/id

This endpoint allows you to inspect the contents of an archive `id` either in the browser or via an API client. The response is a JSON object representing the archive.

Example:
```http
GET https://dj-api-ucooq6lz5a-oa.a.run.app/archive/a2e11165
Content-Type: application/json
Authorization: Bearer service_key
```
Example Response:
```json
{
"id": "a2e11165",
"content": [
{
"sourceID": "0ff529e3",
"files": [
"/test/dir/file1",
"/test/dir/file2"
]
},
{
"sourceID": "eba48cdb",
"files": [
"/test/dir/file3",
"/test/dir/file4"
]
}
],
"meta": "{meta: information}",
"timeCreated": "2022-12-09 13:31:43.320372 +0100 CET m=+305.508934168",
"timeUpdated": "",
"status": "opened",
"sources": [
"0ff529e3",
"eba48cdb"
]
}
```
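
Because each entry in `content` carries its `sourceID`, a client can derive an overview of how many files each source contributes to the archive. The following Python sketch shows one way to do this from a response like the one above. The function name `files_per_source` is hypothetical, and the embedded JSON is an abbreviated copy of the example response.

```python
import json

# Abbreviated copy of the example response above.
ARCHIVE_JSON = """
{
  "id": "a2e11165",
  "content": [
    {"sourceID": "0ff529e3", "files": ["/test/dir/file1", "/test/dir/file2"]},
    {"sourceID": "eba48cdb", "files": ["/test/dir/file3", "/test/dir/file4"]}
  ],
  "status": "opened",
  "sources": ["0ff529e3", "eba48cdb"]
}
"""


def files_per_source(archive):
    """Map each sourceID to the number of files it contributes."""
    counts = {}
    for entry in archive.get("content", []):
        source_id = entry["sourceID"]
        counts[source_id] = counts.get(source_id, 0) + len(entry["files"])
    return counts


if __name__ == "__main__":
    archive = json.loads(ARCHIVE_JSON)
    for source_id, count in files_per_source(archive).items():
        print(f"{source_id}: {count} file(s)")
```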
---

# Local Development (Outdated)

1. Make a copy of `.env.example` and save it as `.env.local`.
1. Replace the example directory paths, bucket names and other settings as needed.

_Option a: run with Go_

Download and run the Redis image with Docker:
```
docker pull redis
docker run --name dj-redis -p 6379:6379 -d redis
```

_Start the task handler_

1. Open a terminal in the project root.
1. Export all of the variables in the `.env.local` file.
1. Run the task handler.

```
source .env.local
export $(cut -d= -f1 .env.local)
go run ./taskHandler/*.go
```

_Start the API_

1. Open a separate terminal in the project root.
1. Export all of the variables in the `.env.local` file.
1. Run the API.

```
source .env.local && export $(cut -d= -f1 .env.local)
go run ./api/*.go
```

Note that for any changes in the environment file to take effect, you must export the variables again and restart that part of the application.

_Option b: run with Docker (to be completed)_

To run the publisher and subscriber applications using Docker, include the path to the `.env.local` file in the `docker run` command:
```
docker run --env-file=./.env.local -p 8080:8080 data-dj-image
```
### Docker commands
- `docker build --platform=linux/amd64 -f Dockerfile.api -t dj-api-amd64 .`
- `docker tag dj-api-amd64:0.0.1 europe-west6-docker.pkg.dev/data-dj-2021/dj-docker-repo/dj-api:0.0.1`
- `docker push europe-west6-docker.pkg.dev/data-dj-2021/dj-docker-repo/dj-api:0.0.1`

### Steps for Google Cloud Run
- Follow instructions: [https://zahadum.notion.site/Google-Cloud-4c32dcbe1cfb4b479e8680e852ef0d84](https://zahadum.notion.site/Google-Cloud-4c32dcbe1cfb4b479e8680e852ef0d84)

Example request for creating a key link against an API listening on `0.0.0.0:8765`:

`curl -X POST "0.0.0.0:8765/admin/createKeyLink" -H "Authorization: Bearer $ADMIN_KEY" -H "Content-Type: application/json" -d '{"email":"[email protected]"}'`

# Authentication

- A token is generated.
- The hashed token is saved in MongoDB.
- A middleware function validates the token during requests.
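
The token flow described above can be sketched as follows. This is a minimal Python illustration and not the project's Go implementation: the use of SHA-256, the in-memory set standing in for the MongoDB collection, and all function names are assumptions made for the example.

```python
import hashlib
import hmac
import secrets


def hash_token(token):
    """Hash a token so that only the digest needs to be stored."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def generate_token():
    """Create a random token and return it together with its hash."""
    token = secrets.token_urlsafe(32)
    return token, hash_token(token)


def bearer_token(header_value):
    """Extract the token from an 'Authorization: Bearer <token>' header value."""
    scheme, _, token = header_value.partition(" ")
    return token if scheme == "Bearer" and token else None


def is_valid(presented, stored_hashes):
    """Check a presented token against the stored hashes (middleware step)."""
    candidate = hash_token(presented)
    # compare_digest avoids leaking information through comparison timing.
    return any(hmac.compare_digest(candidate, stored) for stored in stored_hashes)


if __name__ == "__main__":
    stored = set()  # stands in for the database collection
    token, digest = generate_token()
    stored.add(digest)

    presented = bearer_token(f"Bearer {token}")
    print(is_valid(presented, stored))      # valid token
    print(is_valid("not-a-token", stored))  # unknown token
```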

Set a TTL index on the MongoDB collection so that a document is deleted after the given number of seconds.
The index does not apply to documents that lack the indexed field: for example, a document without `expiryRequestedDate` will not be deleted.

`db.apiKeys.createIndex( { "expiryRequestedDate": 1 }, { expireAfterSeconds: 3600 } )`
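
To make the expiry rule concrete, the following Python sketch models when a document becomes eligible for deletion under such an index. It only illustrates the rule and does not talk to MongoDB; the function name `is_expired` is hypothetical. In a real deployment, deletion is carried out by a MongoDB background task, so a document may remain for a short while after it has become eligible.

```python
from datetime import datetime, timedelta, timezone

EXPIRE_AFTER_SECONDS = 3600  # value used in the index definition above


def is_expired(doc, now, ttl_seconds=EXPIRE_AFTER_SECONDS):
    """Return True if the document is eligible for deletion at time `now`."""
    requested = doc.get("expiryRequestedDate")
    if not isinstance(requested, datetime):
        # Documents without a date in the indexed field never expire.
        return False
    return now >= requested + timedelta(seconds=ttl_seconds)


if __name__ == "__main__":
    now = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
    old = {"expiryRequestedDate": now - timedelta(hours=2)}
    recent = {"expiryRequestedDate": now - timedelta(minutes=10)}
    permanent = {"name": "key without expiry request"}

    print(is_expired(old, now))        # past the one-hour window
    print(is_expired(recent, now))     # still within the window
    print(is_expired(permanent, now))  # field missing, never deleted
```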

# Useful Reference Material for Go

- [Learning Go](https://learning.oreilly.com/library/view/learning-go/9781492077206/) by Jon Bodner: general reference for programming in Go (types, syntax, imports, etc.); see Ch. 13 for writing tests.

- [Cloud Native Go](https://learning.oreilly.com/library/view/cloud-native-go/9781492076322)

# Material for MongoDB

http://www.inanzzz.com/index.php/post/g7e8/running-mongodb-migration-script-at-the-docker-startup