Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/datopian/datahub-git-based
⚙️ A design for a next generation, fully-git(hub) + cloud based DataHub.
https://github.com/datopian/datahub-git-based
Last synced: about 2 months ago
JSON representation
⚙️ A design for a next generation, fully-git(hub) + cloud based DataHub.
- Host: GitHub
- URL: https://github.com/datopian/datahub-git-based
- Owner: datopian
- Created: 2020-06-17T21:19:21.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-06-17T21:21:23.000Z (over 4 years ago)
- Last Synced: 2024-08-08T18:25:46.868Z (5 months ago)
- Homepage: https://tech.datopian.com/
- Size: 7.81 KB
- Stars: 8
- Watchers: 12
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-starred - datopian/datahub-git-based - ⚙️ A design for a next generation, fully-git(hub) + cloud based DataHub. (others)
README
A next generation, fully-git(hub) based DataHub.
Git-based is definite. Initial KISS approach means going fully GitHub based so GitHub provides MetaStore *and* HubStore.
Here's a walk through of the desired experience. We walk through 3 cases:
* "Small" data files -- these are stored in git.
* "Big" data files (> a few Mb) -- these are stored via git-lfs + giftless into your cloud of choice
* "Dependencies" -- I want to use external data in my project that I don't manage or want to store directly **not specced yet**# Small Data Example
## The Dataset
Suppose you have a dataset on your hard disk like:
```
my-dataset/
datapackage.json
mydata.csv
README.md
````datapackage.json` contents:
```json=
{
"name": "my-dataset",
"title" "My awesome dataset",
"resources": [
{
"name": "mydata",
"path": "mydata.csv"
}
]
}
````mydata.csv`
```
A,B,C
1,2,3
````README.md`
```console=
$ echo "This is my awesome dataset. Check it out now!" > README.md
```## Pushing a Dataset
```console=
$ cd my-dataset
$ git init
$ git add *
$ git commit "My new dataset"
$ git push https://github.com/myorg/my-dataset
```If we added lfs setup we need an lfsinfo:
```
echo "..." > .lfsinfo
```## Then on DataHub.io
Showcase page:
`$ open https://datahub.io/@myorg/my-dataset/`
# Big(ger) data ... (too big on git)
Let's modify our previous example to have a large file:
```
my-dataset/
datapackage.json
mybigdata.csv
README.md
````datapackage.json` contents:
```json=
{
"name": "my-dataset",
"title" "My awesome dataset",
"resources": [
{
"name": "mybigdata",
"path": "mybigdata.csv"
}
]
}
```## Add support for storing large data in the cloud
Set a custom LFS server that will then hand out credentials for you to store the file in the cloud
```
# see https://github.com/git-lfs/git-lfs/wiki/Tutorial#lfs-url
# this uses the default datahub.io provided cloud storage
$ git config -f .lfsconfig lfs.url https://giftless.datahub.io/
$ git add .lfsconfig
$ git commit -m "Setting up custom storage for my big data files on datahub.io storage"
```:::info
It would be really cool to allow people to bring their own storage if they wanted e.g. `lfs.url` is```
https://giftless.datahub.io/s3/mybucket/
```
:::## Push the Dataset
```console=
$ git lfs track mybigdata.csv
$ git add *
$ git commit "My new dataset"
$ git push https://github.com/myorg/my-dataset
```# API interaction
NB: these API examples are more based on classic DataHub API setup rather than pure git(hub) based. However, wanted to include them for reference and for the future.
## Getting Started (as a client)
You can access the Web API using any HTTP client, for example `curl`.
## Auth Token
First, you must obtain a signed access token. For example, using a token signed by `ckanext-authz-service` and the CKAN 2.x API. In the following examples, we'll assume this token is saved as `$API_TOKEN`.
You will add to every request an authorization header like:
```
curl -H "Authorization: Bearer $API_TOKEN"
```We will omit this header in the examples that follow for the sake of simplicity. But please add it back in.
## Creating a dataset
A basic sequence
```
# create your project myorg/mydataset
curl -X POST https://datahub.io/api/projects/myorg%2Fmydataset# you have an empty dataset!
curl -X GET https://myckan.org/api/projects/:id/dataset
{
// TODO: is name without @ character
name: '@myorg/mydataset',
id: 'unique-identifier'
}curl -X PUT https://myckan.org/api/projects/:id/dataset
{
'title': 'My new title'
}curl -X GET https://myckan.org/api/projects/:id/dataset/revisions
```Another way to push a file (not in LFS):
```
curl -X PUT https://myckan.org/api/projects/:id/resources/data%2Fdata.csv -d @data.csv
```The server will take care of adding some metadata to `datapackage.json` pointing to this file and saving this file to storage.
## Access a dataset that is owned by an org I'm a member of
2. List the organizations by calling the org list endpoint in the "hub API"
3. Pick an org I want to access
4. List datasets owned by this org by calling the dataset search API endpoint
5. Get an authz token to read the dataset and resources I want to access from the authz endpoint
6. Get the selected dataset metadata from the metastore service (this is what *metastore-service* is about)
7. Get the resources pointed out by this dataset from blob storage service (currently `giftless`)Ideally, over time, all of these endpoints will be behind the same API gateway, consume the same identity / authorization tokens, etc.
# What is the MVP plan?
Outcome visioning: I can do either the paths above and view my dataset on next.datahub.io
```mermaid
graph TDstart[Start]
push[Push small & big dataset]
showcase[View a dataset showcase]start --> push
start --> showcasepush --> workflows[Workflows]
```* [ ] Push
* [x] Push "small" data **working today pretty much 😉 - the only thing beyond basic git is adding datapackage.json which you literally do "by hand" for now**
* [ ] Push flow works with big files to the cloud
* [ ] Deploy a giftless server https://github.com/datopian/giftless
* [ ] Have a cloud storage provider and e.g. a bucket
* [ ] Have giftless backend for this
* [ ] Push my dataset (as per above)
* [ ] Verify it worked (someone else can clone!)
* [ ] What about auth? **Let's make giftless hand out tokens all the time atm ...**
* [ ] Showcase: showcase a dataset i.e. i can visit datahub.io/github.com/datasets/my-small-test-dataset/ and ditto for big dataset and it looks something like datahub.io/core/finance-vix today ...
* [ ] Choose a JS based frontend framework
* [ ] Mock out the page
* [ ] Wire it up
* [ ] Write a backend client (and abstraction) library e.g. `getDatasetMetadata(identifier, ref='master'), getFileUrl(dataset, ...)`
* [ ] Deploy
* [ ] Workflows: build workflows that are triggered on each change or other times e.g. "Derived data / alternate formats" (build me zip, build me json for this csv), data validation, ...
* [ ] Abuse github workflows for now ...
* [ ] Have a pattern and core library for this ..
* [ ] Plan out how to move to cloud based airflow or Beam or similar ...
* [ ] Monitoring and reporting (e.g. minutes used, what's failing etc)
* [ ] Integrate into a user dashboard on datahub.io