{"id":13930044,"url":"https://github.com/datopian/datahub-git-based","last_synced_at":"2025-07-02T11:06:30.125Z","repository":{"id":141959505,"uuid":"273080388","full_name":"datopian/datahub-git-based","owner":"datopian","description":"⚙️ A design for a next generation, fully-git(hub) + cloud based DataHub. ","archived":false,"fork":false,"pushed_at":"2020-06-17T21:21:23.000Z","size":8,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":12,"default_branch":"master","last_synced_at":"2024-08-08T18:25:46.868Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://tech.datopian.com/","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datopian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-06-17T21:19:21.000Z","updated_at":"2024-03-12T03:43:16.000Z","dependencies_parsed_at":"2024-01-17T06:12:08.539Z","dependency_job_id":"fdc64584-3b49-4bdf-adb2-ef51f887b3db","html_url":"https://github.com/datopian/datahub-git-based","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datopian%2Fdatahub-git-based","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datopian%2Fdatahub-git-based/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datopian%2Fdatahub-git-based/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datopian%2Fdatahub-git-based/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datopian","dow
nload_url":"https://codeload.github.com/datopian/datahub-git-based/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226607585,"owners_count":17658478,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-07T18:02:58.286Z","updated_at":"2024-11-26T19:30:47.053Z","avatar_url":"https://github.com/datopian.png","language":null,"readme":"A next generation, fully-git(hub) based DataHub.\n\nGit-based is definite. Initial KISS approach means going fully GitHub based so GitHub provides MetaStore *and* HubStore.\n\nHere's a walkthrough of the desired experience. We walk through 3 cases:\n\n* \"Small\" data files -- these are stored in git.\n* \"Big\" data files (\u003e a few MB) -- these are stored via git-lfs + giftless into your cloud of choice\n* \"Dependencies\" -- I want to use external data in my project that I don't manage or want to store directly **not specced yet**\n\n# Small Data Example\n\n## The Dataset\n\nSuppose you have a dataset on your hard disk like:\n\n```\nmy-dataset/\n  datapackage.json\n  mydata.csv\n  README.md\n```\n\n`datapackage.json` contents:\n\n```json=\n{\n  \"name\": \"my-dataset\",\n  \"title\": \"My awesome dataset\",\n  \"resources\": [\n    {\n      \"name\": \"mydata\",\n      \"path\": \"mydata.csv\"\n    }\n  ]\n}\n```\n\n`mydata.csv`\n\n```\nA,B,C\n1,2,3\n```\n\n`README.md`\n\n```console=\n$ echo \"This is my awesome dataset. 
Check it out now!\" \u003e README.md\n```\n\n## Pushing a Dataset\n\n```console=\n$ cd my-dataset\n$ git init\n$ git add .\n$ git commit -m \"My new dataset\"\n$ git push https://github.com/myorg/my-dataset\n```\n\nIf we had set up LFS, we would also need an `.lfsconfig`:\n\n```\necho \"...\" \u003e .lfsconfig\n```\n\n## Then on DataHub.io\n\nShowcase page:\n\n`$ open https://datahub.io/@myorg/my-dataset/`\n\n\n# Big(ger) data ... (too big for git)\n\nLet's modify our previous example to have a large file:\n\n```\nmy-dataset/\n  datapackage.json\n  mybigdata.csv\n  README.md\n```\n\n`datapackage.json` contents:\n\n```json=\n{\n  \"name\": \"my-dataset\",\n  \"title\": \"My awesome dataset\",\n  \"resources\": [\n    {\n      \"name\": \"mybigdata\",\n      \"path\": \"mybigdata.csv\"\n    }\n  ]\n}\n```\n\n## Add support for storing large data in the cloud\n\nSet a custom LFS server that will then hand out credentials for you to store the file in the cloud:\n\n```\n# see https://github.com/git-lfs/git-lfs/wiki/Tutorial#lfs-url\n# this uses the default datahub.io provided cloud storage\n$ git config -f .lfsconfig lfs.url https://giftless.datahub.io/\n$ git add .lfsconfig\n$ git commit -m \"Setting up custom storage for my big data files on datahub.io storage\"\n```\n\n:::info\nIt would be really cool to allow people to bring their own storage if they wanted, e.g. `lfs.url` is\n\n```\nhttps://giftless.datahub.io/s3/mybucket/\n```\n:::\n\n## Push the Dataset\n\n```console=\n$ git lfs track mybigdata.csv\n$ git add .\n$ git commit -m \"My new dataset\"\n$ git push https://github.com/myorg/my-dataset\n```\n\n# API interaction\n\nNB: these API examples are based more on the classic DataHub API setup than on the pure git(hub)-based approach. However, we wanted to include them for reference and for the future.\n\n## Getting Started (as a client)\n\nYou can access the Web API using any HTTP client, for example `curl`.\n\n## Auth Token\n\nFirst, you must obtain a signed access token. 
For example, you can use a token signed by `ckanext-authz-service` via the CKAN 2.x API. In the following examples, we'll assume this token is saved as `$API_TOKEN`.\n\nAdd an authorization header like this to every request:\n\n```\ncurl -H \"Authorization: Bearer $API_TOKEN\"\n```\n\nFor the sake of simplicity, we omit this header in the examples that follow, but remember to include it in real requests.\n\n## Creating a dataset\n\nA basic sequence:\n\n```\n# create your project myorg/mydataset\ncurl -X POST https://datahub.io/api/projects/myorg%2Fmydataset\n\n# you have an empty dataset!\ncurl -X GET https://myckan.org/api/projects/:id/dataset\n{\n  // TODO: is name without @ character\n  name: '@myorg/mydataset',\n  id: 'unique-identifier'\n}\n\ncurl -X PUT https://myckan.org/api/projects/:id/dataset\n{\n  'title': 'My new title'\n}\n\ncurl -X GET https://myckan.org/api/projects/:id/dataset/revisions\n```\n\nAnother way to push a file (not in LFS):\n\n```\ncurl -X PUT https://myckan.org/api/projects/:id/resources/data%2Fdata.csv -d @data.csv\n```\n\nThe server will take care of adding metadata to `datapackage.json` pointing to this file and saving the file to storage.\n\n## Access a dataset that is owned by an org I'm a member of\n\n1. List the organizations by calling the org list endpoint in the \"hub API\"\n2. Pick an org I want to access\n3. List datasets owned by this org by calling the dataset search API endpoint\n4. Get an authz token to read the dataset and resources I want to access from the authz endpoint\n5. Get the selected dataset metadata from the metastore service (this is what *metastore-service* is about)\n6. Get the resources referenced by this dataset from the blob storage service (currently `giftless`)\n\nIdeally, over time, all of these endpoints will be behind the same API gateway, consume the same identity / authorization tokens, etc. 
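\n\nAs a rough sketch, the access flow above might look like this with `curl` (the endpoint paths here are illustrative assumptions, not a finalized API):\n\n```\n# list the organizations I belong to\ncurl https://datahub.io/api/orgs\n\n# list datasets owned by a chosen org\ncurl https://datahub.io/api/datasets?org=myorg\n\n# get an authz token scoped to the dataset I want\ncurl -X POST https://datahub.io/api/authz -d 'scope=obj:myorg/mydataset:read'\n\n# fetch the selected dataset's metadata from the metastore service\ncurl https://datahub.io/api/projects/myorg%2Fmydataset/dataset\n\n# resolve resources to storage URLs via the blob storage service (giftless)\ncurl -X POST https://giftless.datahub.io/myorg/mydataset/objects/batch\n```\n\nAs before, each request would carry the `Authorization: Bearer $API_TOKEN` header.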
\n\n\n# What is the MVP plan?\n\nOutcome visioning: I can follow either of the paths above and view my dataset on next.datahub.io.\n\n```mermaid\ngraph TD\n\nstart[Start]\npush[Push small \u0026 big dataset]\nshowcase[View a dataset showcase]\n\nstart --\u003e push\nstart --\u003e showcase\n\npush --\u003e workflows[Workflows]\n```\n\n\n* [ ] Push\n  * [x] Push \"small\" data **working today pretty much 😉 - the only thing beyond basic git is adding datapackage.json, which you literally do \"by hand\" for now**\n  * [ ] Push flow works with big files to the cloud\n    * [ ] Deploy a giftless server https://github.com/datopian/giftless\n      * [ ] Have a cloud storage provider and e.g. a bucket\n      * [ ] Have a giftless backend for this\n    * [ ] Push my dataset (as per above)\n    * [ ] Verify it worked (someone else can clone!)\n      * [ ] What about auth? **Let's make giftless hand out tokens all the time atm ...**\n* [ ] Showcase: showcase a dataset, i.e. I can visit datahub.io/github.com/datasets/my-small-test-dataset/ (and ditto for the big dataset) and it looks something like datahub.io/core/finance-vix today ...\n  * [ ] Choose a JS-based frontend framework\n  * [ ] Mock out the page\n  * [ ] Wire it up\n    * [ ] Write a backend client (and abstraction) library e.g. `getDatasetMetadata(identifier, ref='master'), getFileUrl(dataset, ...)`\n  * [ ] Deploy\n* [ ] Workflows: build workflows that are triggered on each change or at other times, e.g. \"Derived data / alternate formats\" (build me a zip, build me json for this csv), data validation, ...\n  * [ ] Abuse GitHub workflows for now ...\n    * [ ] Have a pattern and core library for this ...\n  * [ ] Plan out how to move to cloud-based Airflow or Beam or similar ...\n  * [ ] Monitoring and reporting (e.g. 
minutes used, what's failing etc)\n    * [ ] Integrate into a user dashboard on datahub.io\n","funding_links":[],"categories":["others"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatopian%2Fdatahub-git-based","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatopian%2Fdatahub-git-based","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatopian%2Fdatahub-git-based/lists"}