{"id":26053575,"url":"https://github.com/datasets-org/datasets_server","last_synced_at":"2026-04-22T07:36:15.584Z","repository":{"id":151900041,"uuid":"77235078","full_name":"datasets-org/datasets_server","owner":"datasets-org","description":"Datasets server","archived":false,"fork":false,"pushed_at":"2018-03-31T20:54:44.000Z","size":47,"stargazers_count":1,"open_issues_count":8,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-08T07:43:18.218Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datasets-org.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-12-23T15:45:08.000Z","updated_at":"2017-12-21T22:02:45.000Z","dependencies_parsed_at":"2023-05-15T00:45:24.619Z","dependency_job_id":null,"html_url":"https://github.com/datasets-org/datasets_server","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/datasets-org/datasets_server","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasets-org%2Fdatasets_server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasets-org%2Fdatasets_server/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasets-org%2Fdatasets_server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasets-org%2Fdatasets_server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dataset
s-org","download_url":"https://codeload.github.com/datasets-org/datasets_server/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasets-org%2Fdatasets_server/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32126222,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-22T00:31:26.853Z","status":"online","status_checked_at":"2026-04-22T02:00:05.693Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-08T07:43:14.496Z","updated_at":"2026-04-22T07:36:15.578Z","avatar_url":"https://github.com/datasets-org.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Datasets server\n\nThe Datasets project addresses the problem of organizing data sets. It also tries to\nensure experiment consistency and repeatability through data set **immutability**,\nunique identification, and usage and change logs.\n\nThis project is inspired by: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45390.pdf\n\nData set discovery and identification are based on the presence of a `dataset.yaml` file.
\n\n## Complementary projects\n- https://github.com/tivvit/datasets\n- https://github.com/tivvit/datasets_browser\n\n## Data set properties\n- `id` - UUID\n- `name` - Human-readable name\n- `maintainer` - Email of the person responsible for the data set\n- `tags` - Data set tags for simple identification\n- `internal` - Denotes that the data set is not publicly available\n- `data` - Paths to folders with data (inside the data set path)\n- `url` - Public URL for the data set\n- `from` - ID of the parent data set\n\nGenerated:\n- `type` - \"fs\" for the filesystem\n- `changelog` - Changes detected in the data set\n- `usages` - Reported usages (from the lib)\n\nGenerated from the fs:\n\nFields starting with `_` hold paths inside the container; they are translated via\n`storage_replace` into the final fields (`path` ...)\n- `paths`, `_paths` - Path to the data set\n- `links`, `_links` - Symlinks pointing to the data set\n- `markdowns`, `_markdowns` - Markdown files found in the data set\n- `characteristics` - Generated statistics of the data set (size, number of\nfiles, extensions)\n\n## Config\n- `database_path` - Where the LMDB should be stored\n- `iter_file_limit` - When searching for `dataset.yaml`, folders containing more than\nthis many files are not scanned\n- `datasets` - Paths to the folders to scan\n- `storage_replace` - Replace the container paths with the real ones\n\n## Storage types\nData sets may be added through the API or by file system analysis. Other\nsources, such as HDFS or databases, may be added.\n\nThe system is currently used with a distributed FS (MooseFS, similar to GFS or\nCeph) mounted with FUSE. A local FS also works.\n\n## Database\nAny key-value database will work. Currently, a local **LMDB** is used.
\n\nOther databases may be used by adding a connector that implements the `storage.Storage` interface.\nAerospike will be officially supported soon.\n\n## Todo\n- data set monitoring + email notifications\n\n## Development\n```sh\ndocker-compose up dev\n```\n\nFeel free to contribute.\n\n## Copyright and License\n\u0026copy; 2016 [Vít Listík](http://tivvit.cz)\n\nReleased under the [MIT license](https://github.com/tivvit/datasets_server/blob/master/LICENSE)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatasets-org%2Fdatasets_server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatasets-org%2Fdatasets_server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatasets-org%2Fdatasets_server/lists"}