https://github.com/datasets-org/datasets_server

Datasets server
https://github.com/datasets-org/datasets_server

Last synced: about 2 months ago
JSON representation

Datasets server

Host: GitHub
URL: https://github.com/datasets-org/datasets_server
Owner: datasets-org
License: mit
Created: 2016-12-23T15:45:08.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2018-03-31T20:54:44.000Z (about 8 years ago)
Last Synced: 2025-03-08T07:43:18.218Z (over 1 year ago)
Language: Python
Homepage:
Size: 45.9 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 8
Metadata Files:
- Readme: readme.md
- License: LICENSE

Awesome Lists containing this project

README

# Datasets server

Datasets project solves problem with organizing data sets. It also tries to
ensure experiment consistency and repeatability by data set **immutability**,
unique identification, usage and change logs.

This project is inspired by: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45390.pdf

Data set discovery and identification is based on presence of the file `dataset
.yaml`.

## Complementary projects
- https://github.com/tivvit/datasets
- https://github.com/tivvit/datasets_browser

## Data set properties
- `id` - UUID
- `name` - Human readable name
- `maintainer` - Email to person responsible for the data set
- `tags` - Data set tags for simple identification
- `internal` - Denotes if the data set is not publicly available
- `data` - Paths to folders with data (inside the data set path)
- `url` - Public url for the data set
- `from` - id of the parent data set

Generated:
- `type` - "fs" for the filesystem
- `changelog` - Changes detected in the data set
- `usages` - Reported usages (from the lib)

Generated from the fs:

Fields starting with `_` are paths in the container (changed based on
`storage_replace` to final fields - `path` ...)
- `paths`, `_paths` - Path to data set
- `links`, `_links` - Symlinks pointing to the data set
- `markdowns`, `_markdowns` - Markdown files found in the data set
- `characteristics` - Generated statistics of the data set (size, number of
files, extensions)

## Config
- `database_path` - Where the LMDB should be stored
- `iter_file_limit` - When searching `dataset.yaml` folders with more then
this count won't be scanned
- `datasets` - paths to folders used for scanning
- `storage_replace` - Replace the container paths with the real ones

## Storage types
Data sets may be added trough the API or with the file system analysis. Other
sources like HDFS or databases may be added.

The system is currently used with distributed FS (MooseFS - similar to GFS or
Ceph) mounted with FUSE. Local FS will also work great.

## Database
Any key-value database is ok. Right now local **LMDB** is used.

Other database may be used by adding connector with `storage.Storage` interface.
Aerospike will be officially supported soon.

## Todo
- data set monitoring + email notifications

## Development
```sh
docker-compose up dev
```

Feel free to contribute.

Released under [MIT license](https://github.com/tivvit/datasets_server/blob/master/LICENSE)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/datasets-org/datasets_server

Awesome Lists containing this project

README