https://github.com/cgarciae/dataget

A framework-agnostic datasets library for Machine Learning research and education.

Host: GitHub
URL: https://github.com/cgarciae/dataget
Owner: cgarciae
License: mit
Created: 2017-04-10T18:49:49.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2022-12-08T12:21:50.000Z (about 2 years ago)
Last Synced: 2024-09-18T05:57:46.869Z (4 months ago)
Language: Python
Homepage: https://cgarciae.github.io/dataget
Size: 1.19 MB
Stars: 18
Watchers: 8
Forks: 7
Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

README

# Dataget

Dataget is an easy-to-use, framework-agnostic dataset library that gives you quick access to a collection of Machine Learning datasets through a simple API.

Main features:

* **Minimal**: Downloads entire datasets with just one line of code.

* **Framework Agnostic**: Loads data as `numpy` arrays or `pandas` dataframes, which can be easily used with the majority of Machine Learning frameworks.

* **Transparent**: By default, stores the data inside your current project so you can easily inspect it.

* **Memory Efficient**: When a dataset doesn't fit in memory, it returns metadata instead so you can load the data iteratively.

* **Integrates with Kaggle**: Supports loading datasets directly from Kaggle in a variety of formats.

Check out the [documentation](https://cgarciae.github.io/dataget/) for the list of available datasets.
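
The memory-efficient mode described in the feature list can be pictured with a minimal sketch. It assumes the metadata arrives as a `pandas` dataframe with one row per sample and a `filename` column pointing at `.npy` files on disk. The column names, the file format, and the `iter_batches` helper are hypothetical and not part of dataget's API.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def iter_batches(df_meta, batch_size):
    """Yield (X, y) batches, loading only `batch_size` files at a time."""
    for start in range(0, len(df_meta), batch_size):
        rows = df_meta.iloc[start:start + batch_size]
        # Hypothetical columns: `filename` (path to a .npy file) and `label`.
        X = np.stack([np.load(path) for path in rows["filename"]])
        y = rows["label"].to_numpy()
        yield X, y


# Build a tiny stand-in dataset on disk so the sketch runs end to end.
tmp_dir = Path(tempfile.mkdtemp())
records = []
for i in range(5):
    path = tmp_dir / f"sample_{i}.npy"
    np.save(path, np.full((4, 4), i, dtype=np.float32))
    records.append({"filename": str(path), "label": i % 2})
df_meta = pd.DataFrame(records)

# Only one batch of arrays is held in memory at any time.
batch_shapes = [X.shape for X, _ in iter_batches(df_meta, batch_size=2)]
print(batch_shapes)
```

Only the small metadata table stays resident, so the same loop works whether the dataset holds five files or five million.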

## Getting Started

In dataget, you only have to do two things:

* Instantiate a `Dataset` from our collection.

* Call the `get` method to download the data to disk and load it into memory.

Both are usually done in one line:

```python
import dataget

X_train, y_train, X_test, y_test = dataget.image.mnist().get()
```

This example downloads the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset to `./data/image_mnist` and loads it as `numpy` arrays.
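
Because the loader returns plain `numpy` arrays, they can be prepared for most frameworks without any dataget-specific code. The sketch below assumes the usual MNIST layout (images of shape `(N, 28, 28)` with `uint8` pixels and integer labels from 0 to 9), which is not stated here. Random stand-in arrays are used so it runs without downloading anything, and `prepare` is a hypothetical helper.

```python
import numpy as np


def prepare(X, y, num_classes=10):
    """Flatten images, scale pixels to [0, 1], and one-hot encode labels."""
    X_flat = X.reshape(len(X), -1).astype(np.float32) / 255.0
    y_onehot = np.eye(num_classes, dtype=np.float32)[y]
    return X_flat, y_onehot


# Stand-in arrays with the assumed MNIST layout; in practice these would
# come from `dataget.image.mnist().get()`.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)
y_train = rng.integers(0, 10, size=8)

X_flat, y_onehot = prepare(X_train, y_train)
print(X_flat.shape, y_onehot.shape)
```

The resulting arrays can be passed to any framework that accepts `numpy` input.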

### Kaggle Support

Kaggle [promotes](https://www.kaggle.com/docs/datasets#supported-file-types) the use of `csv` files and `dataget` loves it! With dataget you can quickly download any dataset from the platform and have immediate access to the data:

```python
import dataget

df_train, df_test = dataget.kaggle(dataset="cristiangarcia/pointcloudmnist2d").get(
    files=["train.csv", "test.csv"]
)
```

To start using Kaggle datasets, just make sure you have properly installed and configured the [Kaggle API](https://github.com/Kaggle/kaggle-api). In the future, we want to expand Kaggle support in the following ways:

* Be able to load any file that `numpy` or `pandas` can read.

* Have generic support for other types of datasets like images, audio, video, etc.

    * e.g. `dataget.data.kaggle(..., type="image").get(...)`
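
Since Kaggle loading depends on a configured Kaggle API, a quick credentials check can save a failed download. The sketch below relies on the two mechanisms the Kaggle API documents: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` token in `~/.kaggle/`. The `kaggle_credentials_found` helper is hypothetical and not part of dataget or the Kaggle API, and it only checks that credentials are present, not that they are valid.

```python
import os
import tempfile
from pathlib import Path


def kaggle_credentials_found(home=None, env=None):
    """Return True if Kaggle credentials appear to be configured."""
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env
    # Option 1: both environment variables are set.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: the token file downloaded from the Kaggle account page.
    return (home / ".kaggle" / "kaggle.json").is_file()


# Demonstration against an empty temporary home directory.
empty_home = tempfile.mkdtemp()
print(kaggle_credentials_found(home=empty_home, env={}))
print(kaggle_credentials_found(
    home=empty_home,
    env={"KAGGLE_USERNAME": "user", "KAGGLE_KEY": "key"},
))
```

Calling `kaggle_credentials_found()` with no arguments checks the real home directory and environment.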

## Installation

```bash
pip install dataget
```

## Contributing

Adding a new dataset is easy! Read our guide on [Creating a Dataset](https://cgarciae.github.io/dataget/dataset/) if you are interested in contributing a dataset.

## License

MIT License