https://github.com/cgarciae/dataget
A framework-agnostic datasets library for Machine Learning research and education.
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/cgarciae/dataget
- Owner: cgarciae
- License: mit
- Created: 2017-04-10T18:49:49.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T12:21:50.000Z (about 2 years ago)
- Last Synced: 2024-09-18T05:57:46.869Z (4 months ago)
- Language: Python
- Homepage: https://cgarciae.github.io/dataget
- Size: 1.19 MB
- Stars: 18
- Watchers: 8
- Forks: 7
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# Dataget
Dataget is an easy-to-use, framework-agnostic dataset library that gives you quick access to a collection of Machine Learning datasets through a simple API.
Main features:
* **Minimal**: Downloads entire datasets with just one line of code.
* **Framework Agnostic**: Loads data as `numpy` arrays or `pandas` dataframes which can be easily used with the majority of Machine Learning frameworks.
* **Transparent**: By default stores the data in your current project so you can easily inspect it.
* **Memory Efficient**: When a dataset doesn't fit in memory it will return metadata instead so you can iteratively load it.
* **Integrates with Kaggle**: Supports loading datasets directly from Kaggle in a variety of formats.

Check out the [documentation](https://cgarciae.github.io/dataget/) for the list of available datasets.
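
The **Memory Efficient** feature above means that for large datasets you receive metadata and load the samples yourself. A minimal sketch of that pattern, assuming the metadata can be treated as a plain sequence of records (for a `pandas` dataframe, something like `df.to_dict("records")`). The `iter_batches` helper, the file names, and the loader are hypothetical stand-ins and are not part of dataget:

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_batches(
    records: Sequence[T],
    load: Callable[[T], R],
    batch_size: int = 32,
) -> Iterator[List[R]]:
    """Yield lists of loaded samples, at most `batch_size` at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        # Only this slice is loaded, so memory use is bounded by batch_size.
        yield [load(record) for record in records[start:start + batch_size]]


# Stand-in metadata (assumption): real records would describe files on disk.
metadata = [f"sample_{i}.png" for i in range(5)]

# Stand-in loader (assumption): a real one would read and decode each file.
batches = list(iter_batches(metadata, load=str.upper, batch_size=2))
print([len(batch) for batch in batches])  # prints [2, 2, 1]
```

Because `iter_batches` is a generator, each batch is loaded only when it is requested, so a training loop can consume it without holding the whole dataset in memory.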
## Getting Started
In dataget you just have to do two things:
* Instantiate a `Dataset` from our collection.
* Call the `get` method to download the data to disk and load it into memory.

Both are usually done in one line:
```python
import dataget

X_train, y_train, X_test, y_test = dataget.image.mnist().get()
```

This example downloads the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset to `./data/image_mnist` and loads it as `numpy` arrays.
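
Since the data is stored inside your project, you can inspect what was downloaded with nothing but the standard library. A small sketch, assuming only the `./data/image_mnist` location mentioned above; the `list_dataset_files` helper is hypothetical and not part of dataget, and the actual file layout inside that folder is not assumed:

```python
from pathlib import Path
from typing import List, Tuple


def list_dataset_files(root: str) -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under `root`."""
    base = Path(root)
    if not base.is_dir():
        return []  # nothing has been downloaded to this location yet
    return sorted(
        (path.relative_to(base).as_posix(), path.stat().st_size)
        for path in base.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    # Directory taken from the MNIST example; prints nothing if it is missing.
    for name, size in list_dataset_files("data/image_mnist"):
        print(f"{size:>12}  {name}")
```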
### Kaggle Support
Kaggle [promotes](https://www.kaggle.com/docs/datasets#supported-file-types) the use of `csv` files and `dataget` loves it! With dataget you can quickly download any dataset from the platform and have immediate access to the data:
```python
import dataget

df_train, df_test = dataget.kaggle(dataset="cristiangarcia/pointcloudmnist2d").get(
    files=["train.csv", "test.csv"]
)
```
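
The Kaggle snippet above only works once the Kaggle API has credentials available. As a rough pre-flight check, the sketch below looks in the places the Kaggle API client commonly documents: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file under `~/.kaggle` (or under `KAGGLE_CONFIG_DIR` if set). The `kaggle_credentials_found` helper is hypothetical and not part of dataget or the Kaggle client; it does not validate the credentials, and newer client versions may also look elsewhere:

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def kaggle_credentials_found(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> bool:
    """Best-effort check for Kaggle API credentials; does not validate them."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # Option 1: credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file, by default under ~/.kaggle,
    # or under the directory named by KAGGLE_CONFIG_DIR (assumed locations).
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or home / ".kaggle")
    return (config_dir / "kaggle.json").is_file()


print("Kaggle credentials found:", kaggle_credentials_found())
```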
To start using Kaggle datasets, just make sure you have properly installed and configured the [Kaggle API](https://github.com/Kaggle/kaggle-api). In the future we want to expand Kaggle support in the following ways:

* Be able to load any file that `numpy` or `pandas` can read.
* Have generic support for other types of datasets like images, audio, video, etc.
    * e.g. `dataget.data.kaggle(..., type="image").get(...)`

## Installation
```bash
pip install dataget
```

## Contributing
Adding a new dataset is easy! Read our guide on [Creating a Dataset](https://cgarciae.github.io/dataget/dataset/) if you are interested in contributing a dataset.

## License
MIT License