https://github.com/iamaziz/PyDataset
Instant access to many datasets in Python.
- Host: GitHub
- URL: https://github.com/iamaziz/PyDataset
- Owner: iamaziz
- License: mit
- Created: 2016-01-31T20:43:28.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2022-03-25T16:24:01.000Z (over 2 years ago)
- Last Synced: 2024-04-25T16:03:16.412Z (7 months ago)
- Topics: data-science, datasets, python
- Language: Python
- Homepage:
- Size: 14.9 MB
- Stars: 932
- Watchers: 34
- Forks: 86
- Open Issues: 13
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
## PyDataset
[![PyPI version](https://badge.fury.io/py/pydataset.svg)](http://badge.fury.io/py/pydataset)

Provides instant access to many datasets right from Python (in pandas DataFrame structure).
### What?
The idea is simple. Many datasets are available, but they are scattered across different places on the web.
Is there a quick way (in Python) to access them instantly, without the hassle of searching, downloading, and reading them in?
PyDataset tries to address that question :)

### Usage:
Start by importing `data()`:
```python
from pydataset import data
```
- To load a dataset:
```python
titanic = data('titanic')
```
- To display the documentation of a dataset:
```python
data('titanic', show_doc=True)
```
- To see the available datasets:
```python
data()
```

That's it.
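
As a small illustration of working with what `data()` returns, the helper below summarizes a dataset by name. It is only a sketch: `describe_dataset` and its `loader` parameter are hypothetical names introduced here, not part of `pydataset`, and the helper assumes only that the loaded object exposes `.shape` and `.columns` the way a pandas DataFrame does.

```python
def describe_dataset(name, loader=None):
    """Return (n_rows, n_cols, column_names) for a dataset loaded by name.

    `loader` is a hypothetical hook: any callable that maps a dataset name
    to a pandas-like DataFrame. It defaults to pydataset's `data`.
    """
    if loader is None:
        # Imported lazily so the helper can be tried out with a stand-in
        # loader even where pydataset is not installed.
        from pydataset import data as loader

    df = loader(name)
    n_rows, n_cols = df.shape
    return n_rows, n_cols, list(df.columns)
```

With `pydataset` installed, `describe_dataset('titanic')` would return the row count, column count, and column names of the same table that `data('titanic')` loads.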
See more [examples](examples).

### Why?
In `R`, there is a very easy and immediate way to access multiple statistical datasets
with almost no effort. All it takes is one line: `> data(dataset_name)`.
This makes life easier for quick prototyping and testing.
Well, I am jealous that Python does not have similar functionality.
Thus, the aim of `pydataset` is to fill that gap.

Currently, `pydataset` has about 757 (mostly numerical) datasets, which are based on `RDatasets`.
In the future, I plan to scale it to include a larger set of datasets.
For example,
1) include textual data for NLP-related tasks, and
2) allow adding a new dataset to the in-module repository.

### Installation:
`$ pip install pydataset`
#### Uninstall:
- `$ pip uninstall pydataset`
- `$ rm -rf $HOME/.pydataset`

### Changelog
**0.2.0**
- Add dataset search by name similarity.
- Example:

```python
>>> data('heat')
Did you mean:
Wheat, heart, Heating, Yeast, eidat, badhealth, deaths, agefat, hla, heptathlon, azt
```

**0.1.1**
- Fix: add support for Windows and fix file paths (issue #1).
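
The name-similarity search added in 0.2.0 can be approximated with the standard library. The README does not say how `pydataset` scores similarity, so the sketch below swaps in `difflib.get_close_matches` as a stand-in; `suggest_datasets`, `limit`, and `cutoff` are hypothetical names chosen for illustration, and the results will not necessarily match the package's own suggestions.

```python
from difflib import get_close_matches


def suggest_datasets(query, names, limit=5, cutoff=0.5):
    """Return up to `limit` dataset names that look similar to `query`.

    Assumption: difflib's ratio is used as the similarity measure; this is
    not taken from pydataset's source.
    """
    # Compare case-insensitively, but report names in their original casing.
    by_lower = {}
    for name in names:
        by_lower.setdefault(name.lower(), name)

    matches = get_close_matches(
        query.lower(), list(by_lower), n=limit, cutoff=cutoff
    )
    return [by_lower[m] for m in matches]
```

For example, given a list of names that includes `Wheat`, `heart`, `Heating`, and `Yeast`, calling `suggest_datasets('heat', names)` should surface those candidates while skipping unrelated names. Raising `cutoff` makes the suggestions stricter.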
### Dependency:
- pandas

### Miscellaneous:
- Tested on OSX and Linux (Debian).
- Supports both Python 2 (2.7.11) and Python 3 (3.5.1).

#### TODO:
- add textual datasets (e.g. NLTK stuff).
- add sample generators.

#### Thanks to:
- [RDatasets](https://github.com/vincentarelbundock/Rdatasets): R's datasets collection.