Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/jovianhq/opendatasets

A Python library for downloading datasets from Kaggle, Google Drive, and other online sources.
https://github.com/jovianhq/opendatasets

data-science datasets machine-learning python

Last synced: 3 days ago
JSON representation

A Python library for downloading datasets from Kaggle, Google Drive, and other online sources.

Awesome Lists containing this project

README

        

# opendatasets

`opendatasets` is a Python library for downloading datasets from online sources like [Kaggle](https://www.kaggle.com/datasets) and Google Drive using a simple Python command.

### Installation

Install the library using `pip`:

```
pip install opendatasets --upgrade
```

### Usage - Downloading a dataset

Datasets can be downloaded within a Jupyter notebook or Python script using the `opendatasets.download` helper function. Here's some sample code for downloading the [US Elections Dataset](https://www.kaggle.com/tunguz/us-elections-dataset):

```
import opendatasets as od
dataset_url = 'https://www.kaggle.com/tunguz/us-elections-dataset'
od.download('https://www.kaggle.com/tunguz/us-elections-dataset')
```

`dataset_url` can also point to a public Google Drive link or a raw file URL.

### Kaggle Credentials

`opendatasets` uses the [Kaggle Official API](https://github.com/Kaggle/kaggle-api) for donwloading dataset from Kaggle. Follow these steps to find your API credentials:

1. Go to [https://kaggle.com/me/account](https://kaggle.com/me/account) (sign in if required).

2. Scroll down to the "API" section and click "Create New API Token". This will download a file `kaggle.json` with the following contents:

```
{"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_KEY"}
```

3. When you run `opendatsets.download`, you will be asked to enter your username & Kaggle API, which you can get from the file downloaded in step 2.

Note that you need to download the `kaggle.json` file only once. You can also place the `kaggle.json` file in the same directory as the Jupyter notebook, and the credentials will be read automatically.

**IMPORTANT NOTE**: If you're downloading a competition dataset, make sure to first accept the rules of the competition.

### Some interesting datasets

You can find interesting datasets on Kaggle: https://www.kaggle.com/datasets

*You can also create a new dataset on Kaggle by uploading a CSV file here: https://www.kaggle.com/datasets?new=true (make sure to keep your dataset public, otherwise it will not be downloadable)*

- Video Games sales: https://www.kaggle.com/gregorut/videogamesales
- World University Rankings: https://www.kaggle.com/mylesoneill/world-university-rankings
- Netflix Tv shows and Movies: https://www.kaggle.com/shivamb/netflix-shows/notebooks
- StackOverflow Developer Survey: https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey
- Google Play Store Android Apps Data: https://www.kaggle.com/lava18/google-play-store-apps
- Indian Stock Market Data: https://www.kaggle.com/rohanrao/nifty50-stock-market-data
- Indian Air Quality: https://www.kaggle.com/rohanrao/air-quality-data-in-india
- Worldwide Covid-19 Cases: https://www.kaggle.com/imdevskp/corona-virus-report
- USA Covid-19 Cases: https://www.kaggle.com/sudalairajkumar/covid19-in-usa
- US Election Results (2012): https://www.kaggle.com/tunguz/us-elections-dataset
- US Stock Market: https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs/
- Crop production in India: https://www.kaggle.com/srinivas1/agricuture-crops-production-in-india
- Agricultural raw material prices: https://www.kaggle.com/kianwee/agricultural-raw-material-prices-19902020
- Agricultural land values: https://www.kaggle.com/jmullan/agricultural-land-values-19972017
- Digital payments in India: https://www.kaggle.com/lazycipher/upi-usage-statistics-aug16-to-feb20
- US Unemployment Rate Data: https://www.kaggle.com/jayrav13/unemployment-by-county-us
- India Road accident Data: https://community.data.gov.in/statistics-of-road-accidents-in-india/
- Data Science Jobs Data:
- https://www.kaggle.com/sl6149/data-scientist-job-market-in-the-us
- https://www.kaggle.com/jonatancr/data-science-jobs-around-the-world
- https://www.kaggle.com/rkb0023/glassdoor-data-science-jobs
- Youtube Trending Videos: https://www.kaggle.com/datasnaek/youtube-new
- Asteroid Dataset: https://www.kaggle.com/sakhawat18/asteroid-dataset
- Solar flares Data: https://www.kaggle.com/khsamaha/solar-flares-rhessi
- F-1 Race Data: https://www.kaggle.com/cjgdev/formula-1-race-data-19502017
- Automobile Insurance: https://www.kaggle.com/aashishjhamtani/automobile-insurance
- PUBG video game matches: https://www.kaggle.com/skihikingkevin/pubg-match-deaths
- CounterStrike GO (video game)
- https://www.kaggle.com/mateusdmachado/csgo-professional-matches
- https://www.kaggle.com/skihikingkevin/csgo-matchmaking-damage
- Dota 2 (video game): https://www.kaggle.com/devinanzelmo/dota-2-matches
- Cricket One-Day Internationals Data: https://www.kaggle.com/jaykay12/odi-cricket-matches-19712017
- Cricket Indian Premier League Data: https://www.kaggle.com/nowke9/ipldata
- Basketball (NCAA): https://www.kaggle.com/ncaa/ncaa-basketball
- Basketball NBA Players Stats: https://www.kaggle.com/ncaa/ncaa-basketball
- Football datasets:
- https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017
- https://www.kaggle.com/abecklas/fifa-world-cup
- https://www.kaggle.com/egadharmawan/uefa-champion-league-final-all-season-19552019
- Hotel Booking Demand: https://www.kaggle.com/jessemostipak/hotel-booking-demand
- New York Airbnb listings: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

Other sources to look for datasets:
- [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php)
- [awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets)
- [Google Dataset Search](https://datasetsearch.research.google.com)

*If you use an external source other than Kaggle, you'll create a new dataset on Kaggle by uploading a CSV file here: https://www.kaggle.com/datasets?new=true (make sure to keep your dataset public, otherwise it will not be downloadable using `opendatasets`)*

## Curated Datasets

`opendatasets` also provides some curated datsets that you can download by passing the Dataset ID to `opendatasets.download`. Here's an example:

```
import opendatasets
opendatasets.download('stackoverflow-developer-survey-2020')
```

The following datasets are available for download.


Dataset ID
Description
Source


stackoverflow-developer-survey-2020
Stack Overflow Developer Survey 2020

Stack Overflow



owid-covid-19-latest
Covid-19 Stats by Our World in Data

Our World in Data



state-of-javascript-2016
State of Javascript Annual Survey 2016

StateOfJS



state-of-javascript-2017
State of Javascript Annual Survey 2017

StateOfJS



state-of-javascript-2018
State of Javascript Annual Survey 2018

StateOfJS



state-of-javascript-2019
State of Javascript Annual Survey 2019

StateOfJS



countries-languages-spoken
Languages Spoken in Different Countries

Infoplease

More datasets will be added soon..

## Contributing

This is an open source project and we welcome contributions.

### Local Development Setup

1. Clone the repository:

```
git clone https://github.com/JovianML/opendatasets.git
```

2. Setup the Python environment for development

```
conda create -n opendatasets python=3.5
conda activate opendatasets
pip install -r requirements.txt
```

3. Open up the project in VS code and make your changes. Make sure to install the Python Extension for VS Code and select the `opendatasets` conda environment.

This package is developed and maintained by the [Jovian](https://www.jovian.ai) team.