https://github.com/mukhopadhyay/opendata

# Open Data ❤️


[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](http://makeapullrequest.com)

[![forthebadge](https://forthebadge.com/images/badges/built-with-love.svg)](https://forthebadge.com) [![forthebadge](https://forthebadge.com/images/badges/made-with-markdown.svg)](https://forthebadge.com)





> **Open Data** is the idea that some data should be freely available
> to everyone to use and republish as they wish, without restrictions
> from copyright, patents or other mechanisms of control.
> [Wikipedia](https://en.wikipedia.org/wiki/Open_data)



## Index
- [Index](#index)
- [📊 OpenData Websites](#opendata-websites)
- [📚 NLP Datasets](#nlp-datasets)
- [🖼️ Image Datasets](#image-datasets)
- [🎵 Audio Datasets](#audio-datasets)
- [Open Government Sites](https://github.com/Mukhopadhyay/OpenData/blob/master/OPEN_GOV.md)

---



## 📊 OpenData Websites

|Name|Description|URL|
|:---|:----------|:--|
|**CDC Open Data**|The Centers for Disease Control and Prevention (CDC) is the national public health agency of the United States.|[data.cdc.gov](https://data.cdc.gov/)|
|**Data.world**|Data.world is an enterprise data catalog for the modern data stack.|[data.world](https://data.world/)|
|**Five Thirty Eight**|[FiveThirtyEight](https://fivethirtyeight.com/) is a website using data and evidence to advance public knowledge. This is their open data portal sharing the data and code behind some of their articles and graphics.|[data.fivethirtyeight.com](https://data.fivethirtyeight.com/)|
|**GENESIS-ONLINE**|The German Federal Statistical Office is the institution to contact first for official data on the society, the economy, the environment and the state.|[www-genesis.destatis.de](https://www-genesis.destatis.de/genesis/online/data)|
|**Kaggle**|[Kaggle](https://www.kaggle.com/), a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.|[www.kaggle.com](https://www.kaggle.com/datasets)|
|**Project Gutenberg**|[Project Gutenberg](https://www.gutenberg.org/) is an online library of free eBooks. Books are available in almost all common file formats.|[www.gutenberg.org](https://www.gutenberg.org/)|
|**Registry of Open Data on AWS**|This registry exists to help people discover and share datasets that are available via AWS resources.|[registry.opendata.aws](https://registry.opendata.aws/)|
|**Science On a Sphere**|[Science On a Sphere](https://sos.noaa.gov/) is a room-sized, global display system that projects visualizations of planetary data onto a six-foot-diameter sphere to help illustrate Earth System science to people of all ages.|[sos.noaa.gov](https://sos.noaa.gov/catalog/datasets/)|
|**Stanford Large Network Dataset collection**|The SNAP library has been actively developed since 2004 and is growing organically as a result of the group's research in the analysis of large social and information networks.|[snap.stanford.edu](https://snap.stanford.edu/data/)|
|**Stanford Open Data**|Portal for Stanford Open Data|[stanfordopendata.org](https://stanfordopendata.org/#/datasets)|
|**The World Bank**|The World Bank is an international financial institution that provides loans and grants to the governments of low- and middle-income countries for the purpose of pursuing capital projects.|[datacatalog.worldbank.org](https://datacatalog.worldbank.org/home)|
|**U.S. Census Bureau**|The United States census is legally mandated by the U.S. Constitution.|[data.census.gov](https://data.census.gov/cedsci/)|
|**U.S. Department of Commerce**|Open data from the U.S. Department of Commerce.|[data.commerce.gov](https://data.commerce.gov/browse)|
|**U.S. Education Open Data**|Data profiles from the U.S. Department of Education.|[data.ed.gov](https://data.ed.gov/dataset)|
|**U.S. Transportation Open Data**|Open data from the U.S. Department of Transportation.|[data.transportation.gov](https://data.transportation.gov/)|
|**UCI ML Repository**|The UCI ML repository is a collection of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.|[archive.ics.uci.edu](http://archive.ics.uci.edu/ml/index.php)|
|**UNICEF**|UNICEF, also known as the United Nations Children's Fund, is a United Nations agency responsible for providing humanitarian and developmental aid to children worldwide.|[data.unicef.org](https://data.unicef.org/)|
|**World Health Organization**|The World Health Organization (WHO) is a specialized agency of the United Nations responsible for international public health.|[www.who.int](https://www.who.int/data/gho/data/indicators)|
|**Yelp**|The Yelp Open Dataset is a subset of their businesses, reviews, and user data for use in personal, educational and academic purposes.|[www.yelp.com](https://www.yelp.com/dataset)|
---
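
Every table in this list uses the same three-column row layout (`|**Name**|Description|[label](url)|`), so the entries can be read into records with a small parser. The following is a minimal sketch under that assumption, not a general Markdown table parser; `Entry`, `ROW`, and `parse_row` are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One dataset entry; field names are illustrative, not part of this list."""
    name: str
    description: str
    url: str


# Assumes the row shape used in this README: |**Name**|Description|[label](url)|
ROW = re.compile(
    r"^\|\*\*(?P<name>.+?)\*\*\|"         # bold name in the first cell
    r"(?P<description>.*)"                # description, may contain its own links
    r"\|\[[^\]]*\]\((?P<url>[^)]+)\)\|$"  # final cell: a single Markdown link
)


def parse_row(line: str) -> Optional[Entry]:
    """Return an Entry for a data row, or None for header and separator rows."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    return Entry(match["name"], match["description"].strip(), match["url"])


sample = (
    "|**Stanford Open Data**|Portal for Stanford Open Data|"
    "[stanfordopendata.org](https://stanfordopendata.org/#/datasets)|"
)
entry = parse_row(sample)
print(entry)
```

Header rows such as `|Name|Description|URL|` and the `|:---|...|` separator do not match the pattern, so `parse_row` returns `None` for them and they can be skipped when iterating over the file.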



[⬆️ Go back to index](#index)

## 📚 NLP Datasets

|Name|Description|URL|
|:---|:----------|:--|
|**20 Newsgroups**|A collection of roughly 20,000 documents covering 20 newsgroups and their subjects.|[qwone.com](http://qwone.com/~jason/20Newsgroups/)|
|**Amazon question/answer data**|This dataset contains question and answer data from Amazon, totaling around 1.4 million answered questions.|[jmcauley.ucsd.edu](http://jmcauley.ucsd.edu/data/amazon/qa/)|
|**ArXiv**|This massive 270 GB dataset features all arXiv research papers in full text.|[arxiv.org](https://arxiv.org/help/bulk_data_s3)|
|**Enron Email Dataset**|This dataset contains 500,000+ email messages from Enron officials and is especially useful for anyone looking to expand their understanding of the inner workings of email tools.|[www.cs.cmu.edu](https://www.cs.cmu.edu/~./enron/)|
|**Google Books Ngrams**|A data set containing Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License.|[aws.amazon.com](https://aws.amazon.com/datasets/google-books-ngrams/)|
|**IMDB Reviews**|This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets.|[ai.stanford.edu](https://ai.stanford.edu/~amaas/data/sentiment/)|
|**Machine Translation of Various Languages**|This dataset consists of training data for four European languages|[statmt.org](http://statmt.org/wmt18/index.html)|
|**Multi-Domain Sentiment Dataset**|A massive variety of Amazon products along with their corresponding reviews|[www.cs.jhu.edu](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/)|
|**Reuters News Dataset**|Originally appearing in 1987, this dataset has been labeled, indexed, and compiled for use in machine learning.|[archive.ics.uci.edu](https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection)|
|**Sentiment140**|Sentiment140 allows you to discover the sentiment of a brand, product or topic on Twitter|[help.sentiment140.com](http://help.sentiment140.com/for-students/)|
|**Stanford Sentiment Treebank**|Dataset of 10,000+ Rotten Tomatoes reviews for training a model to identify sentiment in longer phrases.|[nlp.stanford.edu](https://nlp.stanford.edu/sentiment/code.html)|
|**The WikiQA Corpus**|This publicly available Q&A dataset was initially compiled to aid open-domain question answering research.|[www.microsoft.com](https://www.microsoft.com/en-us/download/details.aspx?id=52419&from=https%3A%2F%2Fresearch.microsoft.com%2Fapps%2Fmobile%2Fdownload.aspx%3Fp%3D4495da01-db8c-4041-a7f6-7984a4f6a905)|
|**Twenty Newsgroups Dataset**|This data set consists of 20000 messages taken from 20 newsgroups|[archive.ics.uci.edu](https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups)|
|**Twitter US Airline Sentiment**|Analyze how travelers in February 2015 expressed their feelings on Twitter|[www.kaggle.com](https://www.kaggle.com/crowdflower/twitter-airline-sentiment/)|
|**UCI's Spambase Data set**|This dataset was created by a team at HP (Hewlett-Packard) to help create a spam filter. It contains a litany of emails previously labeled as spam by users.|[archive.ics.uci.edu](https://archive.ics.uci.edu/ml/datasets/Spambase)|
|**Wikipedia Links Data**|This Google dataset contains approximately 13 million documents, each containing at least one hyperlink to an English Wikipedia page.|[code.google.com](https://code.google.com/archive/p/wiki-links/downloads)|
|**WordNet**|WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.|[wordnet.princeton.edu](https://wordnet.princeton.edu/)|
|**Yelp Open Dataset**|This Yelp dataset features 8.5M+ reviews of over 160,000 businesses. It also has 200,000+ pictures and spans across 8 major metropolitan areas.|[www.yelp.com](https://www.yelp.com/dataset)|
|**YouTubers-saying-things**|Dataset containing subtitles from popular YouTubers' videos.|[www.kaggle.com](https://www.kaggle.com/praneshmukhopadhyay/youtubers-saying-things)|
---
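
Several entries above, Google Books Ngrams in particular, are built around n-gram counts. As a rough illustration of what such a count is, here is a minimal sketch. It is not the format or pipeline used by any dataset in the table; the function names are hypothetical, and the lowercase-and-split-on-whitespace tokenization is a simplifying assumption.

```python
from collections import Counter
from typing import Iterator, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(text: str, n: int) -> Counter:
    """Count n-grams in text using naive whitespace tokenization (an assumption)."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


bigram_counts = count_ngrams("to be or not to be", 2)
print(bigram_counts.most_common(1))  # [(('to', 'be'), 2)]
```

Real corpora typically apply richer tokenization and normalization before counting, so counts produced this way should not be expected to match published n-gram tables.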



[⬆️ Go back to index](#index)

## 🖼️ Image Datasets

|Name|Description|URL|
|:---|:----------|:--|
|**CIFAR-10**|The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.|[www.cs.toronto.edu](https://www.cs.toronto.edu/~kriz/cifar.html)|
|**COCO (Common Objects in Context)**|COCO is a large-scale object detection, segmentation, and captioning dataset.|[cocodataset.org](https://cocodataset.org/#home)|
|**Fashion-MNIST**|Fashion-MNIST is a dataset consisting of training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.|[github.com](https://github.com/zalandoresearch/fashion-mnist)|
|**ImageNet**|ImageNet is an image database organized according to the [WordNet](https://wordnet.princeton.edu/) hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.|[www.image-net.org](https://www.image-net.org/index.php)|
|**Open Images Dataset**|Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives.|[storage.googleapis.com](https://storage.googleapis.com/openimages/web/index.html)|
|**SVHN (Street View House Number)**|SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting.|[ufldl.stanford.edu](http://ufldl.stanford.edu/housenumbers/)|
|**VisualQA**|VQA is a new dataset containing open-ended questions about images. The questions require an understanding of vision, language and commonsense knowledge to answer.|[visualqa.org](https://visualqa.org/)|
---
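
The figures quoted for CIFAR-10 and Fashion-MNIST in the table above can be turned into a quick sanity check before downloading. The following is a minimal sketch, not part of either dataset's tooling: the `ImageDatasetSpec` class and its fields are hypothetical, and the channel counts (3 for color, 1 for grayscale) and one byte per channel are assumptions about uncompressed 8-bit storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageDatasetSpec:
    """Hypothetical container for the figures quoted in the table."""
    name: str
    train: int     # number of training images
    test: int      # number of test images
    height: int
    width: int
    channels: int  # assumed: 3 for color, 1 for grayscale
    classes: int

    @property
    def total(self) -> int:
        return self.train + self.test

    @property
    def raw_bytes(self) -> int:
        # Assumes one unsigned byte per pixel per channel, no compression.
        return self.total * self.height * self.width * self.channels


cifar10 = ImageDatasetSpec("CIFAR-10", 50_000, 10_000, 32, 32, 3, 10)
fashion_mnist = ImageDatasetSpec("Fashion-MNIST", 60_000, 10_000, 28, 28, 1, 10)

for spec in (cifar10, fashion_mnist):
    per_class = spec.total // spec.classes
    print(f"{spec.name}: {spec.total} images, "
          f"{per_class} per class on average, "
          f"{spec.raw_bytes / 1e6:.1f} MB raw")
```

For CIFAR-10 the per-class figure works out to the 6000 images per class stated in the table. The downloadable archives may differ in size from the raw estimate because of compression and label metadata.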



[⬆️ Go back to index](#index)

## 🎵 Audio Datasets

|Name|Description|URL|
|:---|:----------|:--|
|**Ballroom**|This dataset provides information on ballroom dancing. Characteristic excerpts of many dance styles are provided in real audio format. Their tempi are also available.|[mtg.upf.edu](http://mtg.upf.edu/ismir2004/contest/tempoContest/node5.html)|
|**FMA (Free Music Archive)**|A Dataset for Music Analysis|[github.com](https://github.com/mdeff/fma)|
|**Free Spoken Digit Dataset**|A free audio dataset of spoken digits. Think MNIST for audio.|[github.com](https://github.com/Jakobovski/free-spoken-digit-dataset)|
|**LibriSpeech**|LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey.|[www.openslr.org](http://www.openslr.org/12/)|
|**Urban Sound 8K Dataset**|This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes.|[urbansounddataset.weebly.com](https://urbansounddataset.weebly.com/urbansound8k.html)|
|**VoxCeleb**|This is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.|[www.robots.ox.ac.uk](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)|
---
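
The durations and sample rates quoted above give a rough sense of scale. The following sketch is only back-of-the-envelope arithmetic on the table's figures; the constant and function names are hypothetical, and "approximately 1000 hours" is treated as exactly 1000 for illustration.

```python
# Figures taken from the table; 1000 hours is an approximation in the source.
LIBRISPEECH_HOURS = 1000
LIBRISPEECH_SAMPLE_RATE_HZ = 16_000

URBANSOUND_CLIPS = 8732
URBANSOUND_MAX_CLIP_SECONDS = 4  # excerpts are at most 4 s long


def total_samples(hours: float, sample_rate_hz: int) -> int:
    """Number of audio samples in `hours` of single-channel audio."""
    return int(hours * 3600 * sample_rate_hz)


def max_total_hours(clips: int, max_clip_seconds: float) -> float:
    """Upper bound on total duration if every clip had the maximum length."""
    return clips * max_clip_seconds / 3600


librispeech_samples = total_samples(LIBRISPEECH_HOURS, LIBRISPEECH_SAMPLE_RATE_HZ)
urbansound_hours = max_total_hours(URBANSOUND_CLIPS, URBANSOUND_MAX_CLIP_SECONDS)

print(f"LibriSpeech: about {librispeech_samples:,} samples")
print(f"UrbanSound8K: at most {urbansound_hours:.1f} hours of audio")
```

The UrbanSound8K number is an upper bound only, since the table states that excerpts are up to 4 seconds long rather than exactly 4 seconds.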