https://github.com/mukhopadhyay/opendata

Open Data ❤️
data data-science datasets deep-learning kaggle kaggle-dataset machine-learning open-source opendata
Host: GitHub
URL: https://github.com/mukhopadhyay/opendata
Owner: Mukhopadhyay
License: mit
Created: 2021-11-28T19:37:56.000Z (about 3 years ago)
Default Branch: master
Last Pushed: 2022-05-06T18:38:13.000Z (over 2 years ago)
Last Synced: 2024-10-13T13:24:25.288Z (4 months ago)
Topics: data, data-science, datasets, deep-learning, kaggle, kaggle-dataset, machine-learning, open-source, opendata
Language: Python
Homepage:
Size: 62.5 KB
Stars: 3
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

README

        


# Open Data ❤️





[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](http://makeapullrequest.com)

[![forthebadge](https://forthebadge.com/images/badges/built-with-love.svg)](https://forthebadge.com) [![forthebadge](https://forthebadge.com/images/badges/made-with-markdown.svg)](https://forthebadge.com)











> **Open Data** is the idea that some data should be freely available
> to everyone to use and republish as they wish, without restrictions
> from copyright, patents or other mechanisms of control.
>
> [Wikipedia](https://en.wikipedia.org/wiki/Open_data)








## Index

- [Index](#index)

- [📊 OpenData Websites](#opendata-websites)

- [🖼️ Image Datasets](#image-datasets)

- [📚 NLP Datasets](#nlp-datasets)

- [🎵 Audio Datasets](#audio-datasets)

- [Open Government Sites](https://github.com/Mukhopadhyay/OpenData/blob/master/OPEN_GOV.md)

---







## 📊 OpenData Websites



|Name|Description|URL|
|:---|:----------|:--|
|**CDC Open Data**|The Centers for Disease Control and Prevention (CDC) is the national public health agency of the United States.|[data.cdc.gov](https://data.cdc.gov/)|
|**Data.world**|Data.world is an enterprise data catalog for the modern data stack.|[data.world](https://data.world/)|
|**Five Thirty Eight**|[FiveThirtyEight](https://fivethirtyeight.com/) is a website using data and evidence to advance public knowledge. This is their open data portal sharing the data and code behind some of their articles and graphics.|[data.fivethirtyeight.com](https://data.fivethirtyeight.com/)|
|**GENESIS-ONLINE**|The German Federal Statistical Office is the first point of contact for official data on society, the economy, the environment and the state.|[www-genesis.destatis.de](https://www-genesis.destatis.de/genesis/online/data)|
|**Kaggle**|[Kaggle](https://www.kaggle.com/), a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.|[www.kaggle.com](https://www.kaggle.com/datasets)|
|**Project Gutenberg**|[Project Gutenberg](https://www.gutenberg.org/) is an online library of free eBooks. Books are available in most common file formats.|[www.gutenberg.org](https://www.gutenberg.org/)|
|**Registry of Open Data on AWS**|This registry exists to help people discover and share datasets that are available via AWS resources.|[registry.opendata.aws](https://registry.opendata.aws/)|
|**Science On a Sphere**|[Science On a Sphere](https://sos.noaa.gov/) is a room-sized, global display system that projects visualizations of planetary data onto a six-foot-diameter sphere to help illustrate Earth System science to people of all ages.|[sos.noaa.gov](https://sos.noaa.gov/catalog/datasets/)|
|**Stanford Large Network Dataset Collection**|The SNAP library has been actively developed since 2004 and is growing organically as a result of its authors' research into the analysis of large social and information networks.|[snap.stanford.edu](https://snap.stanford.edu/data/)|
|**Stanford Open Data**|Portal for Stanford Open Data|[stanfordopendata.org](https://stanfordopendata.org/#/datasets)|
|**The World Bank**|The World Bank is an international financial institution that provides loans and grants to the governments of low- and middle-income countries for the purpose of pursuing capital projects.|[datacatalog.worldbank.org](https://datacatalog.worldbank.org/home)|
|**U.S. Census Bureau**|The United States census is legally mandated by the U.S. Constitution.|[data.census.gov](https://data.census.gov/cedsci/)|
|**U.S. Department of Commerce**|Open data from the U.S. Department of Commerce|[data.commerce.gov](https://data.commerce.gov/browse)|
|**U.S. Education Open Data**|Data profiles from the U.S. Department of Education|[data.ed.gov](https://data.ed.gov/dataset)|
|**U.S. Transportation Open Data**|Open data from the U.S. Department of Transportation|[data.transportation.gov](https://data.transportation.gov/)|
|**UCI ML Repository**|The UCI ML repository is a collection of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.|[archive.ics.uci.edu](http://archive.ics.uci.edu/ml/index.php)|
|**UNICEF**|UNICEF, also known as the United Nations Children's Fund, is a United Nations agency responsible for providing humanitarian and developmental aid to children worldwide.|[data.unicef.org](https://data.unicef.org/)|
|**World Health Organization**|The World Health Organization (WHO) is a specialized agency of the United Nations responsible for international public health.|[www.who.int](https://www.who.int/data/gho/data/indicators)|
|**Yelp**|The Yelp Open Dataset is a subset of their businesses, reviews, and user data for use in personal, educational and academic purposes.|[www.yelp.com](https://www.yelp.com/dataset)|

---





[⬆️ Go back to index](#index)





## 📚 NLP Datasets





|Name|Description|URL|
|:---|:----------|:--|
|**20 Newsgroups**|A collection of 20,000 documents covering 20 newsgroups and subjects.|[qwone.com](http://qwone.com/~jason/20Newsgroups/)|
|**Amazon question/answer data**|This dataset contains question and answer data from Amazon, totaling around 1.4 million answered questions.|[jmcauley.ucsd.edu](http://jmcauley.ucsd.edu/data/amazon/qa/)|
|**ArXiv**|This massive 270 GB dataset features all arXiv research papers in full text.|[arxiv.org](https://arxiv.org/help/bulk_data_s3)|
|**Enron Email Dataset**|This dataset contains 500,000+ messages from Enron officials' emails and is especially useful for anyone looking to expand their understanding of the inner workings of email tools.|[www.cs.cmu.edu](https://www.cs.cmu.edu/~./enron/)|
|**Google Books Ngrams**|A data set containing Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License.|[aws.amazon.com](https://aws.amazon.com/datasets/google-books-ngrams/)|
|**IMDB Reviews**|This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets.|[ai.stanford.edu](https://ai.stanford.edu/~amaas/data/sentiment/)|
|**Machine Translation of Various Languages**|This dataset consists of training data for four European languages.|[statmt.org](http://statmt.org/wmt18/index.html)|
|**Multi-Domain Sentiment Dataset**|A wide variety of Amazon products along with their corresponding reviews.|[www.cs.jhu.edu](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/)|
|**Reuters News Dataset**|Originally appearing in 1987, this dataset has been labeled, indexed, and compiled for use in machine learning.|[archive.ics.uci.edu](https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection)|
|**Sentiment140**|Sentiment140 allows you to discover the sentiment of a brand, product or topic on Twitter.|[help.sentiment140.com](http://help.sentiment140.com/for-students/)|
|**Stanford Sentiment Treebank**|A dataset of 10,000+ Rotten Tomatoes reviews for training a model to identify sentiment in longer phrases.|[nlp.stanford.edu](https://nlp.stanford.edu/sentiment/code.html)|
|**The WikiQA Corpus**|This publicly available Q&A dataset was initially compiled to aid research in open-domain question answering.|[www.microsoft.com](https://www.microsoft.com/en-us/download/details.aspx?id=52419&from=https%3A%2F%2Fresearch.microsoft.com%2Fapps%2Fmobile%2Fdownload.aspx%3Fp%3D4495da01-db8c-4041-a7f6-7984a4f6a905)|
|**Twenty Newsgroups Dataset**|This data set consists of 20,000 messages taken from 20 newsgroups.|[archive.ics.uci.edu](https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups)|
|**Twitter US Airline Sentiment**|Analyze how travelers in February 2015 expressed their feelings on Twitter.|[www.kaggle.com](https://www.kaggle.com/crowdflower/twitter-airline-sentiment/)|
|**UCI's Spambase Data set**|This dataset was created by a team at HP (Hewlett-Packard) to help create a spam filter. It contains a collection of emails previously labeled as spam by users.|[archive.ics.uci.edu](https://archive.ics.uci.edu/ml/datasets/Spambase)|
|**Wikipedia Links Data**|This Google dataset contains approximately 13 million documents, each containing at least one hyperlink to an English Wikipedia page.|[code.google.com](https://code.google.com/archive/p/wiki-links/downloads)|
|**WordNet**|WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.|[wordnet.princeton.edu](https://wordnet.princeton.edu/)|
|**Yelp Open Dataset**|This Yelp dataset features 8.5M+ reviews of over 160,000 businesses. It also has 200,000+ pictures and spans 8 major metropolitan areas.|[www.yelp.com](https://www.yelp.com/dataset)|
|**YouTubers-saying-things**|Dataset containing subtitles from popular YouTubers' videos.|[www.kaggle.com](https://www.kaggle.com/praneshmukhopadhyay/youtubers-saying-things)|

---





[⬆️ Go back to index](#index)





## 🖼️ Image Datasets





|Name|Description|URL|
|:---|:----------|:--|
|**CIFAR-10**|The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.|[www.cs.toronto.edu](https://www.cs.toronto.edu/~kriz/cifar.html)|
|**COCO (Common Objects in Context)**|COCO is a large-scale object detection, segmentation, and captioning dataset.|[cocodataset.org](https://cocodataset.org/#home)|
|**Fashion-MNIST**|Fashion-MNIST is a dataset consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.|[github.com](https://github.com/zalandoresearch/fashion-mnist)|
|**ImageNet**|ImageNet is an image database organized according to the [WordNet](https://wordnet.princeton.edu/) hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.|[www.image-net.org](https://www.image-net.org/index.php)|
|**Open Images Dataset**|Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives.|[storage.googleapis.com](https://storage.googleapis.com/openimages/web/index.html)|
|**SVHN (Street View House Number)**|SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting.|[ufldl.stanford.edu](http://ufldl.stanford.edu/housenumbers/)|
|**VisualQA**|VQA is a dataset containing open-ended questions about images. The questions require an understanding of vision, language and commonsense knowledge to answer.|[visualqa.org](https://visualqa.org/)|

---
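The split and image-size figures quoted for CIFAR-10 and Fashion-MNIST can be sanity-checked with a few lines of Python. This is only a sketch: the `ImageDatasetSpec` class and its field names are hypothetical helpers, not part of either dataset's tooling, and the size estimate assumes 3 channels for color images, 1 for grayscale, and one byte per channel value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageDatasetSpec:
    """Figures quoted in the table above; the class itself is illustrative."""
    name: str
    train_images: int
    test_images: int
    height: int
    width: int
    channels: int  # assumption: 3 for color, 1 for grayscale
    classes: int

    @property
    def total_images(self) -> int:
        return self.train_images + self.test_images

    @property
    def values_per_image(self) -> int:
        # Number of pixel values in one image (height x width x channels).
        return self.height * self.width * self.channels

    def raw_size_bytes(self, bytes_per_value: int = 1) -> int:
        # Uncompressed pixel storage, assuming one byte per value by default.
        return self.total_images * self.values_per_image * bytes_per_value


CIFAR10 = ImageDatasetSpec("CIFAR-10", 50_000, 10_000, 32, 32, 3, 10)
FASHION_MNIST = ImageDatasetSpec("Fashion-MNIST", 60_000, 10_000, 28, 28, 1, 10)

if __name__ == "__main__":
    for spec in (CIFAR10, FASHION_MNIST):
        print(spec.name, spec.total_images, spec.values_per_image,
              spec.raw_size_bytes())
```

For CIFAR-10 the totals agree with the description: 50,000 + 10,000 images spread over 10 classes gives 6,000 per class.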





[⬆️ Go back to index](#index)





## 🎵 Audio Datasets





|Name|Description|URL|
|:---|:----------|:--|
|**Ballroom**|This dataset provides information on ballroom dancing. Characteristic excerpts of many dance styles are provided in real audio format. Their tempi are also available.|[mtg.upf.edu](http://mtg.upf.edu/ismir2004/contest/tempoContest/node5.html)|
|**FMA (Free Music Archive)**|A Dataset for Music Analysis|[github.com](https://github.com/mdeff/fma)|
|**Free Spoken Digit Dataset**|A free audio dataset of spoken digits. Think MNIST for audio.|[github.com](https://github.com/Jakobovski/free-spoken-digit-dataset)|
|**LibriSpeech**|LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey.|[www.openslr.org](http://www.openslr.org/12/)|
|**Urban Sound 8K Dataset**|This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes.|[urbansounddataset.weebly.com](https://urbansounddataset.weebly.com/urbansound8k.html)|
|**VoxCeleb**|This is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.|[www.robots.ox.ac.uk](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)|

---
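The durations and sample rates quoted above give a rough sense of scale before downloading anything. The sketch below is a back-of-the-envelope estimate, not a statement about how either dataset is actually distributed: the function names are hypothetical, and the byte count assumes uncompressed 16-bit mono PCM.

```python
def pcm_samples(hours: float, sample_rate_hz: int) -> int:
    """Number of audio samples in `hours` of audio at `sample_rate_hz`."""
    return round(hours * 3600 * sample_rate_hz)


def pcm_bytes(hours: float, sample_rate_hz: int,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Uncompressed size, assuming 16-bit mono PCM by default."""
    return pcm_samples(hours, sample_rate_hz) * bytes_per_sample * channels


def max_total_hours(clips: int, max_clip_seconds: float) -> float:
    """Upper bound on total duration when every clip has the maximum length."""
    return clips * max_clip_seconds / 3600


if __name__ == "__main__":
    # LibriSpeech: approximately 1000 hours at 16 kHz (figures from the table).
    print(pcm_samples(1000, 16_000))      # sample count
    print(pcm_bytes(1000, 16_000) / 1e9)  # size in GB under the PCM assumption
    # Urban Sound 8K: 8732 excerpts of at most 4 seconds each.
    print(max_total_hours(8732, 4))       # upper bound in hours
```

Under these assumptions LibriSpeech comes to about 57.6 billion samples, or roughly 115 GB of raw audio, while Urban Sound 8K is bounded at under 10 hours in total.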