Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-datasets

A comprehensive list of annotated training datasets classified by use case.
https://github.com/kili-technology/awesome-datasets


  • Speech Recognition

  • Document Classification

    • English

      • RVL-CDIP Dataset - The RVL-CDIP dataset consists of 400,000 grayscale images in 16 classes (letter, memo, email, file folder, form, handwritten, invoice, advertisement, budget, news article, presentation, scientific publication, questionnaire, resume, scientific report, specification), with 25,000 images per class (a loading sketch follows this list).
      • Top Streamers on Twitch
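
The RVL-CDIP entry above describes a standard single-label image classification setup: 400,000 grayscale document images balanced across 16 classes. As a rough sketch only, the snippet below loads such a dataset with torchvision, assuming the images have been reorganised into one sub-directory per class; the paths, image size and batch size are illustrative placeholders, not part of the dataset's own distribution format.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Typical preprocessing for grayscale document images such as RVL-CDIP:
# keep a single channel, resize to a fixed size, convert to a tensor.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Assumes the images have been reorganised into one sub-directory per class
# (letter/, memo/, email/, ...) -- a hypothetical layout chosen for this sketch.
train_set = datasets.ImageFolder("rvl-cdip/train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

print(f"{len(train_set)} images across {len(train_set.classes)} classes")
```
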
  • Document Question Answering

    • English

      • Multi-Domain Wizard-of-Oz dataset (MultiWOZ) - A task-oriented annotated corpus of human-human written conversations spanning multiple domains.
      • AmbigQA - Open-domain question answering is often ambiguous; especially when exploring new topics, it can be difficult to ask questions that have a single, unambiguous answer. AmbigQA is a new open-domain question answering task which involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question.
      • chatterbot/english
      • Coached Conversational Preference Elicitation
      • ConvAI2 dataset
      • Customer Support on Twitter
      • HotpotQA - A question answering dataset featuring natural, multi-hop questions, with a strong emphasis on supporting facts to allow for more explainable question answering systems. The dataset consists of 113,000 Wikipedia-based QA pairs; a generic record-loading sketch follows this list.
      • Maluuba goal-oriented dialogue - Goal-oriented conversations, in particular about finding flights and a hotel. The dataset contains complex conversations and decisions covering over 250 hotels, flights and destinations.
      • Natural Questions (NQ) - A large-scale corpus for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions. NQ consists of 300,000 naturally occurring questions with human-annotated answers from Wikipedia pages, for use in training QA systems. In addition, 16,000 examples have answers (to the same questions) provided by 5 different annotators, useful for evaluating the performance of the learned QA systems.
      • QuAC - Question Answering in Context, a dataset for modeling, understanding, and participating in information-seeking dialogue; it consists of information-seeking QA dialogues with 100K questions in total.
      • RecipeQA - A dataset for multimodal comprehension of cooking recipes with step-by-step instructions and images. Each RecipeQA question involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) a common understanding of images and text, (ii) capturing the temporal flow of events, and (iii) understanding procedural knowledge.
      • Relational Strategies in Customer Service (RSiCS) Dataset - Travel-related customer service data from four sources: conversation logs from three commercial customer service IVAs (intelligent virtual assistants) and airline forums on TripAdvisor.com during the month of August 2016.
      • Santa Barbara Corpus of Spoken American English
      • TREC QA Collection - A question answering collection featuring both open-domain and closed-domain questions.
      • The WikiQA corpus - A set of open-domain questions. In order to reflect the true information needs of general users, Bing query logs were used as the source of questions. Each question is linked to a Wikipedia page that potentially contains the answer.
      • Ubuntu Dialogue Corpus - Two-person conversations extracted from Ubuntu chat logs, used to obtain technical support for various Ubuntu-related issues. The dataset contains 930,000 dialogues and over 100,000,000 words.
      • NarrativeQA - A reading comprehension dataset of books and movie scripts with human-written question-and-answer pairs.
      • OpenBookQA - Modeled after open-book exams for assessing human understanding of a subject. The open book that accompanies the questions is a set of 1,329 elementary-level science facts. Approximately 6,000 questions focus on understanding these facts and applying them to new situations.
      • QASC - A question-and-answer dataset that focuses on sentence composition. It consists of 9,980 8-way multiple-choice questions on elementary school science (8,134 train, 926 dev, 920 test), and is accompanied by a corpus of 17M sentences.
      • SGD (Schema-Guided Dialogue) dataset - Multi-domain, task-oriented conversations covering 16 domains.
      • Break
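
Despite their different focuses, most of the question answering corpora listed above (HotpotQA, Natural Questions, WikiQA, and others) reduce to records that pair a question with one or more answers plus the supporting context. The sketch below shows one plausible in-memory representation and a loader for a JSON-lines file in that shape; the field names and the file path are assumptions for illustration, not the actual schema of any of these datasets.

```python
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class QAExample:
    """One question-answer record, loosely modelled on corpora such as
    HotpotQA or Natural Questions; the field names are illustrative."""
    question: str
    answers: List[str]
    context: str                       # passage(s) the answer is drawn from
    supporting_facts: List[str] = field(default_factory=list)  # e.g. sentence ids

def load_qa_jsonl(path: str) -> List[QAExample]:
    """Read a JSON-lines file where each line holds one QA record."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            examples.append(QAExample(
                question=record["question"],
                answers=record.get("answers", []),
                context=record.get("context", ""),
                supporting_facts=record.get("supporting_facts", []),
            ))
    return examples

if __name__ == "__main__":
    # "qa_train.jsonl" is a placeholder path, not a file shipped by any dataset.
    for example in load_qa_jsonl("qa_train.jsonl")[:3]:
        print(example.question, "->", example.answers[:1])
```
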
    • Multilingual

  • Key Information Extraction

    • English

      • NIST - Black-and-white images of synthesized documents: 900 simulated tax submissions, 5,590 images of completed structured form faces, and 5,590 text files containing entry field answers.
      • The Kleister NDA dataset - Non-Disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract.
      • The Kleister Charity dataset
      • CORD - A Consolidated Receipt Dataset with box-level semantic labels for parsing. Labels are the bounding box position and the text of the key information (a grouping sketch follows this list).
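
Datasets such as CORD annotate each OCR box on a receipt with a semantic label, so key information extraction can start from simply grouping boxes by label. The sketch below illustrates that idea; the field names and the example label strings are assumptions for illustration, not the dataset's official schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class LabeledBox:
    """One OCR box with a semantic label, in the spirit of receipt datasets
    such as CORD; the exact field names used here are assumptions."""
    text: str
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
    label: str                       # e.g. "menu.nm", "menu.price", "total.total_price"

def group_by_label(boxes: List[LabeledBox]) -> Dict[str, List[str]]:
    """Collect the text of all boxes that share a semantic label,
    turning box-level labels into key information fields."""
    grouped: Dict[str, List[str]] = {}
    for b in boxes:
        grouped.setdefault(b.label, []).append(b.text)
    return grouped

# A tiny hand-made example receipt, not real dataset content.
receipt = [
    LabeledBox("Espresso", (34, 120, 180, 150), "menu.nm"),
    LabeledBox("3.50",     (420, 120, 470, 150), "menu.price"),
    LabeledBox("3.50",     (420, 300, 470, 330), "total.total_price"),
]
print(group_by_label(receipt))
```
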
    • Multilingual

      • XFUND - Human-labeled forms with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese); a key-value pairing sketch follows this list.
      • GHEGA - 110 data-sheets of electronic components and 136 patents. Each group is further divided into classes: data-sheet classes share the component type and producer; patent classes share the patent source.
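
Form understanding datasets like XFUND label text segments as questions, answers or headers and link related segments, so key-value extraction amounts to following those links. The sketch below pairs question entities with the answer entities they link to, assuming a FUNSD/XFUND-style JSON annotation with fields named id, text, label and linking; treat those names and the file path as assumptions rather than the exact XFUND release format.

```python
import json
from typing import Dict, List, Tuple

def extract_key_value_pairs(path: str) -> List[Tuple[str, str]]:
    """Pair 'question' entities with the 'answer' entities they are linked to.

    Assumes a FUNSD/XFUND-style annotation whose entities carry the fields
    'id', 'text', 'label' and 'linking'; these names are assumptions, not a
    guaranteed match for the exact XFUND release format.
    """
    with open(path, encoding="utf-8") as f:
        form = json.load(f)

    entities: List[Dict] = form["document"]            # labelled text segments
    by_id = {entity["id"]: entity for entity in entities}

    pairs: List[Tuple[str, str]] = []
    for entity in entities:
        if entity["label"] != "question":
            continue
        for link in entity.get("linking", []):          # each link is a pair of entity ids
            for other_id in link:
                other = by_id.get(other_id)
                if other is not None and other["label"] == "answer":
                    pairs.append((entity["text"], other["text"]))
    return pairs

if __name__ == "__main__":
    # "form_annotation.json" is a placeholder path for one annotated form.
    for key, value in extract_key_value_pairs("form_annotation.json"):
        print(f"{key} -> {value}")
```
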
  • Optical Character Recognition

  • Document Layout Analysis

  • Instance Segmentation

  • Named-Entity Recognition