https://github.com/kili-technology/awesome-datasets

A comprehensive list of annotated training datasets classified by use case.
List: awesome-datasets


Host: GitHub
URL: https://github.com/kili-technology/awesome-datasets
Owner: kili-technology
Created: 2022-05-25T19:03:40.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2022-07-08T12:18:14.000Z (almost 3 years ago)
Last Synced: 2024-05-22T22:05:01.787Z (about 1 year ago)
Topics: annotation, awesome-data-science, awesome-datasets, awesome-public-datasets, corpora, data, dataset, datasets, document-processing, entity-extraction, entity-recognition, ner, nlp, ocr, open-datasets, opendata, opendatasets, public-data, public-dataset, public-datasets
Homepage: https://cloud.kili-technology.com/label
Size: 24.9 MB
Stars: 28
Watchers: 3
Forks: 6
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

ultimate-awesome - awesome-datasets - A comprehensive list of annotated training datasets classified by use case. (Other Lists / Julia Lists)

# Awesome Datasets

We're collecting an (admittedly opinionated) list of high-quality annotated datasets. Most of the datasets listed below are free; some are not. They are classified by use case.

*We're only at the beginning, and you can help by contributing to this GitHub repository!*

## How Can I Help?

If you're interested in this area and would like to hear more, join our [Slack community (coming soon)](#)! We'd also appreciate it if you could fill out this short [form (coming soon)](#) to help us better understand what your interests might be.

### Feedback

If you have ideas on how we can make this repository better, feel free to submit an issue with suggestions.

### Contributing

We want this resource to grow with contributions from readers and data enthusiasts. If you'd like to contribute to this GitHub repository, please read our contributing guidelines.

# Table of Contents
- [Audio](#audio)
  - [Speech Recognition](#speech-recognition)
    - [English](#english)
- [Document processing](#document-processing)
  - [Document Classification](#document-classification)
    - [English](#english-1)
  - [Key Information Extraction](#key-information-extraction)
    - [English](#english-2)
    - [Multilingual](#multilingual)
  - [Optical Character Recognition](#optical-character-recognition)
    - [English](#english-3)
  - [Document Layout Analysis](#document-layout-analysis)
    - [English](#english-4)
    - [Japanese](#japanese)
  - [Document Question Answering](#document-question-answering)
    - [English](#english-5)
    - [Multilingual](#multilingual-1)
- [Image Processing](#image-processing)
  - [Instant Segmentation](#instant-segmentation)
    - [Defense](#defense)
    - [Manufacturing](#manufacturing)
    - [Medical](#medical)
- [Natural Language Processing](#natural-language-processing)
  - [Named-Entity Recognition](#named-entity-recognition)
    - [English](#english-6)
      - [Defense](#defense-1)
      - [Finance](#finance)
      - [Medical](#medical-1)
      - [News](#news)
      - [Queries](#queries)
      - [Social media](#social-media)
      - [Technology](#technology)
      - [Twitter](#twitter)
      - [Various](#various)
      - [Wikipedia](#wikipedia)
    - [French](#french)
      - [Medical](#medical-2)
      - [News](#news-1)
      - [Twitter](#twitter-1)
      - [Wikipedia](#wikipedia-1)
  - [Relation Extraction](#relation-extraction)

# Audio
## Speech Recognition
### English
- [M-AILABS Speech Dataset](https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset/) includes 1,000 hours of audio plus transcriptions. It covers multiple languages, arranged by male voices, female voices, and a mix of the two. Most of the data is based on LibriVox and Project Gutenberg.

- [CREMA-D](https://dagshub.com/mert.bozkirr/CREMA-D) is a dataset of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74 coming from various ethnicities.

# Document processing

Documents are an essential part of many businesses in many fields such as law, finance and technology, among others. Automatic processing of documents such as invoices, contracts and resumes is lucrative and opens up many new business avenues. The fields of natural language processing and computer vision have seen considerable progress with the development of deep learning, so these methods have begun to be incorporated into contemporary document understanding systems.

Here is a curated list of datasets for intelligent document processing.

## Document Classification

### English

- [GHEGA](https://bit.ly/3x6z33q) contains two groups of documents: 110 data sheets of electronic components and 136 patents. Each group is further divided into classes: data-sheet classes share the component type and producer; patent classes share the patent source.

- [RVL-CDIP Dataset](https://bit.ly/3x8W4CK) consists of 400,000 grayscale images in 16 classes (letter, memo, email, file folder, form, handwritten, invoice, advertisement, budget, news article, presentation, scientific publication, questionnaire, resume, scientific report, specification), with 25,000 images per class.

- [Top Streamers on Twitch](https://www.kaggle.com/datasets/aayushmishra1512/twitchdata) contains data on the top 1,000 streamers from the past year. It has 11 columns describing each streamer, including number of viewers, number of active viewers, and followers gained.

## Key Information Extraction

### English

- [CORD](https://bit.ly/3NmtdSc) consists of thousands of Indonesian receipts, with images, box/text annotations for OCR, and multi-level semantic labels for parsing. Labels are the bounding box position and the text of the key information.

- [FUNSD](https://bit.ly/3Q3owhV) is a dataset for text detection, optical character recognition, spatial layout analysis, and form understanding. It consists of 199 fully annotated forms, 31,485 words, 9,707 semantic entities, and 5,304 relations.

- [The Kleister NDA dataset](https://github.com/applicaai/kleister-nda) has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract.

- [The Kleister Charity dataset](https://github.com/applicaai/kleister-charity) consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract.

- [NIST](https://bit.ly/3Q7aBaS) The NIST Structured Forms Database consists of 5,590 pages of binary, black-and-white images of synthesized documents: 900 simulated tax submissions, 5,590 images of completed structured form faces, 5,590 text files containing entry field answers.

- [SROIE](https://bit.ly/3MnIWzl) consists of 1,000 whole scanned receipt images and annotations for the competition on Scanned Receipts OCR and key Information Extraction (SROIE). Labels are the bounding box position and the text of the key information.

### Multilingual

- [XFUND](https://bit.ly/3zly4yW) is a multilingual form understanding benchmark dataset that includes human-labeled forms with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).

## Optical Character Recognition

### English

- [FUNSD](https://bit.ly/3Q3owhV) is a dataset for text detection, optical character recognition, spatial layout analysis, and form understanding. It consists of 199 fully annotated forms, 31,485 words, 9,707 semantic entities, and 5,304 relations.

- [RDCL2019](https://bit.ly/3xcEEoz) contains scanned pages from contemporary magazines and technical articles.

- [SROIE](https://bit.ly/3MnIWzl) consists of 1,000 whole scanned receipt images and annotations for the competition on Scanned Receipts OCR and key Information Extraction (SROIE). Labels are the bounding box position and the text of the key information.

- [Synth90k](https://bit.ly/3NoVLdX) consists of 9 million images covering 90k English words.

- [Total Text Dataset](https://bit.ly/3QfKrmn) consists of 1,555 images covering a range of text orientations, including horizontal, multi-oriented, and curved text.

## Document Layout Analysis

### English

- [DocBank](https://bit.ly/3xoheOF) includes 500K document pages, with 12 types of semantic units: abstract, author, caption, date, equation, figure, footer, list, paragraph, reference, section, table, title.

- [Layout Analysis Dataset](https://bit.ly/3avHxZZ) contains realistic documents with a wide variety of layouts, reflecting the various challenges in layout analysis. Particular emphasis is placed on magazines and technical/scientific publications which are likely to be the focus of digitisation efforts.

- [PubLayNet](https://bit.ly/3NScp5u) is a large dataset of document images, of which the layout is annotated with both bounding boxes and polygonal segmentations.

- [TableBank](https://bit.ly/3NnRnMl) is an image-based table detection and recognition dataset built with weak supervision from Word and LaTeX documents on the internet. It contains 417K high-quality labeled tables.

### Japanese

HJDataset contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements for advanced analysis.

## Document Question Answering
### English
- [AmbigQA](https://nlp.cs.washington.edu/ambigqa//) Ambiguity is inherent to open-domain question answering; especially when exploring new topics, it can be difficult to ask questions that have a single, unambiguous answer. AmbigQA is an open-domain question answering task that involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question.

- [Break](https://allenai.github.io/Break/) is a question understanding dataset, aimed at training models to reason over complex questions.

- [chatterbot/english](https://www.kaggle.com/datasets/kausr25/chatterbotenglish) contains a wide variety of topics for training a chatbot, giving the bot information about various fields. A fully fledged bot needs a much larger dataset, but this one is suitable for beginners.

- [Coached Conversational Preference Elicitation](https://research.google/tools/datasets/coached-conversational-preference-elicitation/) contains 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. The data were collected using the Wizard-of-Oz method between two paid workers, one of whom acts as an "assistant" and the other as a "user".

- [ConvAI2 dataset](http://convai.io/data/) contains more than 2,000 dialogs from a PersonaChat competition, in which human evaluators recruited through the Yandex.Toloka crowdsourcing platform chatted with bots submitted by teams.

- [Customer Support on Twitter](https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter) is a Kaggle dataset of more than 3 million tweets and responses from leading brands on Twitter.

- DocVQA contains 50K questions and 12K images. The images are collected from the UCSF Industry Documents Library. Questions and answers are manually annotated.

- [DuReader 2.0](https://allenai.github.io/Break/) is a large-scale, open-domain Chinese dataset for reading comprehension (RC) and question answering (QA). It contains over 300K questions, 1.4M evidence documents, and corresponding human-generated answers.

- [HotpotQA](https://hotpotqa.github.io/) is a question answering dataset featuring natural, multi-hop questions, with a strong emphasis on supporting facts to enable more explainable question answering systems. The dataset consists of 113,000 Wikipedia-based QA pairs.

- [Maluuba goal-oriented dialogue](https://datasets.maluuba.com/Frames) is an open dialogue dataset in which the conversation aims at accomplishing a task or making a decision, in particular finding flights and a hotel. The dataset contains complex conversations and decisions covering over 250 hotels, flights, and destinations.

- [Multi-Domain Wizard-of-Oz dataset (MultiWOZ)](https://aclanthology.org/D18-1547/) is a comprehensive collection of written conversations covering multiple domains and topics. The dataset contains 10,000 dialogs and is at least an order of magnitude larger than any previous task-oriented annotated corpus.

- [NarrativeQA](https://github.com/deepmind/narrativeqa) is a dataset constructed to encourage deeper understanding of language. It involves reasoning over whole books or movie scripts and contains approximately 45,000 free-text question-answer pairs.

- [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) is a large-scale corpus for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions. NQ consists of 300,000 naturally occurring questions, along with human-annotated answers from Wikipedia pages, for use in training QA systems. It also includes 16,000 examples where answers (to the same questions) are provided by 5 different annotators, useful for evaluating the performance of the learned QA systems.

- [OpenBookQA](https://github.com/allenai/OpenBookQA) is inspired by open-book exams that assess human understanding of a subject. The open book that accompanies the questions is a set of 1,329 elementary-level science facts. Approximately 6,000 questions focus on understanding these facts and applying them to new situations.

- [QASC](https://github.com/allenai/qasc) is a question answering dataset that focuses on sentence composition. It consists of 9,980 8-way multiple-choice questions on elementary school science (8,134 train, 926 dev, 920 test), and is accompanied by a corpus of 17M sentences.

- [QuAC](https://quac.ai/) (Question Answering in Context) is a dataset for modeling, understanding, and participating in information-seeking dialogues. It contains 14K information-seeking QA dialogues (100K questions in total).

- [RecipeQA](https://hucvl.github.io/recipeqa/) is a dataset for multimodal understanding of recipes. It consists of more than 36,000 automatically generated question-answer pairs from approximately 20,000 unique recipes with step-by-step instructions and images. Each RecipeQA question involves multiple modalities such as titles, descriptions, or images, and working towards an answer requires (i) a joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) understanding procedural knowledge.

- [Relational Strategies in Customer Service (RSiCS) Dataset](https://nextit-public.s3-us-west-2.amazonaws.com/rsics.html) is a dataset of travel-related customer service data from four sources: conversation logs from three commercial customer service intelligent virtual assistants (IVAs) and airline forums on TripAdvisor.com during the month of August 2016.

- [Santa Barbara Corpus of Spoken American English](https://www.linguistics.ucsb.edu/research/santa-barbara-corpus) contains approximately 249,000 words of transcription, with audio and timestamps at the level of individual intonation units.

- [SGD (Schema-Guided Dialogue) dataset](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue) contains over 16K multi-domain conversations covering 16 domains.

- [TREC QA Collection](https://trec.nist.gov/data/qa.html) TREC has had a question answering track since 1999. In each track, the task was defined so that systems had to retrieve small fragments of text containing an answer to open-domain and closed-domain questions.

- [The WikiQA corpus](https://www.microsoft.com/en-us/download/details.aspx?id=52419&from=http%3A%2F%2Fresearch.microsoft.com%2Fapps%2Fmobile%2Fdownload.aspx%3Fp%3D4495da01-db8c-4041-a7f6-7984a4f6a905) is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. To reflect the true information needs of general users, Bing query logs were used as the source of questions. Each question is linked to a Wikipedia page that potentially contains the answer.

- [Ubuntu Dialogue Corpus](https://www.kaggle.com/datasets/rtatman/ubuntu-dialogue-corpus) consists of nearly one million two-person conversations extracted from Ubuntu chat logs, in which users receive technical support for various Ubuntu-related issues. The dataset contains 930,000 dialogs and over 100,000,000 words.

### Multilingual
- [EXCITEMENTS datasets](https://github.com/hltfbk/EOP-1.2.1/wiki/Data-Sets#data-sets-that-have-to-be-downloaded-separately) are available in English and Italian and contain negative comments from customers giving reasons for their dissatisfaction with a given company.
- [OPUS](https://allenai.github.io/Break/) was created for the standardization and translation of social media texts. It is built by randomly selecting 2,000 messages from the NUS corpus of SMS in English and then translating them into formal Chinese.

- [TyDi QA](https://github.com/google-research-datasets/tydiqa#download-the-dataset) is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs.

# Image Processing
## Instant Segmentation
### Defense
- [A DATASET FOR DETECTING FLYING AIRPLANES ON SATELLITE IMAGES](https://ieee-dataport.org/open-access/dataset-detecting-flying-airplanes-satellite-images) contains satellite images of areas of interest surrounding 30 different European airports. It also provides ground-truth annotations of flying airplanes in part of those images to support future research involving flying airplane detection. This dataset is part of the work entitled "Measuring economic activity from space: a case study using flying airplanes and COVID-19" published by the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.

- [Airbus Aircraft Detection](https://www.kaggle.com/datasets/airbusgeo/airbus-aircrafts-sample-dataset/download) can be used to detect the number, size, and type of aircraft present at an airport. In turn, this can provide information about the activity of any airport.

- [Highway Traffic Videos Dataset](https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset) is a database of videos of highway traffic used in [1] and [2]. The videos were taken over two days from a stationary camera overlooking I-5 in Seattle, WA. They were labeled manually as light, medium, or heavy traffic, corresponding respectively to free-flowing traffic, traffic at reduced speed, and stopped or very slow traffic.

- [iSAID](https://captain-whu.github.io/iSAID/dataset.html) contains 655,451 object instances from 15 categories across 2,806 high-resolution images. iSAID uses pixel-level annotations. The object categories include plane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer ball field, and swimming pool. The images of iSAID are mainly collected from Google Earth.

- [Traffic Analysis in Original Video Data of Ayalon Road](https://github.com/ido90/AyalonRoad) is composed of 81 videos (14 hours) recording the traffic on Ayalon Road. It can be used to detect and track vehicles over consecutive frames.

### Manufacturing
- [casting product image data for quality inspection](https://www.kaggle.com/datasets/ravirajsinh45/real-life-industrial-dataset-of-casting-product) contains a total of 7,348 images. All are 300×300-pixel grayscale images, and augmentation has already been applied to all of them.

- [DAGM 2007](https://www.kaggle.com/datasets/mhskjelvareid/dagm-2007-competition-dataset-optical-inspection) is a synthetic dataset for defect detection on textured surfaces. It was originally created for a competition at the 2007 symposium of the DAGM.

- [MVTec AD](https://www.mvtec.com/company/research/datasets/mvtec-ad/) is a dataset for benchmarking anomaly detection methods with a focus on industrial inspection. It contains over 5000 high-resolution images divided into fifteen different object and texture categories. Each category comprises a set of defect-free training images and a test set of images with various kinds of defects as well as images without defects.

- [Oil Storage Tanks](https://www.kaggle.com/datasets/towardsentropy/oil-storage-tanks) contains nearly 200 satellite images taken from Google Earth of tank-containing industrial areas around the world. Images are annotated with bounding box information for floating head tanks in the image. Fixed head tanks are not annotated.

- [Kolector surface](https://www.vicos.si/resources/kolektorsdd/) is a dataset for detecting steel defects.

- [Metadata - Aerial imagery object identification dataset for building and road detection, and building height estimation](https://figshare.com/articles/dataset/Metadata_-_Aerial_imagery_object_identification_dataset_for_building_and_road_detection_and_building_height_estimation/3504413) For 25 locations across 9 U.S. cities, this dataset provides (1) high-resolution aerial imagery; (2) annotations of over 40,000 building footprints (OSM shapefiles) as well as road polylines; and (3) topographical height data (LIDAR). This dataset can be used as ground truth to train computer vision and machine learning algorithms for object identification and analysis, in particular for building detection and height estimation, as well as road detection.

### Medical
- [BRATS2016](https://www.smir.ch/BRATS/Start2016) is a brain tumor segmentation dataset. It shares the same training set as BRATS 2015, which consists of 220 HGG and 54 LGG cases. Its testing dataset consists of 191 cases with unknown grades.

- [CheXpert](https://stanfordmlgroup.github.io/competitions/chexpert/) dataset contains 224,316 chest radiographs of 65,240 patients with both frontal and lateral views available. The task is to do automated chest x-ray interpretation, featuring uncertainty labels and radiologist-labeled reference standard evaluation sets.

- [CT Medical Images](https://www.kaggle.com/datasets/kmader/siim-medical-images) is designed to allow for different methods to be tested for examining the trends in CT image data associated with using contrast and patient age. The data are a tiny subset of images from the cancer imaging archive. They consist of the middle slice of all CT images taken where valid age, modality, and contrast tags could be found. This results in 475 series from 69 different patients.

- [Open Access Series of Imaging Studies (OASIS)](https://www.oasis-brains.org/) is a retrospective compilation of data for more than 1,000 participants, collected across several ongoing projects through the WUSTL Knight ADRC over the course of 30 years. Participants include 609 cognitively normal adults and 489 individuals at various stages of cognitive decline, ranging in age from 42 to 95 years.

- [TMED (Tufts Medical Echocardiogram Dataset)](https://tmed.cs.tufts.edu/data_access.html) contains imagery from 2773 patients and supervised labels for two classification tasks from a small subset of 260 patients (because labels are difficult to acquire). All data is de-identified and approved for release by our IRB. Imagery comes from transthoracic echocardiograms acquired in the course of routine care consistent with American Society of Echocardiography (ASE) guidelines, all obtained from 2015-2020 at Tufts Medical Center.

# Natural Language Processing

## Named-Entity Recognition

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 scored an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.
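To make the F-measure figures above more concrete, here is a minimal sketch of how such a score can be computed for NER by exact matching of entity spans. The function name `f_measure`, the `(start, end, label)` span representation, and the sample spans are assumptions made for this illustration; official evaluations such as MUC use their own scoring software and matching rules, which may differ.

```python
def f_measure(gold, predicted, beta=1.0):
    """Exact-match F-measure over entity spans.

    Each span is a hypothetical (start, end, label) tuple; a prediction
    counts as correct only if all three fields match a gold span.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative data only: three gold entities, two predictions, one correct.
gold_spans = [(0, 2, "PER"), (5, 6, "LOC"), (8, 9, "ORG")]
predicted_spans = [(0, 2, "PER"), (5, 6, "ORG")]
print(f"F1 = {f_measure(gold_spans, predicted_spans):.2f}")  # F1 = 0.40
```

In this example precision is 1/2 and recall is 1/3, so the balanced F-measure is 0.40. Note that the second prediction has the right boundaries but the wrong label, and exact matching counts it as an error.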

### English

#### Defense
- [CCCS-CIC-AndMal-2020](https://www.unb.ca/cic/datasets/andmal2020.html) is a large, comprehensive Android malware dataset. It includes 200K benign and 200K malware samples, totalling 400K Android apps, with 14 prominent malware categories and 191 eminent malware families.

- [Malware](https://bit.ly/3tdIwEM) consists of texts about malware. It was developed by researchers at the Singapore University of Technology and Design and DSO National Laboratories.

- [re3d](https://bit.ly/3xcZLHq) focuses on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst.

#### Finance

SEC-filings is generated using CoNLL-2003 data and financial documents obtained from U.S. Securities and Exchange Commission (SEC) filings.

#### Medical

- [AnEM](https://bit.ly/38RDLd0) consists of abstracts and full-text biomedical papers.

- [CADEC](https://bit.ly/3xvywtj) is a corpus of adverse drug event annotations.

- [i2b2-2006](https://bit.ly/3MhJxSR) is the Deidentification and Smoking Challenge dataset.

- [i2b2-2014](https://bit.ly/3mhcX95) is the 2014 De-identification and Heart Disease Risk Factors Challenge.

#### News

- [CONLL 2003](https://bit.ly/3NXfCR8) is an annotated dataset for Named Entity Recognition. Each token is labeled with one of 9 possible tags.

- [MUC-6](https://bit.ly/3GS2QkC) contains the 318 annotated Wall Street Journal articles, the scoring software and the corresponding documentation used in the MUC6 evaluation.

- [NIST-IEER](https://bit.ly/3NXfzoq)

#### Queries

- [MITMovie](https://bit.ly/3NrwoIg) is a semantically tagged training and test corpus in BIO format.

- [MITRestaurant](https://bit.ly/3x4qAxw) is a semantically tagged training and test corpus in BIO format.

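Both query corpora above are tagged in BIO format, where `B-X` marks the first token of an entity of type `X`, `I-X` marks a continuation token, and `O` marks tokens outside any entity. The following is a minimal sketch of turning a BIO tag sequence into entity spans. The function name `bio_to_spans`, the sample query, and the `GENRE` and `ACTOR` labels are assumptions chosen for illustration rather than taken from the corpus files.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (start, end, label) spans, end exclusive.

    A stray I-X tag with no matching open entity is treated leniently
    as the start of a new entity.
    """
    spans = []
    start, label = None, None
    # The trailing "O" is a sentinel that closes any entity still open.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


# Illustrative query and tags (hypothetical, not copied from a corpus).
tokens = ["show", "me", "comedies", "with", "tom", "hanks"]
tags = ["O", "O", "B-GENRE", "O", "B-ACTOR", "I-ACTOR"]
entities = [(label, " ".join(tokens[s:e])) for s, e, label in bio_to_spans(tags)]
print(entities)  # [('GENRE', 'comedies'), ('ACTOR', 'tom hanks')]
```

Spans in this form can be compared directly against gold annotations when scoring a model at the entity level instead of the token level.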

#### Social media
- [Enron](https://bit.ly/3x4qAxw) contains over half a million anonymized emails from over 100 users. It's one of the few publicly available collections of “real” emails for study and for building training sets.

- WNUT17 is the dataset for the WNUT 17 Emerging Entities task. It contains text from Twitter, Stack Overflow responses, YouTube comments, and Reddit comments.

#### Technology

Assembly is a dataset for Named Entity Recognition (NER) from assembly operations text manuals.

#### Twitter

- [BTC](https://bit.ly/3aomybD) is the Broad Twitter Corpus, a dataset of tweets collected over stratified times, places, and social uses.

- [Ritter](https://bit.ly/3xkMrlC) is the same as the training portion of WNUT16 (though with sentences in a different order).

#### Various

- [BBN](https://bit.ly/3ml7dvk) contains approximately 24,000 pronoun coreferences as well as entity and numeric annotation for approximately 2,300 documents.

- [Groningen Meaning Bank (GMB)](https://bit.ly/3Q5Dfck) comprises thousands of texts in raw and tokenised format, tags for part of speech, named entities and lexical categories, and discourse representation structures compatible with first-order logic.

- [OntoNotes 5](https://bit.ly/3Q7bgsS) is a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

#### Wikipedia

- [GUM-3.1.0](https://bit.ly/3xmegu0) is the Georgetown University Multilayer Corpus.

- [wikigold](https://bit.ly/3aulGSF) is a manually annotated collection of Wikipedia text.

- [WikiNEuRal](https://bit.ly/3xafLdh) is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.

### French

#### Medical

QUAERO French Medical Corpus was initially developed as a resource for named entity recognition and normalization.

#### News

Europeana Newspapers (Dutch, French, German) is a set of Named Entity Recognition corpora for Dutch, French, and German built from Europeana Newspapers.

#### Twitter

CAp 2017 - (Twitter data) concerns the problem of Named Entity Recognition (NER) for tweets written in French.

#### Wikipedia

- [DBpedia abstract corpus](https://bit.ly/3Q2OXo0) contains a conversion of Wikipedia abstracts in seven languages (Dutch, English, French, German, Italian, Japanese, and Spanish) into the NLP Interchange Format (NIF).

- [WikiNER](https://bit.ly/3PZq8t4) is a multilingual named entity recognition dataset from Wikipedia.

- [WikiNEuRal](https://bit.ly/3xafLdh) is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.

## Relation Extraction

Coming soon! 😘
