https://github.com/karthikncode/nlp-datasets

A list of datasets/corpora for NLP tasks, in reverse chronological order.

Host: GitHub
URL: https://github.com/karthikncode/nlp-datasets
Owner: karthikncode
Created: 2016-04-18T20:58:31.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2020-01-04T22:32:03.000Z (over 5 years ago)
Last Synced: 2024-11-02T23:32:19.505Z (6 months ago)
Size: 5.86 KB
Stars: 919
Watchers: 81
Forks: 255
Open Issues: 5
Metadata Files:
- Readme: README.md

# Datasets for Natural Language Processing

This is a list of datasets/corpora for NLP tasks, in reverse chronological order.

Suggestions and pull requests are welcome. The goal is to make this a collaborative effort to maintain an updated list of quality datasets.

# Areas

  * [Question Answering](#question-answering)

  * [Dialogue Systems](#dialogue-systems)

  * [Goal-Oriented Dialogue Systems](#goal-oriented-dialogue-systems)

## Question Answering

  * **(NLVR)** A Corpus of Natural Language for Visual Reasoning, 2017 [[paper]](http://yoavartzi.com/pub/slya-acl.2017.pdf) [[data]](http://lic.nlp.cornell.edu/nlvr)

  * **(MS MARCO)** MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2016 [[paper]](https://arxiv.org/abs/1611.09268) [[data]](http://www.msmarco.org/)

  * **(NewsQA)** NewsQA: A Machine Comprehension Dataset, 2016 [[paper]](https://arxiv.org/abs/1611.09830) [[data]](https://github.com/Maluuba/newsqa)

  * **(SQuAD)** SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016 [[paper]](http://arxiv.org/abs/1606.05250) [[data]](http://stanford-qa.com)

  * **(GraphQuestions)** On Generating Characteristic-rich Question Sets for QA Evaluation, 2016 [[paper]](http://cs.ucsb.edu/~ysu/papers/emnlp16_graphquestions.pdf) [[data]](https://github.com/ysu1989/GraphQuestions)

  * **(Story Cloze)** A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](http://cs.rochester.edu/nlp/rocstories)

  * **(Children's Book Test)** The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations, 2015 [[paper]](http://arxiv.org/abs/1511.02301) [[data]](http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz)

  * **(SimpleQuestions)** Large-scale Simple Question Answering with Memory Networks, 2015 [[paper]](http://arxiv.org/pdf/1506.02075v1.pdf) [[data]](https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz)

  * **(WikiQA)** WikiQA: A Challenge Dataset for Open-Domain Question Answering, 2015 [[paper]](http://research.microsoft.com/pubs/252176/YangYihMeek_EMNLP-15_WikiQA.pdf) [[data]](http://research.microsoft.com/en-US/downloads/4495da01-db8c-4041-a7f6-7984a4f6a905/default.aspx)

  * **(CNN-DailyMail)** Teaching Machines to Read and Comprehend, 2015 [[paper]](http://arxiv.org/abs/1506.03340) [[code to generate]](https://github.com/deepmind/rc-data)  [[data]](http://cs.nyu.edu/~kcho/DMQA/)

  * **(QuizBowl)** A Neural Network for Factoid Question Answering over Paragraphs, 2014 [[paper]](https://www.cs.umd.edu/~miyyer/pubs/2014_qb_rnn.pdf) [[data]](https://www.cs.umd.edu/~miyyer/qblearn/index.html)

  * **(MCTest)** MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013 [[paper]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/MCTest_EMNLP2013.pdf) [[data]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/data.html) [[alternate data link]](https://github.com/mcobzarenco/mctest/tree/master/data/MCTest)  

  * **(QASent)** What is the Jeopardy model? A quasisynchronous grammar for QA, 2007 [[paper]](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf) [[data]](http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz)

## Dialogue Systems

  * **(Ubuntu Dialogue Corpus)** The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015 [[paper]](http://arxiv.org/abs/1506.08909) [[data]](https://github.com/rkadlec/ubuntu-ranking-dataset-creator)

## Goal-Oriented Dialogue Systems

  * **(Frames)** Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems, 2016 [[paper]](https://arxiv.org/abs/1704.00057) [[data]](http://datasets.maluuba.com/Frames)

  * **(DSTC 2 & 3)** Dialog State Tracking Challenge 2 & 3, 2013 [[paper]](http://camdial.org/~mh521/dstc/downloads/handbook.pdf) [[data]](http://camdial.org/~mh521/dstc/)
