Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/karthikncode/nlp-datasets
A list of datasets/corpora for NLP tasks, in reverse chronological order.
- Host: GitHub
- URL: https://github.com/karthikncode/nlp-datasets
- Owner: karthikncode
- Created: 2016-04-18T20:58:31.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2020-01-04T22:32:03.000Z (almost 5 years ago)
- Last Synced: 2024-08-02T08:09:54.667Z (3 months ago)
- Size: 5.86 KB
- Stars: 921
- Watchers: 82
- Forks: 257
- Open Issues: 5
Metadata Files:
- Readme: README.md
# Datasets for Natural Language Processing
This is a list of datasets/corpora for NLP tasks, in reverse chronological order.
Suggestions and pull requests are welcome. The goal is to make this a collaborative effort to maintain an updated list of quality datasets.

# Areas
* [Question Answering](#question-answering)
* [Dialogue Systems](#dialogue-systems)
* [Goal-Oriented Dialogue Systems](#goal-oriented-dialogue-systems)

## Question Answering
* **(NLVR)** A Corpus of Natural Language for Visual Reasoning, 2017 [[paper]](http://yoavartzi.com/pub/slya-acl.2017.pdf) [[data]](http://lic.nlp.cornell.edu/nlvr)
* **(MS MARCO)** MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2016 [[paper]](https://arxiv.org/abs/1611.09268) [[data]](http://www.msmarco.org/)
* **(NewsQA)** NewsQA: A Machine Comprehension Dataset, 2016 [[paper]](https://arxiv.org/abs/1611.09830) [[data]](https://github.com/Maluuba/newsqa)
* **(SQuAD)** SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016 [[paper]](http://arxiv.org/abs/1606.05250) [[data]](http://stanford-qa.com)
* **(GraphQuestions)** On Generating Characteristic-rich Question Sets for QA Evaluation, 2016 [[paper]](http://cs.ucsb.edu/~ysu/papers/emnlp16_graphquestions.pdf) [[data]](https://github.com/ysu1989/GraphQuestions)
* **(Story Cloze)** A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories, 2016 [[paper]](http://arxiv.org/abs/1604.01696) [[data]](http://cs.rochester.edu/nlp/rocstories)
* **(Children's Book Test)** The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations, 2015 [[paper]](http://arxiv.org/abs/1511.02301) [[data]](http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz)
* **(SimpleQuestions)** Large-scale Simple Question Answering with Memory Networks, 2015 [[paper]](http://arxiv.org/pdf/1506.02075v1.pdf) [[data]](https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz)
* **(WikiQA)** WikiQA: A Challenge Dataset for Open-Domain Question Answering, 2015 [[paper]](http://research.microsoft.com/pubs/252176/YangYihMeek_EMNLP-15_WikiQA.pdf) [[data]](http://research.microsoft.com/en-US/downloads/4495da01-db8c-4041-a7f6-7984a4f6a905/default.aspx)
* **(CNN-DailyMail)** Teaching Machines to Read and Comprehend, 2015 [[paper]](http://arxiv.org/abs/1506.03340) [[code to generate]](https://github.com/deepmind/rc-data) [[data]](http://cs.nyu.edu/~kcho/DMQA/)
* **(QuizBowl)** A Neural Network for Factoid Question Answering over Paragraphs, 2014 [[paper]](https://www.cs.umd.edu/~miyyer/pubs/2014_qb_rnn.pdf) [[data]](https://www.cs.umd.edu/~miyyer/qblearn/index.html)
* **(MCTest)** MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013 [[paper]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/MCTest_EMNLP2013.pdf) [[data]](http://research.microsoft.com/en-us/um/redmond/projects/mctest/data.html) [[alternate data link]](https://github.com/mcobzarenco/mctest/tree/master/data/MCTest)
* **(QASent)** What is the Jeopardy model? A quasisynchronous grammar for QA, 2007 [[paper]](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf) [[data]](http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz)

## Dialogue Systems
* **(Ubuntu Dialogue Corpus)** The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015 [[paper]](http://arxiv.org/abs/1506.08909) [[data]](https://github.com/rkadlec/ubuntu-ranking-dataset-creator)

## Goal-Oriented Dialogue Systems
* **(Frames)** Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems, 2016 [[paper]](https://arxiv.org/abs/1704.00057) [[data]](http://datasets.maluuba.com/Frames)
* **(DSTC 2 & 3)** Dialog State Tracking Challenge 2 & 3, 2013 [[paper]](http://camdial.org/~mh521/dstc/downloads/handbook.pdf) [[data]](http://camdial.org/~mh521/dstc/)