https://github.com/kyzhouhzau/nlp_dataset

A repo for collecting NLP-related datasets.

Host: GitHub
URL: https://github.com/kyzhouhzau/nlp_dataset
Owner: kyzhouhzau
License: mit
Created: 2019-05-18T14:03:37.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-05-24T09:28:59.000Z (over 6 years ago)
Last Synced: 2025-01-22T12:45:21.338Z (9 months ago)
Size: 5.86 KB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# NLP_Dataset

# 1. Text Classification

* [Reuters-21578 Text Categorization Collection](http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html)--1999

* [Large Movie Review Dataset v1.0](http://ai.stanford.edu/~amaas/data/sentiment/)--2011

* [Datasets for single-label text categorization](http://ana.cachopo.org/datasets-for-single-label-text-categorization)--2007

# 2. Question Answering

* [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/)

* [DeepMind Question Answering Corpus](https://github.com/deepmind/rc-data)

* [Amazon question/answer data](http://jmcauley.ucsd.edu/data/amazon/qa/)

# 3. Speech Recognition

* [TIMIT Acoustic-Phonetic Continuous Speech Corpus](https://catalog.ldc.upenn.edu/LDC93S1)

* [VoxForge](http://voxforge.org/)

* [LibriSpeech ASR corpus](http://www.openslr.org/12/)

# 4. Machine Translation

* [Aligned Hansards of the 36th Parliament of Canada, Release 2001-1a](https://www.isi.edu/natural-language/download/hansard/)

* [European Parliament Proceedings Parallel Corpus 1996-2011](http://www.statmt.org/europarl/)

# 5. Document Summarization

* [The AQUAINT Corpus of English News Text](https://catalog.ldc.upenn.edu/LDC2002T31)

* [Legal Case Reports Data Set](https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports)

# 6. For more datasets, see the following links

* https://skymind.ai/wiki/open-datasets

* http://www.nltk.org/nltk_data/

* https://nlp.stanford.edu/links/statnlp.html#Corpora

* https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research#Text_data

* https://github.com/niderhoff/nlp-datasets

* https://machinelearningmastery.com/datasets-natural-language-processing/

# Biomedical Field

## [Mutation extraction](http://infos.korea.ac.kr/bronco/)

* [MutationFinder (MF)](http://mutationfinder.sourceforge.net/)

  Developed to extract point mutations from the literature using a rule-based approach.

* [Extractor of mutation (EMU)](http://bioinf.umbc.edu/EMU/ftp/)

  Extracts not only point mutations but also insertion/deletion mutations.

* [tmVar](https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/)

  A text-mining approach based on conditional random fields (CRF).

**All of the above data come from:** [http://infos.korea.ac.kr/bronco/PublicCorpus.zip](http://infos.korea.ac.kr/bronco/PublicCorpus.zip)
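MutationFinder is described above as rule-based. As a rough illustration of what rule-based point-mutation extraction can look like, here is a minimal Python sketch. It is not MutationFinder's actual rule set: the two regular expressions, the `extract_point_mutations` function name, and the sample sentence are all assumptions made for this example. It only handles the compact forms `A123T` and `Gly12Val`, and it will produce false positives on look-alike tokens such as cell-line names.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"
THREE_LETTER = "|".join(THREE_TO_ONE)

# Hypothetical patterns for illustration only (not taken from any real tool):
# wild-type residue, 1-based position, mutant residue.
SHORT_FORM = re.compile(rf"\b([{ONE_LETTER}])([1-9]\d*)([{ONE_LETTER}])\b")
LONG_FORM = re.compile(rf"\b({THREE_LETTER})([1-9]\d*)({THREE_LETTER})\b")


def extract_point_mutations(text):
    """Return de-duplicated (wild_type, position, mutant) tuples, sorted by position."""
    found = set()
    for wt, pos, mut in SHORT_FORM.findall(text):
        if wt != mut:  # a substitution must change the residue
            found.add((wt, int(pos), mut))
    for wt, pos, mut in LONG_FORM.findall(text):
        if wt != mut:
            found.add((THREE_TO_ONE[wt], int(pos), THREE_TO_ONE[mut]))
    return sorted(found, key=lambda m: (m[1], m[0], m[2]))


if __name__ == "__main__":
    sentence = "The A123T and Gly12Val substitutions were observed."
    for wild_type, position, mutant in extract_point_mutations(sentence):
        print(f"{wild_type}{position}{mutant}")
```

Normalizing both notations to one-letter tuples makes it easier to compare the output against annotated corpora, but a practical extractor would need many more patterns (for example, spelled-out forms such as "alanine to threonine at position 123") and filtering of non-mutation matches.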
