https://github.com/kyzhouhzau/nlp_dataset
Repo for collecting NLP-related datasets!
- Host: GitHub
- URL: https://github.com/kyzhouhzau/nlp_dataset
- Owner: kyzhouhzau
- License: mit
- Created: 2019-05-18T14:03:37.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-05-24T09:28:59.000Z (over 6 years ago)
- Last Synced: 2025-01-22T12:45:21.338Z (9 months ago)
- Size: 5.86 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# NLP_Dataset
# 1. Text Classification
* [Reuters-21578 Text Categorization Collection](http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html)--1999
* [Large Movie Review Dataset v1.0](http://ai.stanford.edu/~amaas/data/sentiment/)--2011
* [Datasets for single-label text categorization](http://ana.cachopo.org/datasets-for-single-label-text-categorization)--2007
# 2. Question Answering
* [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/)
* [Deepmind Question Answering Corpus](https://github.com/deepmind/rc-data)
* [Amazon question/answer data](http://jmcauley.ucsd.edu/data/amazon/qa/)
# 3. Speech Recognition
* [TIMIT Acoustic-Phonetic Continuous Speech Corpus](https://catalog.ldc.upenn.edu/LDC93S1)
* [voxforge](http://voxforge.org/)
* [LibriSpeech ASR corpus](http://www.openslr.org/12/)
# 4. Machine Translation
* [Aligned Hansards of the 36th Parliament of Canada, Release 2001-1a](https://www.isi.edu/natural-language/download/hansard/)
* [European Parliament Proceedings Parallel Corpus 1996-2011](http://www.statmt.org/europarl/)
# 5. Document Summarization
* [The AQUAINT Corpus of English News Text](https://catalog.ldc.upenn.edu/LDC2002T31)
* [Legal Case Reports Data Set](https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports)
# 6. For more datasets, please see the following links
* https://skymind.ai/wiki/open-datasets
* http://www.nltk.org/nltk_data/
* https://nlp.stanford.edu/links/statnlp.html#Corpora
* https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research#Text_data
* https://github.com/niderhoff/nlp-datasets
* https://machinelearningmastery.com/datasets-natural-language-processing/
# Biomedical Field
## [Mutation extraction](http://infos.korea.ac.kr/bronco/)
* [MutationFinder (MF)](http://mutationfinder.sourceforge.net/) was developed to extract point mutations from the literature using a rule-based approach.
* [Extractor of Mutations (EMU)](http://bioinf.umbc.edu/EMU/ftp/) extracts not only point mutations but also insertion/deletion mutations.
* [tmVar](https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/) is a text-mining approach based on conditional random fields (CRF).
**All of the above data comes from:** [http://infos.korea.ac.kr/bronco/PublicCorpus.zip](http://infos.korea.ac.kr/bronco/PublicCorpus.zip)
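
To give a rough feel for what a rule-based approach to point mutations can look like, here is a minimal Python sketch. It is not MutationFinder's actual rule set. It assumes mentions are written in the compact one-letter form "wild-type residue, position, mutant residue" (for example `V600E`), and the names `POINT_MUTATION` and `find_point_mutations` are hypothetical, chosen only for illustration.

```python
import re

# One-letter codes for the 20 standard amino acids.
AA1 = "ACDEFGHIKLMNPQRSTVWY"

# Simplified, assumed pattern: wild-type residue, 1-based position, mutant residue.
# Real tools use many more patterns (three-letter codes, free-text phrasing, etc.).
POINT_MUTATION = re.compile(rf"\b([{AA1}])([1-9][0-9]*)([{AA1}])\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples for mentions like 'V600E'."""
    results = []
    for wild_type, position, mutant in POINT_MUTATION.findall(text):
        # A substitution must change the residue, so skip cases like 'T790T'.
        if wild_type != mutant:
            results.append((wild_type, int(position), mutant))
    return results


if __name__ == "__main__":
    sentence = "The V600E and G12D variants were found; T790T is ignored."
    print(find_point_mutations(sentence))
```

A single regular expression like this will miss many real mentions and can produce false positives (for example, identifiers that happen to match the pattern), which is one reason the tools listed above combine larger rule sets or learned models such as CRFs.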