https://github.com/zhanlaoban/nlp_pemdc


# NLP_PEMDC
**NLP** **P**retrained **E**mbeddings, **M**odels and **D**atasets **C**ollections (**NLP_PEMDC**)

A collection of the pretrained word embeddings, pretrained models, and datasets for NLP that I come across. The collection will keep updating. These resources are gathered for learning and research purposes only.

Entries are in no particular order; they simply appear in the order I added them. All datasets belong to their original authors, thanks! If anything here infringes your rights, please email me and I will remove it.

# Pretrained Chinese Word Vectors (Embeddings):

## Word2vec

1. > ### 100+ Pretrained Chinese Word Vectors
> [Github](https://github.com/Embedding/Chinese-Word-Vectors)

2. > ### Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
> [URL](https://ai.tencent.com/ailab/nlp/embedding.html)
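
Vector files from collections like these are commonly distributed in the word2vec text format: a header line with the vocabulary size and the dimension, then one entry per line with the word followed by its values. The sketch below is a minimal reader under that assumption; check each download's own documentation for the actual format. The function names `load_text_vectors` and `cosine` and the three-word sample are hypothetical and serve only as illustration.

```python
import io
import math
from typing import Dict, List, TextIO


def load_text_vectors(handle: TextIO, limit: int = 0) -> Dict[str, List[float]]:
    """Parse word2vec text-format vectors from an open text handle.

    Assumed layout: first line "<vocab size> <dimension>", then one
    "<word> <v1> ... <vN>" entry per line, separated by single spaces.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        # Split from the right so a token containing spaces stays intact.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit and len(vectors) >= limit:
            break  # large files can be truncated to the first `limit` words
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Tiny in-memory sample standing in for a real vector file (made-up values).
sample = io.StringIO("3 2\n中国 1.0 0.0\n北京 0.8 0.6\n苹果 0.0 1.0\n")
vectors = load_text_vectors(sample)
print(cosine(vectors["中国"], vectors["北京"]))
```

For a real file, open it with `open(path, encoding="utf-8")` and pass the handle in; the `limit` argument helps when the full vocabulary does not fit comfortably in memory.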

## GloVe

TODO

# Chinese Pre-trained Models
1. > ### Chinese-BERT
>
> [Github](https://github.com/google-research/bert)

2. > ### Chinese-BERT-wwm
>
> [Github](https://github.com/ymcui/Chinese-BERT-wwm)

3. > ### Chinese-XLNet
>
> [Github1](https://github.com/brightmart/xlnet_zh)
>
> [Github2](https://github.com/ymcui/Chinese-PreTrained-XLNet)

4. > ### Chinese-RoBERTa
>
> [Github](https://github.com/brightmart/roberta_zh)

5. > ### Chinese-ALBERT
>
> [Github1](https://github.com/brightmart/albert_zh)
>
> [Github2](https://github.com/google-research/ALBERT)

# Chinese Corpus:
1. > ### [collections]Large Scale Chinese Corpus for NLP
> [Github](https://github.com/brightmart/nlp_chinese_corpus)

2. > ### [collections]Sogou Labs Corpus Collection
> [Corpus data](http://www.sogou.com/labs/resource/list_yuliao.php)

3. > ### [collections]ChineseNlpCorpus
>
> [Github](https://github.com/SophonPlus/ChineseNlpCorpus)

4. > ### [collections]ChineseGLUE
>
> [Github](https://github.com/chineseGLUE/chineseGLUE)
>
> **Including:**
>
> 1. ##### LCQMC: Semantic Similarity Task for Colloquial Descriptions [COLING 2018](https://www.aclweb.org/anthology/C18-1166/)
>
> 2. ##### XNLI: Natural Language Inference [EMNLP 2015](https://www.aclweb.org/anthology/D15-1075/)
>
> 3. ##### TNEWS: Short Text Classification for Toutiao Chinese News
>
> 4. ##### INEWS: Sentiment Analysis for Internet News
>
> 5. ##### THUCNEWS: Long Text Classification
>
> 6. ##### iFLYTEK: Long Text Classification
>
> 7. ##### DRCD: Reading Comprehension for Traditional Chinese
>
> 8. ##### CMRC2018: Reading Comprehension for Simplified Chinese
>
> 9. ##### BQ: Question Matching for Intelligent Customer Service [EMNLP 2018](https://www.aclweb.org/anthology/D18-1536/) [Download](http://icrc.hitsz.edu.cn/Article/show/175.html)
>
> 10. ##### MSRANER: Named Entity Recognition
>
> 11. ##### CHID: Chinese IDiom Dataset for Cloze Test
>
> 12. ##### CMNLI: Chinese Multi-Genre NLI

5. > ### LCSTS: A Large Scale Chinese Short Text Summarization Dataset
>
> [arXiv](https://arxiv.org/abs/1506.05865)
>
> [Download](http://icrc.hitsz.edu.cn/Article/show/139.html)

6. > ### chinese-poetry: The Most Comprehensive Database of Classical Chinese Poetry and Literature
> [Github](https://github.com/chinese-poetry/chinese-poetry)

7. > ### SentiBridge: A Chinese Entity Sentiment Knowledge Base
> [Github](https://github.com/rainarch/SentiBridge)

# English Corpus:

1. > ### [collections]GLUE
>
> [Download](https://gluebenchmark.com/tasks)
>
> **Including:**
>
> 1. The Corpus of Linguistic Acceptability
> 2. The Stanford Sentiment Treebank
> 3. Microsoft Research Paraphrase Corpus
> 4. Semantic Textual Similarity Benchmark
> 5. Quora Question Pairs
> 6. MultiNLI Matched
> 7. MultiNLI Mismatched
> 8. Question NLI
> 9. Recognizing Textual Entailment
> 10. Winograd NLI
> 11. Diagnostics Main

2. > ### [collections]SuperGLUE
>
> [Download](https://super.gluebenchmark.com/tasks)
>
> **Including:**
>
> 1. Broadcoverage Diagnostics
> 2. CommitmentBank
> 3. Choice of Plausible Alternatives
> 4. Multi-Sentence Reading Comprehension
> 5. Recognizing Textual Entailment
> 6. Words in Context
> 7. The Winograd Schema Challenge
> 8. BoolQ
> 9. Reading Comprehension with Commonsense Reasoning
> 10. Winogender Schema Diagnostics

3. > ### IMDB Large Movie Review Dataset
>
> This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
>
> [Download](https://ai.stanford.edu/~amaas/data/sentiment/)

4. > ### SQuAD2.0
>
> The Stanford Question Answering Dataset
>
> [Website](https://rajpurkar.github.io/SQuAD-explorer/)
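
For the IMDB Large Movie Review Dataset listed above, a binary sentiment loader can be sketched as follows. It assumes the extracted archive uses a `<root>/<split>/pos` and `<root>/<split>/neg` layout with one review per `.txt` file; verify this against the README shipped with the download. The function name `load_imdb_split` is hypothetical, and the demo builds a tiny stand-in directory with made-up reviews rather than using the real data.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def load_imdb_split(root: Path, split: str) -> List[Tuple[str, int]]:
    """Read (review text, label) pairs, with 0 = negative and 1 = positive.

    Assumes reviews live in <root>/<split>/neg/*.txt and <root>/<split>/pos/*.txt.
    """
    examples: List[Tuple[str, int]] = []
    for label_name, label in (("neg", 0), ("pos", 1)):
        # Sort file names so the result order is deterministic.
        for path in sorted((root / split / label_name).glob("*.txt")):
            examples.append((path.read_text(encoding="utf-8"), label))
    return examples


# Build a two-review stand-in for the real directory tree (made-up content).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for label_name, text in (("neg", "Dull."), ("pos", "Great.")):
        folder = root / "train" / label_name
        folder.mkdir(parents=True)
        (folder / "0.txt").write_text(text, encoding="utf-8")
    examples = load_imdb_split(root, "train")

print(len(examples), "reviews loaded")
```

With the real dataset, pointing `root` at the extracted folder and calling the function once with `"train"` and once with `"test"` should give the two 25,000-review splits described above.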