https://github.com/zhanlaoban/nlp_pemdc
NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC). The collection will keep updating.
- Host: GitHub
- URL: https://github.com/zhanlaoban/nlp_pemdc
- Owner: zhanlaoban
- Created: 2019-04-18T01:44:54.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2020-01-14T02:22:52.000Z (over 5 years ago)
- Last Synced: 2025-01-05T18:52:06.653Z (5 months ago)
- Topics: collections, dataset, datasets, nlp, word2vec
- Homepage:
- Size: 33.2 KB
- Stars: 64
- Watchers: 3
- Forks: 16
- Open Issues: 0
Metadata Files:
- Readme: README.md
# NLP_PEMDC
**NLP** **P**retrained **E**mbeddings, **M**odels and **D**atasets **C**ollections (**NLP_PEMDC**): a running collection of the pretrained word embeddings, models, and datasets for NLP that I come across. The collection will keep updating. These pretrained word vectors and datasets are provided for learning and research purposes only.

The entries are in no particular order, only the order in which I added them. Each dataset belongs to its original author, thanks! If there is any infringement, please email me and I will remove it.
# Pretrained Chinese Word Vectors (Embeddings):
## Word2vec
1. > ### 100+ Chinese Word Vectors (over a hundred sets of pretrained Chinese word vectors)
> [Github](https://github.com/Embedding/Chinese-Word-Vectors)
2. > ### Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
> [URL](https://ai.tencent.com/ailab/nlp/embedding.html)

## GloVe
TODO
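
Pretrained vectors such as those listed in this section are commonly distributed as plain text, one token per line followed by its float components, sometimes with a first line giving the vocabulary size and dimension (the word2vec text format). That layout is an assumption to check against each release's own documentation. A minimal sketch of reading such a file, where `load_text_vectors` and its `limit` parameter are hypothetical names introduced here for illustration:

```python
import io
from typing import Dict, List, TextIO


def load_text_vectors(handle: TextIO, limit: int = 0) -> Dict[str, List[float]]:
    """Parse whitespace-separated word vectors from an open text file.

    Assumes one entry per line: a token followed by its float components.
    An optional first line of the form "<vocab_size> <dim>" is skipped.
    Tokens that themselves contain whitespace are not handled.
    """
    vectors: Dict[str, List[float]] = {}
    for line_no, line in enumerate(handle):
        parts = line.split()
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # header line: vocabulary size and dimension
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
        if limit and len(vectors) >= limit:
            break  # stop early; full releases can hold millions of entries
    return vectors


# Tiny in-memory sample standing in for a downloaded file.
sample = io.StringIO("2 3\n中国 0.1 0.2 0.3\n北京 0.4 0.5 0.6\n")
vectors = load_text_vectors(sample)
print(len(vectors), vectors["北京"])
```

For a real file, pass `open(path, encoding="utf-8")` instead of the in-memory sample, and set `limit` to keep memory use manageable while exploring.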
# Chinese Pre-trained Models
1. > ### Chinese-BERT
> [Github](https://github.com/google-research/bert)
2. > ### Chinese-BERT-wwm
>
> [Github](https://github.com/ymcui/Chinese-BERT-wwm)
3. > ### Chinese-XLNet
>
> [Github1](https://github.com/brightmart/xlnet_zh)
>
> [Github2](https://github.com/ymcui/Chinese-PreTrained-XLNet)
4. > ### Chinese-RoBERTa
>
> [Github](https://github.com/brightmart/roberta_zh)
5. > ### Chinese-ALBERT
>
> [Github1](https://github.com/brightmart/albert_zh)
>
> [Github2](https://github.com/google-research/ALBERT)

# Chinese Corpus:
1. > ### [collections]Large Scale Chinese Corpus for NLP
> [Github](https://github.com/brightmart/nlp_chinese_corpus)
2. > ### [collections]Sogou Labs Corpus Collection
> [Corpus data](http://www.sogou.com/labs/resource/list_yuliao.php)
3. > ### [collections]ChineseNlpCorpus
>
> [Github](https://github.com/SophonPlus/ChineseNlpCorpus)
4. > ### [collections]ChineseGLUE
>
> [Github](https://github.com/chineseGLUE/chineseGLUE)
>
> **Currently including:**
>
> 1. ##### LCQMC Semantic Similarity Task for Colloquial Descriptions [COLING 2018](https://www.aclweb.org/anthology/C18-1166/)
>
> 2. ##### XNLI Natural Language Inference [EMNLP 2015](https://www.aclweb.org/anthology/D15-1075/)
>
> 3. ##### TNEWS Short Text Classification for Toutiao Chinese News
>
> 4. ##### INEWS Sentiment Analysis for Internet News
>
> 5. ##### THUCNEWS Long Text Classification
>
> 6. ##### iFLYTEK Long Text Classification
>
> 7. ##### DRCD Reading Comprehension for Traditional Chinese
>
> 8. ##### CMRC2018 Reading Comprehension for Simplified Chinese
>
> 9. ##### BQ Question Matching for Customer Service [EMNLP 2018](https://www.aclweb.org/anthology/D18-1536/) [Download](http://icrc.hitsz.edu.cn/Article/show/175.html)
>
> 10. ##### MSRANER Named Entity Recognition
>
> 11. ##### CHID Chinese IDiom Dataset for Cloze Test
>
> 12. ##### CMNLI Chinese Multi-Genre NLI
5. > ### LCSTS: A Large Scale Chinese Short Text Summarization Dataset
>
> [arXiv](https://arxiv.org/abs/1506.05865)
>
> [Download](http://icrc.hitsz.edu.cn/Article/show/139.html)
6. > ### chinese-poetry: The Most Comprehensive Database of Classical Chinese Poetry and Literature
> [Github](https://github.com/chinese-poetry/chinese-poetry)
7. > ### SentiBridge: A Chinese Entity Sentiment Knowledge Base
> [Github](https://github.com/rainarch/SentiBridge)

# English Corpus:
1. > ### [collections]GLUE
>
> [Download](https://gluebenchmark.com/tasks)
>
> **Including:**
>
> 1. The Corpus of Linguistic Acceptability
> 2. The Stanford Sentiment Treebank
> 3. Microsoft Research Paraphrase Corpus
> 4. Semantic Textual Similarity Benchmark
> 5. Quora Question Pairs
> 6. MultiNLI Matched
> 7. MultiNLI Mismatched
> 8. Question NLI
> 9. Recognizing Textual Entailment
> 10. Winograd NLI
> 11. Diagnostics Main
2. > ### [collections]SuperGLUE
>
> [Download](https://super.gluebenchmark.com/tasks)
>
> **Including:**
>
> 1. Broadcoverage Diagnostics
> 2. CommitmentBank
> 3. Choice of Plausible Alternatives
> 4. Multi-Sentence Reading Comprehension
> 5. Recognizing Textual Entailment
> 6. Words in Context
> 7. The Winograd Schema Challenge
> 8. BoolQ
> 9. Reading Comprehension with Commonsense Reasoning
> 10. Winogender Schema Diagnostics
3. > ### IMDB Large Movie Review Dataset
>
> This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data.
>
> [Download](https://ai.stanford.edu/~amaas/data/sentiment/)
4. > ### SQuAD2.0
>
> The Stanford Question Answering Dataset
>
> [Website](https://rajpurkar.github.io/SQuAD-explorer/)
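
As a hedged illustration of working with the IMDB Large Movie Review Dataset listed above, the sketch below reads labeled reviews from disk. It assumes the archive has been extracted into an `aclImdb` directory containing `train/pos`, `train/neg`, `test/pos`, and `test/neg` folders of `.txt` files; verify that layout against your download. The function name `iter_imdb_reviews` is hypothetical and not part of any library.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_imdb_reviews(root: Path, split: str = "train") -> Iterator[Tuple[str, int]]:
    """Yield (review_text, label) pairs, with label 1 for pos and 0 for neg.

    Assumes the layout <root>/<split>/{neg,pos}/*.txt described above.
    """
    for label_name, label in (("neg", 0), ("pos", 1)):
        folder = root / split / label_name
        # Sort for a deterministic order; a missing folder yields nothing.
        for path in sorted(folder.glob("*.txt")):
            yield path.read_text(encoding="utf-8"), label


# Example usage, once the archive is extracted next to this script:
# for text, label in iter_imdb_reviews(Path("aclImdb"), "train"):
#     ...
```

Yielding pairs lazily avoids holding all 25,000 reviews of a split in memory at once; wrap the call in `list(...)` if random access is needed.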