https://github.com/zhanlaoban/nlp_pemdc

NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC). The collection will keep updating.
collections dataset datasets nlp word2vec

Host: GitHub
URL: https://github.com/zhanlaoban/nlp_pemdc
Owner: zhanlaoban
Created: 2019-04-18T01:44:54.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2020-01-14T02:22:52.000Z (over 5 years ago)
Last Synced: 2025-01-05T18:52:06.653Z (5 months ago)
Topics: collections, dataset, datasets, nlp, word2vec
Homepage:
Size: 33.2 KB
Stars: 64
Watchers: 3
Forks: 16
Open Issues: 0
Metadata Files:
- Readme: README.md

# NLP_PEMDC

**NLP** **P**retrained **E**mbeddings, **M**odels and **D**atasets **C**ollections (**NLP_PEMDC**)

A collection of the pretrained word embeddings, models, and datasets for NLP that I come across. The collection will keep updating. These resources are intended for learning and research only.

Entries are not ranked; they appear in the order I added them. Each dataset belongs to its original authors, thanks! If anything here infringes your rights, please email me and I will remove it.

# Pretrained Chinese Word Vectors (Embeddings):

## Word2vec

1. > ### 100+ Pretrained Chinese Word Vectors

   > [GitHub](https://github.com/Embedding/Chinese-Word-Vectors)

   

2. > ### Tencent AI Lab Embedding Corpus for Chinese Words and Phrases

   > [URL](https://ai.tencent.com/ailab/nlp/embedding.html)
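
Pretrained word vectors are commonly shipped in the word2vec text format. The following is a minimal sketch for reading such a file with the standard library only. It assumes (this is not stated in the README) an optional header line `<vocab_size> <dim>` followed by one entry per line, a token and then `dim` floats. The function names `load_text_vectors` and `cosine` and the toy sample are hypothetical, and downloaded files may need to be decompressed first.

```python
import math
from typing import Dict, Iterable, List, Tuple


def load_text_vectors(lines: Iterable[str], limit: int = 0) -> Tuple[int, Dict[str, List[float]]]:
    """Parse word vectors from lines in (assumed) word2vec text format.

    If the first line is a "<vocab_size> <dim>" header, dim is taken from it.
    Otherwise dim is inferred from the first data line, assuming that line's
    token contains no spaces. A positive limit stops after that many vectors.
    """
    vectors: Dict[str, List[float]] = {}
    dim = 0
    for i, raw in enumerate(lines):
        parts = raw.split()
        # Header line: exactly two non-negative integers.
        if i == 0 and len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            dim = int(parts[1])
            continue
        if dim == 0:
            dim = len(parts) - 1
        if dim < 1 or len(parts) <= dim:
            continue  # skip empty or malformed lines
        # The last dim fields are the vector; everything before is the token,
        # which lets multi-word phrases survive when a header gave us dim.
        token = " ".join(parts[:-dim])
        vectors[token] = [float(x) for x in parts[-dim:]]
        if limit and len(vectors) >= limit:
            break
    return dim, vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data standing in for a real file; the values are made up.
sample = ["3 2", "国王 1.0 0.0", "女王 0.8 0.6", "苹果 0.0 1.0"]
dim, vecs = load_text_vectors(sample)
print(dim, len(vecs), cosine(vecs["国王"], vecs["女王"]))
```

For a real file, pass an open file object instead of the list, for example `load_text_vectors(open(path, encoding="utf-8"), limit=50000)`, where `path` is wherever the decompressed vectors were saved. The `limit` argument keeps memory use manageable for very large vocabularies.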

## GloVe

TODO

# Chinese Pre-trained Models

1. > ### Chinese-BERT

   > [GitHub](https://github.com/google-research/bert)

   

2. > ### Chinese-BERT-wwm

   >

   > [GitHub](https://github.com/ymcui/Chinese-BERT-wwm)

3. > ### Chinese-XLNet

   >

   > [GitHub1](https://github.com/brightmart/xlnet_zh)

   >

   > [GitHub2](https://github.com/ymcui/Chinese-PreTrained-XLNet)

4. > ### Chinese-RoBERTa

   >

   > [GitHub](https://github.com/brightmart/roberta_zh)

5. > ### Chinese-ALBERT

   >

   > [GitHub1](https://github.com/brightmart/albert_zh)

   >

   > [GitHub2](https://github.com/google-research/ALBERT)

# Chinese Corpus:

1. > ### [collections]Large Scale Chinese Corpus for NLP

   > [GitHub](https://github.com/brightmart/nlp_chinese_corpus)

2. > ### [collections]Sogou Labs Corpus Collection

   > [Corpus data](http://www.sogou.com/labs/resource/list_yuliao.php)

   

3. > ### [collections]ChineseNlpCorpus

   >

   > [GitHub](https://github.com/SophonPlus/ChineseNlpCorpus)

4. > ### [collections]ChineseGLUE

   >

   > [GitHub](https://github.com/chineseGLUE/chineseGLUE)

   >

   > **Currently including:**

   >

   > 1. ##### LCQMC Semantic Similarity Task on Colloquial Descriptions [COLING 2018](https://www.aclweb.org/anthology/C18-1166/)

   >

   > 2. ##### XNLI Natural Language Inference [EMNLP 2015](https://www.aclweb.org/anthology/D15-1075/)

   >

   > 3. ##### TNEWS Short Text Classification for Toutiao Chinese News

   >

   > 4. ##### INEWS Sentiment Analysis for Internet News

   >

   > 5. ##### THUCNEWS Long Text Classification

   >

   > 6. ##### iFLYTEK Long Text Classification

   >

   > 7. ##### DRCD Reading Comprehension for Traditional Chinese

   >

   > 8. ##### CMRC2018 Reading Comprehension for Simplified Chinese

   >

   > 9. ##### BQ Question Matching for Customer Service [EMNLP 2018](https://www.aclweb.org/anthology/D18-1536/) [Download](http://icrc.hitsz.edu.cn/Article/show/175.html)

   >

   > 10. ##### MSRANER Named Entity Recognition

   >

   > 11. ##### CHID Chinese IDiom Dataset for Cloze Test

   >

   > 12. ##### CMNLI Chinese Multi-Genre NLI

5. > ### LCSTS: A Large Scale Chinese Short Text Summarization Dataset

   >

   > A large-scale dataset for summarizing short Chinese texts.

   >

   > [arXiv](https://arxiv.org/abs/1506.05865)

   >

   > [Download](http://icrc.hitsz.edu.cn/Article/show/139.html)

6. > ### chinese-poetry: The Most Comprehensive Database of Classical Chinese Poetry and Literature

   > [GitHub](https://github.com/chinese-poetry/chinese-poetry)

7. > ### SentiBridge: A Chinese Entity Sentiment Knowledge Base

   > [GitHub](https://github.com/rainarch/SentiBridge)

# English Corpus:

1. > ### [collections]GLUE

   >

   > [Download](https://gluebenchmark.com/tasks)

   >

   > **Including:**

   >

   > 1.  The Corpus of Linguistic Acceptability 

   > 2.  The Stanford Sentiment Treebank 

   > 3.  Microsoft Research Paraphrase Corpus 

   > 4.  Semantic Textual Similarity Benchmark 

   > 5.  Quora Question Pairs 

   > 6.  MultiNLI Matched 

   > 7.  MultiNLI Mismatched 

   > 8.  Question NLI 

   > 9.  Recognizing Textual Entailment 

   > 10.  Winograd NLI 

   > 11.  Diagnostics Main 

   

2. > ### [collections]SuperGLUE

   >

   > [Download](https://super.gluebenchmark.com/tasks)

   >

   > **Including:**

   >

   > 1.  Broadcoverage Diagnostics 

   > 2.  CommitmentBank 

   > 3.  Choice of Plausible Alternatives 

   > 4.  Multi-Sentence Reading Comprehension 

   > 5.  Recognizing Textual Entailment 

   > 6.  Words in Context 

   > 7.  The Winograd Schema Challenge 

   > 8.  BoolQ 

   > 9.  Reading Comprehension with Commonsense Reasoning 

   > 10.  Winogender Schema Diagnostics 

   

3. > ### IMDB Large Movie Review Dataset

   >

   > A dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data.

   >

   > [Download](https://ai.stanford.edu/~amaas/data/sentiment/)

4. > ### SQuAD2.0

   >

   > The Stanford Question Answering Dataset

   >

   > [Website](https://rajpurkar.github.io/SQuAD-explorer/)
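
As a pointer for working with the SQuAD2.0 entry above, here is a minimal sketch that flattens the dataset's nested JSON into one record per question. It assumes the published layout (`data` → `paragraphs` → `qas`, with an `is_impossible` flag marking unanswerable questions), which the README itself does not describe. The helper name `iter_squad_examples` and the toy record are hypothetical.

```python
import json
from typing import Any, Dict, Iterator


def iter_squad_examples(dataset: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per question from SQuAD-style JSON (assumed layout)."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    # Unanswerable questions carry an empty answer list.
                    "answers": [a["text"] for a in qa.get("answers", [])],
                    "is_impossible": qa.get("is_impossible", False),
                }


# Toy record in the assumed layout; the content is made up for illustration.
toy = json.loads("""
{
  "version": "v2.0",
  "data": [{
    "title": "Example",
    "paragraphs": [{
      "context": "The collection lists pretrained embeddings and datasets.",
      "qas": [
        {"id": "q1", "question": "What does the collection list?",
         "answers": [{"text": "pretrained embeddings and datasets", "answer_start": 21}],
         "is_impossible": false},
        {"id": "q2", "question": "Who funded the collection?",
         "answers": [], "is_impossible": true}
      ]
    }]
  }]
}
""")

examples = list(iter_squad_examples(toy))
answerable = [e for e in examples if not e["is_impossible"]]
print(len(examples), len(answerable))
```

With a downloaded file, replace the toy record with `json.load(open(path, encoding="utf-8"))`, where `path` points at the training or development JSON file from the website.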
