https://github.com/Embedding/Chinese-Word-Vectors
100+ Chinese Word Vectors 上百种预训练中文词向量
- Host: GitHub
- URL: https://github.com/Embedding/Chinese-Word-Vectors
- Owner: Embedding
- License: apache-2.0
- Created: 2018-01-09T09:48:49.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2023-10-30T14:44:50.000Z (about 1 year ago)
- Last Synced: 2024-10-29T15:02:25.612Z (about 1 month ago)
- Topics: chinese, chinese-word-segmentation, embedding, embeddings, vectors-trained, word-embeddings
- Language: Python
- Homepage:
- Size: 1.42 MB
- Stars: 11,818
- Watchers: 285
- Forks: 2,317
- Open Issues: 59
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome - Chinese-Word-Vectors - 100+ Chinese Word Vectors 上百种预训练中文词向量 (Python)
- awesome-ai-list-guide - Chinese-Word-Vectors
- my-awesome - Embedding/Chinese-Word-Vectors - word-segmentation,embedding,embeddings,vectors-trained,word-embeddings pushed_at:2023-10 star:11.9k fork:2.3k 100+ Chinese Word Vectors 上百种预训练中文词向量 (Python)
README
# Chinese Word Vectors 中文词向量
[中文](https://github.com/Embedding/Chinese-Word-Vectors/blob/master/README_zh.md)

This project provides 100+ Chinese Word Vectors (embeddings) trained with different **representations** (dense and sparse), **context features** (word, ngram, character, and more), and **corpora**. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks.
Moreover, we provide a Chinese analogical reasoning dataset **CA8** and an evaluation toolkit for users to evaluate the quality of their word vectors.
## Reference
Please cite the following paper if you use these embeddings or the CA8 dataset.

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du, Analogical Reasoning on Chinese Morphological and Semantic Relations, ACL 2018.
```
@InProceedings{P18-2023,
author = "Li, Shen
and Zhao, Zhe
and Hu, Renfen
and Li, Wensi
and Liu, Tao
and Du, Xiaoyong",
title = "Analogical Reasoning on Chinese Morphological and Semantic Relations",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "138--143",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-2023"
}
```
A detailed analysis of the relation between the intrinsic and extrinsic evaluations of Chinese word embeddings is shown in the paper:
Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, Lijiao Yang. Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings. Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. (CCL & NLP-NABD 2018 Best Paper)
```
@incollection{qiu2018revisiting,
title={Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings},
author={Qiu, Yuanyuan and Li, Hongzheng and Li, Shen and Jiang, Yingdi and Hu, Renfen and Yang, Lijiao},
booktitle={Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data},
pages={209--221},
year={2018},
publisher={Springer}
}
```
## Format
The pre-trained vector files are in text format. Each line contains a word followed by its vector, with values separated by spaces. The first line records the meta information: the first number is the number of words in the file and the second is the vector dimension.

Besides dense word vectors (trained with SGNS), we also provide sparse vectors (trained with PPMI). They use the same format as liblinear, where the number before the ":" is the dimension index and the number after the ":" is the value.
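The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the project: the function names `load_dense_vectors` and `parse_sparse_pairs` are hypothetical, and the sample data is made up for illustration. The sparse helper only parses liblinear-style `index:value` tokens; how those tokens are laid out on each line of a sparse file is not specified here, so check a downloaded file before relying on it.

```python
import io


def load_dense_vectors(stream):
    """Read dense vectors: a 'count dim' header, then 'word v1 v2 ...' per line."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        # Drop empty fields so trailing spaces do not break the length check.
        parts = [p for p in line.rstrip("\n").split(" ") if p]
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


def parse_sparse_pairs(tokens):
    """Parse liblinear-style 'index:value' tokens into {index: value}."""
    pairs = {}
    for token in tokens:
        index, value = token.split(":")
        pairs[int(index)] = float(value)
    return pairs


# Made-up sample in the dense text format (2 words, 3 dimensions).
sample = io.StringIO("2 3\n中国 0.1 0.2 0.3\n北京 0.4 0.5 0.6\n")
vocab_size, dim, vectors = load_dense_vectors(sample)
print(vocab_size, dim, vectors["北京"])
print(parse_sparse_pairs(["3:0.5", "17:1.25"]))
```

For real files, open them with `open(path, encoding="utf-8")` and pass the file object as `stream`. Because the dense files follow the word2vec text layout, general-purpose embedding loaders that accept that layout should also be able to read them.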
## Pre-trained Chinese Word Vectors
### Basic Settings
| Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word | Iteration | Negative Sampling* |
| --- | --- | --- | --- | --- | --- |
| 5 | Yes | 1e-5 | 10 | 5 | 5 |

\*Only for SGNS.
### Various Domains
Chinese Word Vectors trained with different representations, context features, and corpora.
Word2vec / Skip-Gram with Negative Sampling (SGNS), by corpus (rows) and context features (columns):

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d / PWD: 5555 |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d / PWD: z5b4 | 300d | 300d / PWD: yenb |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 (Baidu Netdisk / Google Drive) | 300d / 300d | 300d / 300d | 300d / 300d | 300d / 300d |

Positive Pointwise Mutual Information (PPMI), by corpus (rows) and context features (columns):

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | Sparse | Sparse | Sparse | Sparse |
| Wikipedia_zh 中文维基百科 | Sparse | Sparse | Sparse | Sparse |
| People's Daily News 人民日报 | Sparse | Sparse | Sparse | Sparse |
| Sogou News 搜狗新闻 | Sparse | Sparse | Sparse | Sparse |
| Financial News 金融新闻 | Sparse | Sparse | Sparse | Sparse |
| Zhihu_QA 知乎问答 | Sparse | Sparse | Sparse | Sparse |
| Weibo 微博 | Sparse | Sparse | Sparse | Sparse |
| Literature 文学作品 | Sparse | Sparse | Sparse | Sparse |
| Complete Library in Four Sections 四库全书* | Sparse | Sparse | NAN | NAN |
| Mixed-large 综合 | Sparse | Sparse | Sparse | Sparse |

\*Character embeddings are provided, since most Hanzi are words in archaic Chinese.
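For readers unfamiliar with the weighting behind the sparse vectors listed above, the sketch below computes textbook PPMI, `max(0, log(P(w, c) / (P(w) P(c))))`, from a list of (word, context) co-occurrences. It is an illustration only: the function name `ppmi` and the toy pairs are hypothetical, and the released vectors were produced by the project's own toolkit, which may apply smoothing or other adjustments not shown here.

```python
import math
from collections import Counter


def ppmi(pairs):
    """Return {(word, context): PPMI weight} for a list of co-occurrence pairs."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    context_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    weights = {}
    for (w, c), n in pair_counts.items():
        # PMI = log( P(w, c) / (P(w) * P(c)) ), with probabilities from counts.
        pmi = math.log((n * total) / (word_counts[w] * context_counts[c]))
        if pmi > 0:
            # Only positive values are stored, which keeps the matrix sparse.
            weights[(w, c)] = pmi
    return weights


# Toy co-occurrence pairs, made up for illustration.
pairs = [
    ("猫", "鱼"), ("猫", "鱼"), ("猫", "的"),
    ("狗", "的"), ("狗", "骨头"), ("鱼", "的"),
]
weights = ppmi(pairs)
for (w, c), value in sorted(weights.items()):
    print(w, c, round(value, 4))
```

Pairs whose PMI is zero or negative are simply absent from the result, which is why the representation is stored in a sparse `index:value` layout rather than as fixed-length dense rows.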
### Various Co-occurrence Information
We release word vectors trained on different co-occurrence statistics. Target and context vectors are often called input and output vectors in related papers.
In this part, one can obtain vectors of arbitrary linguistic units beyond words. For example, character vectors can be found in the context vectors of the word-character setting.
All vectors are trained by SGNS on Baidu Encyclopedia.

| Feature | Co-occurrence Type | Target Word Vectors | Context Word Vectors |
| --- | --- | --- | --- |
| Word | Word → Word | 300d | 300d |
| Ngram | Word → Ngram (1-2) | 300d | 300d |
| Ngram | Word → Ngram (1-3) | 300d | 300d |
| Ngram | Ngram (1-2) → Ngram (1-2) | 300d | 300d |
| Character | Word → Character (1) | 300d | 300d |
| Character | Word → Character (1-2) | 300d | 300d |
| Character | Word → Character (1-4) | 300d | 300d |
| Radical | Radical | 300d | 300d |
| Position | Word → Word (left/right) | 300d | 300d |
| Position | Word → Word (distance) | 300d | 300d |
| Global | Word → Text | 300d | 300d |
| Syntactic Feature | Word → POS | 300d | 300d |
| Syntactic Feature | Word → Dependency | 300d | 300d |

## Representations
Existing word representation methods fall into one of two classes: **dense** and **sparse** representations. The SGNS model (a model in the word2vec toolkit) and the PPMI model are typical methods of these two classes, respectively. The SGNS model trains low-dimensional real-valued (dense) vectors through a shallow neural network; it is also called a neural embedding method. The PPMI model is a sparse bag-of-features representation weighted by the positive pointwise mutual information (PPMI) weighting scheme.
## Context Features
Three context features are commonly used in the word embedding literature: **word**, **ngram**, and **character**. Most word representation methods essentially exploit word-word co-occurrence statistics, namely using words as the context feature **(word feature)**. Inspired by the language modeling problem, we introduce the ngram feature into the context: both word-word and word-ngram co-occurrence statistics are used for training **(ngram feature)**. In Chinese, characters (Hanzi) often convey strong semantics, so we also use word-word and word-character co-occurrence statistics for learning word vectors. The length of character-level ngrams ranges from 1 to 4 **(character feature)**.

Besides word, ngram, and character, other features also have a substantial influence on the properties of word vectors. For example, using the entire text as the context feature can introduce more topic information into word vectors, and using dependency parses as the context feature can add syntactic constraints to word vectors. 17 co-occurrence types are considered in this project.
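To make the character feature concrete, the sketch below enumerates the character-level ngrams of lengths 1 to 4 for a single word. It is an illustration under stated assumptions: `char_ngrams` is a hypothetical helper, not a function from the project's toolkit, and how the toolkit pairs these ngrams with target words during training is not shown.

```python
def char_ngrams(word, min_n=1, max_n=4):
    """Return all character ngrams of `word` with lengths min_n..max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the word.
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


print(char_ngrams("中华人民"))
print(char_ngrams("猫"))
```

A four-character word yields ten ngrams (four of length 1, three of length 2, two of length 3, one of length 4), while a single-character word yields only itself.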
## Corpus
We collected corpora across a variety of domains. All text data are preprocessed by removing HTML and XML tags so that only the plain text is kept, and [HanLP(v_1.5.3)](https://github.com/hankcs/HanLP) is used for word segmentation. In addition, traditional Chinese characters are converted into simplified characters with [Open Chinese Convert (OpenCC)](https://github.com/BYVoid/OpenCC). Detailed corpus information is listed below:

| Corpus | Size | Tokens | Vocabulary Size | Description |
| --- | --- | --- | --- | --- |
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese Encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017) http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou labs http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data provided by NLPIR Lab http://www.nlpir.org/wordpress/download/weibo.7z |
| Literature 文学作品 | 0.93G | 177M | 702K | 8599 modern Chinese literature works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | We build the large corpus by merging the above corpora. |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China. |

All words are counted, including low-frequency words.
## Toolkits
All word vectors are trained with the [ngram2vec](https://github.com/zhezhaoa/ngram2vec/) toolkit. The ngram2vec toolkit is a superset of the [word2vec](https://github.com/svn2github/word2vec) and [fasttext](https://github.com/facebookresearch/fastText) toolkits, and supports arbitrary context features and models.
## Chinese Word Analogy Benchmarks
The quality of word vectors is often evaluated with analogy question tasks. In this project, two benchmarks are used for evaluation. The first is CA-translated, where most analogy questions are directly translated from an English benchmark. Although CA-translated has been widely used in Chinese word embedding papers, it contains only three types of semantic questions and covers 134 Chinese words. In contrast, CA8 is designed specifically for the Chinese language. It contains 17813 analogy questions and covers comprehensive morphological and semantic relations. CA-translated, CA8, and their detailed descriptions are provided in the [**testsets**](https://github.com/Embedding/Chinese-Word-Vectors/tree/master/testsets) folder.
## Evaluation Toolkit
We provide an evaluation toolkit in the [**evaluation**](https://github.com/Embedding/Chinese-Word-Vectors/tree/master/evaluation) folder.

Run the following commands to evaluate dense vectors.
```
$ python ana_eval_dense.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_dense.py -v <vector.txt> -a CA8/semantic.txt
```
Run the following commands to evaluate sparse vectors.
```
$ python ana_eval_sparse.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_sparse.py -v <vector.txt> -a CA8/semantic.txt
```
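To illustrate what an analogy evaluation measures, here is a minimal sketch of one common scoring rule, 3CosAdd: for a question "a is to b as c is to ?", pick the vocabulary word closest to `b - a + c`, excluding the three question words. This is not the code in the evaluation folder, whose scoring may differ. The function names and the toy 4-dimensional vectors are hypothetical and chosen only so the example runs on its own.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the 3CosAdd rule."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    target = [bv - av + cv for av, bv, cv in zip(unit[a], unit[b], unit[c])]
    best_word, best_score = None, -math.inf
    for word, vec in unit.items():
        if word in (a, b, c):
            continue  # the question words themselves are not valid answers
        # Candidates are unit length, so the dot product ranks like cosine.
        score = sum(x * y for x, y in zip(vec, target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def accuracy(vectors, questions):
    """Fraction of in-vocabulary questions (a, b, c, d) answered correctly."""
    covered = [q for q in questions if all(w in vectors for w in q)]
    if not covered:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in covered)
    return correct / len(covered)


# Toy vectors, made up for illustration.
toy = {
    "男人": [1.0, 0.0, 0.0, 0.0],
    "女人": [1.0, 1.0, 0.0, 0.0],
    "国王": [1.0, 0.0, 1.0, 0.0],
    "王后": [1.0, 1.0, 1.0, 0.0],
    "苹果": [0.0, 0.0, 0.0, 1.0],
}
questions = [("男人", "女人", "国王", "王后"), ("男人", "女人", "皇帝", "皇后")]
print(solve_analogy(toy, "男人", "女人", "国王"))
print(accuracy(toy, questions))
```

The second toy question contains words missing from the vocabulary, so `accuracy` skips it. Whether such questions are skipped or counted as errors changes the reported score, so it is worth checking which convention a given evaluation script follows.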