https://github.com/dalinvip/cw2vec
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
https://github.com/dalinvip/cw2vec
cw2vec embeddings fasttext stroke-information word2vec
Last synced: about 1 year ago
JSON representation
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
- Host: GitHub
- URL: https://github.com/dalinvip/cw2vec
- Owner: dalinvip
- License: apache-2.0
- Created: 2018-05-03T07:06:33.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2023-03-20T02:57:13.000Z (over 3 years ago)
- Last Synced: 2025-03-31T13:19:32.007Z (about 1 year ago)
- Topics: cw2vec, embeddings, fasttext, stroke-information, word2vec
- Language: C++
- Homepage:
- Size: 1.44 MB
- Stars: 273
- Watchers: 11
- Forks: 66
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
## Introduction ##
Paper Link: [cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information](http://www.statnlp.org/wp-content/uploads/papers/2018/cw2vec/cw2vec.pdf)
Paper Detail Summary: [cw2vec理论及其实现](https://bamtercelboo.github.io/2018/05/11/cw2vec/)
## Requirements ##
>cmake version 3.10.0-rc5
>make GNU Make 4.1
>gcc version 5.4.0
## Run Demo ##
- I have uploaded `word2vec` binary executable file in `cw2vec/word2vec/bin` and rewrite `run.sh` for simple test, you can run `run.sh` directly for simple test.
- According to the *[Building cw2vec using cmake](https://github.com/bamtercelboo/cw2vec#building-cw2vec-using-cmake)* to recompile and run other model with the *[Example use cases](https://github.com/bamtercelboo/cw2vec#example-use-cases)*.
## Building cw2vec using cmake ##
git clone git@github.com:bamtercelboo/cw2vec.git
cd cw2vec && cd word2vec && cd build
cmake ..
make
cd ../bin
This will create the word2vec binary and also all relevant libraries.
## Example use cases ##
the repo not only implement cw2vec(named **substoke**), but also the **skipgram**, **cbow** of word2vec, furthermore, fasttext skipgram is implemented(named **subword**).
Please modify train.txt and feature.txt into your own train document.
skipgram: ./word2vec skipgram -input train.txt -output skipgram_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -thread 8 -t 1e-4 -lrUpdateRate 100
cbow: ./word2vec cbow -input train.txt -output cbow_out -lr 0.05 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -thread 8 -t 1e-4 -lrUpdateRate 100
subword: ./word2vec subword -input train.txt -output subword_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -minn 3 -maxn 6 -thread 8 -t 1e-4 -lrUpdateRate 100
substoke: ./word2vec substoke -input train.txt -infeature feature.txt -output substoke_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -minn 3 -maxn 18 -thread 8 -t 1e-4 -lrUpdateRate 100
## Get chinese stoke feature ##
substoke model need chinese stoke feature(`-infeature`),I have written a script to acquire the Chinese character of stroke information from [handian](http://www.zdic.net/). here is the script [extract_zh_char_stoke](https://github.com/bamtercelboo/corpus_process_script/tree/master/extract_zh_char_stoke), see the readme for details.
Now, I have uploaded a file of stroke features in simplified Chinese, which contains a total of 20901 Chinese characters for use. The file in the [Simplified_Chinese_Feature](https://github.com/bamtercelboo/cw2vec/blob/master/Simplified_Chinese_Feature/sin_chinese_feature.txt) folder. Or you can use the above script to get it yourself.
**feature file(feature.txt) like this**:
中 丨フ一丨
国 丨フ一一丨一丶一
庆 丶一ノ一ノ丶
假 ノ丨フ一丨一一フ一フ丶
期 一丨丨一一一ノ丶ノフ一一
香 ノ一丨ノ丶丨フ一一
江 丶丶一一丨一
将 丶一丨ノフ丶一丨丶
涌 丶丶一フ丶丨フ一一丨
入 ノ丶
人 ノ丶
潮 丶丶一一丨丨フ一一一丨ノフ一一
......
I provided a feature file for the test,path is `sample/substoke_feature.txt`.
## Substoke model output embeddings ##
- In this paper, the context word embeddings is used directly as the final word vector. However, according to the idea of fasttext, I also take into account the n-gram feature vector of the stroke information, the n-gram feature vector of the stroke information is taken as an average substitute for the word vector.
- There are two outputs in substoke model:
- output ends with vec is the context word vector.
- output ends with avg is the n-gram feature vector average.
## Word similarity evaluation ##
#### 1. Evaluation script ####
I have already written a Chinese word similarity evaluation script. [Chinese-Word-Similarity-and-Word-Analogy](https://github.com/bamtercelboo/Chinese_Word_Similarity_and_Word_Analogy), see the readme for details.
#### 2. Parameter Settings ####
The parameters are set as follows:
dim 100
window sizes 5
negative 5
epoch 5
minCount 10
lr skipgram(0.025),cbow(0.05),substoke(0.025)
n-gram minn=3, maxn=18
#### 3. result ####
Experimental results show follows


## Full documentation ##
Invoke a command without arguments to list available arguments and their default values:
./word2vec
usage: word2vec
The commands supported by word2vec are:
skipgram ------ train word embedding by use skipgram model
cbow ------ train word embedding by use cbow model
subword ------ train word embedding by use subword(fasttext skipgram) model
substoke ------ train chinses character embedding by use substoke(cw2vec) model
./word2vec substoke -h
Train Embedding By Using [substoke] model
Here is the help information! Usage:
The Following arguments are mandatory:
-input training file path
-infeature substoke feature file path
-output output file path
The Following arguments are optional:
-verbose verbosity level[2]
The following arguments for the dictionary are optional:
-minCount minimal number of word occurences default:[10]
-bucket number of buckets default:[2000000]
-minn min length of char ngram default:[3]
-maxn max length of char ngram default:[6]
-t sampling threshold default:[0.001]
The following arguments for training are optional:
-lr learning rate default:[0.05]
-lrUpdateRate change the rate of updates for the learning rate default:[100]
-dim size of word vectors default:[100]
-ws size of the context window default:[5]
-epoch number of epochs default:[5]
-neg number of negatives sampled default:[5]
-loss loss function {ns} default:[ns]
-thread number of threads default:[1]
-pretrainedVectors pretrained word vectors for supervised learning default:[]
-saveOutput whether output params should be saved default:[false]
## References ##
[1] [Cao, Shaosheng, et al. "cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information." (2018). ](http://www.statnlp.org/wp-content/uploads/papers/2018/cw2vec/cw2vec.pdf)
[2][ Bojanowski, Piotr, et al. "Enriching word vectors with subword information." arXiv preprint arXiv:1607.04606 (2016).](https://arxiv.org/pdf/1607.04606.pdf)
[3] [fastText-github](https://github.com/facebookresearch/fastText)
[4] [cw2vec理论及其实现](https://bamtercelboo.github.io/2018/05/11/cw2vec/)