{"id":21936646,"url":"https://github.com/dalinvip/cw2vec","last_synced_at":"2025-04-07T15:11:24.296Z","repository":{"id":170209432,"uuid":"131955509","full_name":"dalinvip/cw2vec","owner":"dalinvip","description":"cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information ","archived":false,"fork":false,"pushed_at":"2023-03-20T02:57:13.000Z","size":1513,"stargazers_count":273,"open_issues_count":3,"forks_count":66,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-03-31T13:19:32.007Z","etag":null,"topics":["cw2vec","embeddings","fasttext","stroke-information","word2vec"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dalinvip.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-05-03T07:06:33.000Z","updated_at":"2025-01-19T13:17:18.000Z","dependencies_parsed_at":null,"dependency_job_id":"abc2ca7d-72c2-4b4a-b222-92c3d9ba93eb","html_url":"https://github.com/dalinvip/cw2vec","commit_stats":null,"previous_names":["dalinvip/cw2vec","bamtercelboo/cw2vec"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dalinvip%2Fcw2vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dalinvip%2Fcw2vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dalinvip%2Fcw2vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dalinvip%2Fcw2vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dalinvip","download_url":"https://codeload.gith
ub.com/dalinvip/cw2vec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247675609,"owners_count":20977378,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cw2vec","embeddings","fasttext","stroke-information","word2vec"],"created_at":"2024-11-29T01:15:35.353Z","updated_at":"2025-04-07T15:11:24.271Z","avatar_url":"https://github.com/dalinvip.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n## Introduction ##\n\nPaper Link: [cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information](http://www.statnlp.org/wp-content/uploads/papers/2018/cw2vec/cw2vec.pdf)  \n\nDetailed paper summary: [cw2vec理论及其实现](https://bamtercelboo.github.io/2018/05/11/cw2vec/)\n\n## Requirements ##\n\n\u003ecmake version 3.10.0-rc5  \n\u003emake  GNU Make 4.1  \n\u003egcc  version 5.4.0\n\n## Run Demo ##\n\n- I have uploaded the `word2vec` binary executable to `cw2vec/word2vec/bin` and rewritten `run.sh`, so you can run `run.sh` directly for a simple test.\n\n- To recompile, follow *[Building cw2vec using cmake](https://github.com/bamtercelboo/cw2vec#building-cw2vec-using-cmake)*; to run the other models, see *[Example use cases](https://github.com/bamtercelboo/cw2vec#example-use-cases)*.\n\n\n## Building cw2vec using cmake ##\n\n\tgit clone git@github.com:bamtercelboo/cw2vec.git\n\tcd cw2vec \u0026\u0026 cd word2vec \u0026\u0026 cd build\n\tcmake ..\n\tmake\n\tcd ../bin\n\nThis will create the word2vec binary and also all relevant libraries.\n\n## Example use cases 
##\nThis repo implements not only cw2vec (named **substoke**) but also the **skipgram** and **cbow** models of word2vec, as well as fastText skipgram (named **subword**).  \n\n\tPlease replace train.txt and feature.txt with your own training files.\n\n\tskipgram: ./word2vec skipgram -input train.txt -output skipgram_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -thread 8 -t 1e-4 -lrUpdateRate 100  \n\n\tcbow:     ./word2vec cbow -input train.txt -output cbow_out -lr 0.05 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -thread 8 -t 1e-4 -lrUpdateRate 100\n\n\tsubword:  ./word2vec subword -input train.txt -output subword_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -minn 3 -maxn 6 -thread 8 -t 1e-4 -lrUpdateRate 100\n\n\tsubstoke: ./word2vec substoke -input train.txt -infeature feature.txt -output substoke_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -minn 3 -maxn 18 -thread 8 -t 1e-4 -lrUpdateRate 100\n\n\n\n\n## Get Chinese stroke features ##\nThe substoke model needs Chinese stroke features (`-infeature`). I have written a script that collects stroke information for Chinese characters from [handian](http://www.zdic.net/). The script is [extract_zh_char_stoke](https://github.com/bamtercelboo/corpus_process_script/tree/master/extract_zh_char_stoke); see its readme for details.  \n\nI have also uploaded a stroke feature file for simplified Chinese, which covers a total of 20901 Chinese characters. The file is in the [Simplified_Chinese_Feature](https://github.com/bamtercelboo/cw2vec/blob/master/Simplified_Chinese_Feature/sin_chinese_feature.txt) folder.  
Alternatively, you can use the script above to generate it yourself.\n\n\n\n**The feature file (feature.txt) looks like this**:\n\n\t中 丨フ一丨\n\t国 丨フ一一丨一丶一\n\t庆 丶一ノ一ノ丶\n\t假 ノ丨フ一丨一一フ一フ丶\n\t期 一丨丨一一一ノ丶ノフ一一\n\t香 ノ一丨ノ丶丨フ一一\n\t江 丶丶一一丨一\n\t将 丶一丨ノフ丶一丨丶\n\t涌 丶丶一フ丶丨フ一一丨\n\t入 ノ丶\n\t人 ノ丶\n\t潮 丶丶一一丨丨フ一一一丨ノフ一一\n\t......\n\nI have provided a feature file for testing at `sample/substoke_feature.txt`.\n\n\n## Substoke model output embeddings ##\n\n- In the paper, the context word embeddings are used directly as the final word vectors. Following the idea of fastText, however, I also output the average of a word's stroke n-gram feature vectors as an alternative word vector. \n\n-  The substoke model produces two outputs:\n\t-  the output ending in vec contains the context word vectors.\n\t-  the output ending in avg contains the averaged n-gram feature vectors.\n\n\n## Word similarity evaluation ##\n\n#### 1. Evaluation script ####\nI have written a Chinese word similarity evaluation script, [Chinese-Word-Similarity-and-Word-Analogy](https://github.com/bamtercelboo/Chinese_Word_Similarity_and_Word_Analogy); see its readme for details.\n\n#### 2. Parameter Settings ####\nThe parameters are set as follows:  \n\n\tdim  100\n\twindow sizes  5\n\tnegative  5\n\tepoch  5\n\tminCount  10\n\tlr  skipgram(0.025), cbow(0.05), substoke(0.025)\n\tn-gram  minn=3, maxn=18\n\n#### 3. 
Results ####\nThe experimental results are shown below.  \n\n![](https://i.imgur.com/u0O6RoE.jpg)\n  \n![](https://i.imgur.com/p4gjsaD.jpg)\n\n\n## Full documentation ##\nInvoke a command without arguments to list available arguments and their default values:\n\n\t./word2vec \n\tusage: word2vec \u003ccommand\u003e \u003cargs\u003e\n\tThe commands supported by word2vec are:\n\n\tskipgram  ------ train word embedding by use skipgram model\n\tcbow      ------ train word embedding by use cbow model\n\tsubword   ------ train word embedding by use subword(fasttext skipgram)  model\n\tsubstoke  ------ train chinses character embedding by use substoke(cw2vec) model\n\n\t./word2vec substoke -h\n\tTrain Embedding By Using [substoke] model\n\tHere is the help information! Usage:\n\n\tThe Following arguments are mandatory:\n\t\t-input              training file path\n\t\t-infeature          substoke feature file path\n\t\t-output             output file path\n\t\n\tThe Following arguments are optional:\n\t\t-verbose            verbosity level[2]\n\n\tThe following arguments for the dictionary are optional:\n\t\t-minCount           minimal number of word occurences default:[10]\n\t\t-bucket             number of buckets default:[2000000]\n\t\t-minn               min length of char ngram default:[3]\n\t\t-maxn               max length of char ngram default:[6]\n\t\t-t                  sampling threshold default:[0.001]\n\n\tThe following arguments for training are optional:\n\t\t-lr                 learning rate default:[0.05]\n\t\t-lrUpdateRate       change the rate of updates for the learning rate default:[100]\n\t\t-dim                size of word vectors default:[100]\n\t\t-ws                 size of the context window default:[5]\n\t\t-epoch              number of epochs default:[5]\n\t\t-neg                number of negatives sampled default:[5]\n\t\t-loss               loss function {ns} default:[ns]\n\t\t-thread             number of threads default:[1]\n\t\t-pretrainedVectors  
pretrained word vectors for supervised learning default:[]\n\t\t-saveOutput         whether output params should be saved default:[false]\n\n## References ##\n[1] [Cao, Shaosheng, et al. \"cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information.\" (2018).](http://www.statnlp.org/wp-content/uploads/papers/2018/cw2vec/cw2vec.pdf)  \n[2] [Bojanowski, Piotr, et al. \"Enriching word vectors with subword information.\" arXiv preprint arXiv:1607.04606 (2016).](https://arxiv.org/pdf/1607.04606.pdf)  \n[3] [fastText-github](https://github.com/facebookresearch/fastText)  \n[4] [cw2vec理论及其实现](https://bamtercelboo.github.io/2018/05/11/cw2vec/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdalinvip%2Fcw2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdalinvip%2Fcw2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdalinvip%2Fcw2vec/lists"}