https://github.com/lokicui/doc2vec-golang
doc2vec and word2vec implemented in Go, for word and document embedding representations.
- Host: GitHub
- URL: https://github.com/lokicui/doc2vec-golang
- Owner: lokicui
- License: apache-2.0
- Created: 2016-11-11T12:27:15.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2018-03-30T07:15:11.000Z (about 8 years ago)
- Last Synced: 2024-06-20T11:09:43.168Z (almost 2 years ago)
- Topics: doc2vec, doc2vec-golang, golang, word2vec
- Language: Go
- Homepage:
- Size: 4.12 MB
- Stars: 39
- Watchers: 3
- Forks: 11
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# doc2vec-golang
A Go implementation of Tomas Mikolov's word and document embeddings. For the basic idea, see the two original papers, [word2vec](http://arxiv.org/pdf/1301.3781.pdf) and [doc2vec](http://cs.stanford.edu/~quocle/paragraph_vector.pdf). More recently, Andrew M. Dai et al. at Google examined its performance in more [detail](http://arxiv.org/pdf/1507.07998.pdf).
# usage
```shell
[@bjsjs_11_83 doc2vec-golang]$ ./control build
traning Exec build ok
build ok
# The training data (data/zhihu_data.1w) has one document per line, in two tab-separated columns:
# the first column is the document id, and the second is the segmented document, with tokens separated by spaces.
[@bjsjs_11_83 doc2vec-golang]$ ./train data/zhihu_data.1w
Skip-Gram Iter:48 Alpha: 0.000796 Progress: 96.81% Words/sec: 24.27k
2018-03-30 14:53:00.218536235 +0800 CST training end, 1342521 26861
[@bjsjs_11_83 doc2vec-golang]$ ./knn 2.model
please select operation type:
0:word2words
1:doc_likelihood
2:leave one out key words
3:sen2words
4:sen2docs
5:word2docs
6:doc2docs
7:doc2words
0
Enter text:网页
1 网页
0.7823723719117796 不让
0.7651260773728028 浏览
0.7642516944020028 邮件
0.7601415883811553 近
0.7517607921006224 迷恋
0.7492900066365179 等同
0.7485966355448261 传说
0.7463299535930537 基于
0.7447865182221745 版
please select operation type:
0:word2words
1:doc_likelihood
2:leave one out key words
3:sen2words
4:sen2docs
5:word2docs
6:doc2docs
7:doc2words
```
# Dependencies
* [golang](https://golang.org/)
* [msgp](https://github.com/tinylib/msgp)
# Implemented features
* doc2vec supports both the CBOW and Skip-Gram models; both the Negative Sampling and Hierarchical Softmax optimizations are implemented
* online inference of document vectors
* [likelihood of document](http://arxiv.org/abs/1504.07295)
* doc2words
* doc2docs
* word2words
* word2docs
# Unimplemented features
* [WMD (Word Mover's Distance)](https://github.com/hiyijian/doc2vec/blob/master/jmlr.org/proceedings/papers/v37/kusnerb15.pdf)
* [adding synonym semantic constraints to doc2vec](http://home.ustc.edu.cn/~quanliu/papers/SWE.pdf)
* extracting key words from a sentence
# References
* Google's [word2vec](https://code.google.com/archive/p/word2vec/source/default/source) implementation
* [hiyijian/doc2vec](https://github.com/hiyijian/doc2vec)
* [semantic constraints for word2vec](https://github.com/iunderstand/SWE)
* [adding synonym semantic constraints to doc2vec](http://home.ustc.edu.cn/~quanliu/papers/SWE.pdf)