Heterogeneous Word Embedding
- Host: GitHub
- URL: https://github.com/emfomy/hwe
- Owner: emfomy
- License: other
- Created: 2018-09-10T10:06:37.000Z (about 7 years ago)
- Default Branch: ver.C
- Last Pushed: 2018-12-30T13:30:59.000Z (almost 7 years ago)
- Last Synced: 2025-01-29T03:36:01.540Z (8 months ago)
- Language: C
- Size: 1.17 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Heterogeneous Word Embedding
# Information
This library is a C implementation of Heterogeneous Word Embedding (HWE), a general and flexible framework that incorporates each type of contextual feature (e.g., word sense, part-of-speech, topic) to learn feature-specific word embeddings in an explicit fashion.
# Data format
## Training for HWE-POS or HWE-Topic
- Parameter setting: ```-fmode 1```
- Corpus file: Each word is followed by its corresponding feature in parentheses (see the preprocessing sketch after this list).
- Format: `word(FEATURE)`
- Example:
- Original sentence: ```my dog also likes eating sausage.```
- Modified sentence: ```my(PRP$) dog(NN) also(RB) likes(VBZ) eating(VBG) sausage(NN)```
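A tagged corpus in this format can be produced with any POS tagger. The sketch below is one possible preprocessing step, not part of this repository; it assumes NLTK and its tokenizer/tagger resources are installed, and lowercases words while keeping tags uppercase, as noted in the Attention section.

```python
# Sketch (illustrative, not shipped with HWE): convert plain text into the
# word(FEATURE) corpus format expected by -fmode 1.
# Requires NLTK plus its 'punkt' and perceptron-tagger resources.
import nltk

def tag_line(line):
    tokens = nltk.word_tokenize(line)
    tagged = nltk.pos_tag(tokens)  # [(word, POS), ...], Penn Treebank tags
    return " ".join(f"{w.lower()}({t})" for w, t in tagged)

with open("corpus.txt") as fin, open("corpus.tagged", "w") as fout:
    for line in fin:
        fout.write(tag_line(line) + "\n")
```

Any other tagger (or a topic labeller, for HWE-Topic) can be substituted as long as the output keeps the `word(FEATURE)` shape.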
## Training for HWE-Sense
- Parameter setting: ```-fmode 2 -knfile <file>```
- Knowledge file: Each row contains a sense followed by its corresponding words (see the sketch after this list).
- Format: `SENSE word1 word2 ...`
- Example:
- Line1: ```SENSE_FRUIT apple banana grape```
- Line2: ```SENSE_ANIMAL tiger monkey```
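A knowledge file in this format can be derived from a lexical resource such as WordNet. The sketch below is illustrative only: the output file name and the `SENSE_` label scheme are assumptions, not the format of the `demo/wordnetlower.tree` file shipped with the demo.

```python
# Sketch (illustrative): build a sense-to-words knowledge file for -fmode 2 -knfile
# from WordNet via NLTK. Assumes the NLTK 'wordnet' corpus is installed.
from nltk.corpus import wordnet as wn

with open("wordnet.kn", "w") as fout:
    for synset in wn.all_synsets():
        # e.g. synset 'dog.n.01' -> label SENSE_DOG_N_01 (naming scheme is illustrative)
        sense = "SENSE_" + synset.name().upper().replace(".", "_")
        words = sorted({lemma.name().lower() for lemma in synset.lemmas()
                        if "_" not in lemma.name()})  # skip multi-word lemmas
        if len(words) > 1:
            fout.write(sense + " " + " ".join(words) + "\n")
```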
## Attention
- Words are represented in lowercase and features in uppercase.

# Usage
## Compile
```
make hwe
```

## Setting
```
-train <file>
    Use text data from <file> to train the model
-output <file>
    Use <file> to save the resulting word vectors / word clusters
-size <int>
    Set size of word vectors; default is 100
-window <int>
    Set max skip length between words; default is 5
-sample <float>
    Set threshold for occurrence of words. Those that appear with higher frequency in the training data
    will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)
-negative <int>
    Number of negative examples; default is 5, common values are 3 - 10 (0 = not used)
-threads <int>
    Use <int> threads (default 12)
-iter <int>
    Run more training iterations (default 5)
-min-count <int>
    This will discard words that appear less than <int> times; default is 5
-alpha <float>
    Set the starting learning rate; default is 0.025
-debug <int>
    Set the debug mode (default = 2 = more info during training)
-binary <int>
    Save the resulting vectors in binary mode; default is 0 (off)
-save-vocab <file>
    The vocabulary will be saved to <file>
-read-vocab <file>
    The vocabulary will be read from <file>, not constructed from the training data
-fmode <int>
    Enable the feature mode (default = 0)
    0 = only using skip-gram
    1 = predicting self-feature of sequential feature tag
    2 = predicting self-feature of global feature table
-knfile <file>
    The sense-words file will be read from <file>
```

## Example
```
wget http://cs.fit.edu/~mmahoney/compression/enwik8.zip
unzip enwik8.zip
./hwe -train enwik8 -output enwik8.emb -size 100 -window 5 -sample 1e-4 -negative 5 -binary 0 -fmode 2 -knfile demo/wordnetlower.tree -iter 2 -threads 32
```
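The resulting `enwik8.emb` can then be inspected with standard tooling. The sketch below assumes the `-binary 0` output follows the usual word2vec text format (as the word2vec-style options suggest); gensim is an external dependency, not part of this project.

```python
# Sketch: load and query the trained vectors, assuming word2vec text format output.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("enwik8.emb", binary=False)
# Words are lowercase; with -fmode 2 the vocabulary may also contain uppercase feature entries.
print(vectors.most_similar("king", topn=5))  # nearest neighbours by cosine similarity
```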
## Author
* Fan Jhih-Sheng
* Mu Yang

# Reference
* [Jhih-Sheng Fan, Mu Yang, Peng-Hsuan Li and Wei-Yun Ma, “HWE: Word Embedding with Heterogeneous Features”, ICSC2019](https://muyang.pro/file/paper/icsc_2019_hwe.pdf)
# License
Copyright (c) 2017-2018 Fan Jhih-Sheng & Mu Yang under the [CC-BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/). All rights reserved.