
# Heterogeneous Word Embedding

# Information

This library is a C implementation of Heterogeneous Word Embedding (HWE), a general and flexible framework that explicitly incorporates each type of contextual feature (e.g. word sense, part of speech, topic) to learn feature-specific word embeddings.

# Data format

## Training for HWE-POS or HWE-Topic
- Parameter setting: ```-fmode 1```
- Corpus file: Each word is followed by its corresponding feature (see the sketch after this list).
- Format: `word(FEATURE)`
- Example:
  - Original sentence: ```my dog also likes eating sausage.```
  - Modified sentence: ```my(PRP$) dog(NN) also(RB) likes(VBZ) eating(VBG) sausage(NN)```
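
For instance, a corpus that is already tokenized and tagged into two parallel files (one word per token, one tag per token) can be merged into this format with a small helper. The sketch below is an illustration only; the program name and the parallel-file layout are assumptions, not part of HWE:

```
/* merge_tags.c -- hypothetical helper, not part of HWE.
 * Reads a tokenized corpus and a parallel file of feature tags
 * (whitespace-separated, one tag per word) and prints word(FEATURE). */
#include <stdio.h>

int main(int argc, char **argv) {
  if (argc != 3) {
    fprintf(stderr, "usage: %s <words file> <tags file>\n", argv[0]);
    return 1;
  }
  FILE *fw = fopen(argv[1], "r"), *ft = fopen(argv[2], "r");
  if (!fw || !ft) { perror("fopen"); return 1; }
  char word[256], tag[256];
  while (fscanf(fw, "%255s", word) == 1 && fscanf(ft, "%255s", tag) == 1) {
    printf("%s(%s) ", word, tag);   /* e.g. dog(NN) */
  }
  printf("\n");
  fclose(fw);
  fclose(ft);
  return 0;
}
```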

## Training for HWE-Sense

- Parameter setting: ```-fmode 2 -knfile <file>```
- Knowledge file: Each row contains a sense and its corresponding words (see the sketch after this list).
- Format: `SENSE word1 word2 ...`
- Example:
  - Line 1: ```SENSE_FRUIT apple banana grape```
  - Line 2: ```SENSE_ANIMAL tiger monkey```
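
The sketch below is a hypothetical check (not part of HWE) that each row of such a file carries a sense tag followed by at least one member word:

```
/* check_knfile.c -- hypothetical sketch, not part of HWE.
 * Verifies that each row of a knowledge file has a sense tag followed
 * by at least one word, e.g. "SENSE_FRUIT apple banana grape". */
#include <stdio.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "usage: %s <knowledge file>\n", argv[0]);
    return 1;
  }
  FILE *fp = fopen(argv[1], "r");
  if (!fp) { perror("fopen"); return 1; }
  char line[4096];
  long row = 0;
  while (fgets(line, sizeof(line), fp)) {
    row++;
    char sense[256], word[256];
    if (sscanf(line, "%255s %255s", sense, word) < 2)
      fprintf(stderr, "row %ld: expected a sense tag and at least one word\n", row);
  }
  fclose(fp);
  return 0;
}
```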

## Attention
- Words are written in lowercase and features in uppercase.

# Usage

## Compile

```
make hwe
```

## Setting
```
-train <file>
    Use text data from <file> to train the model
-output <file>
    Use <file> to save the resulting word vectors / word clusters
-size <int>
    Set size of word vectors; default is 100
-window <int>
    Set max skip length between words; default is 5
-sample <float>
    Set threshold for occurrence of words. Those that appear with higher frequency in the training data
    will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)
-negative <int>
    Number of negative examples; default is 5, common values are 3 - 10 (0 = not used)
-threads <int>
    Use <int> threads (default 12)
-iter <int>
    Run more training iterations (default 5)
-min-count <int>
    This will discard words that appear less than <int> times; default is 5
-alpha <float>
    Set the starting learning rate; default is 0.025
-debug <int>
    Set the debug mode (default = 2 = more info during training)
-binary <int>
    Save the resulting vectors in binary mode; default is 0 (off)
-save-vocab <file>
    The vocabulary will be saved to <file>
-read-vocab <file>
    The vocabulary will be read from <file>, not constructed from the training data
-fmode <int>
    Enable the Feature mode (default = 0)
    0 = only using skip-gram
    1 = predicting self-feature of sequential feature tag
    2 = predicting self-feature of global feature table
-knfile <file>
    The sense-words file will be read from <file>
```

## Example

```
wget http://cs.fit.edu/~mmahoney/compression/enwik8.zip
unzip enwik8.zip

./hwe -train enwik8 -output enwik8.emb -size 100 -window 5 -sample 1e-4 -negative 5 -binary 0 -fmode 2 -knfile demo/wordnetlower.tree -iter 2 -threads 32
```
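
With `-binary 0` the vectors are written as text. Assuming the output follows the usual word2vec text layout (a `vocab_size dim` header line, then one `word v1 v2 ...` line per entry), the embeddings can be read back with a sketch like the one below; the file name matches the example run above, everything else is illustrative:

```
/* read_emb.c -- hypothetical reader, assumes word2vec-style text output. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  FILE *fp = fopen("enwik8.emb", "r");   /* output of the example run above */
  if (!fp) { perror("fopen"); return 1; }
  long vocab, dim;
  if (fscanf(fp, "%ld %ld", &vocab, &dim) != 2) { fclose(fp); return 1; }
  float *vec = malloc((size_t)dim * sizeof(float));
  char word[256];
  for (long i = 0; i < vocab && fscanf(fp, "%255s", word) == 1; i++) {
    for (long d = 0; d < dim; d++)
      if (fscanf(fp, "%f", &vec[d]) != 1) { free(vec); fclose(fp); return 1; }
    if (i < 3) printf("%-15s first component = %f\n", word, vec[0]);  /* peek at a few rows */
  }
  free(vec);
  fclose(fp);
  return 0;
}
```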

# Authors
* Fan Jhih-Sheng <>
* Mu Yang <>

# Reference

* [Jhih-Sheng Fan, Mu Yang, Peng-Hsuan Li and Wei-Yun Ma, “HWE: Word Embedding with Heterogeneous Features”, ICSC2019](https://muyang.pro/file/paper/icsc_2019_hwe.pdf)

# License
[![License: CC BY-NC-SA 4.0](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)](https://creativecommons.org/licenses/by-nc-sa/4.0/) Copyright (c) 2017-2018 Fan Jhih-Sheng & Mu Yang under the [CC-BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/). All rights reserved.