Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/losyer/compact_reconstruction
This repository accompanies the paper 'Subword-based Compact Reconstruction of Word Embeddings' (Sasaki et al., NAACL 2019).
- Host: GitHub
- URL: https://github.com/losyer/compact_reconstruction
- Owner: losyer
- Created: 2019-04-04T04:39:18.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-06-09T05:05:48.000Z (over 1 year ago)
- Last Synced: 2024-04-20T00:36:15.594Z (7 months ago)
- Language: Python
- Homepage:
- Size: 102 KB
- Stars: 9
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Compact Reconstruction
- This repository accompanies *Subword-based Compact Reconstruction of Word Embeddings* (Sasaki et al., NAACL 2019).

## Table of contents
- [Usage](#usage)
- [Requirements](#requirements)
- [How to train](#how-to-train)
- [How to estimate (OOV) word vectors](#how-to-estimate-oov-word-vectors)
- [Preprocessing of setting files](#preprocessing-of-setting-files)
- [Resources](#resources)

## Usage
### Requirements
- Python version >= 3.7
- chainer
- numpy
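
Chainer and NumPy can be installed with pip. A minimal setup sketch (the repository pins no versions, so any mutually compatible Chainer/NumPy release is an assumption):

```
$ pip install chainer numpy
```

### How to train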
```
$ python src/train.py \
--gpu 0 \
--ref_vec_path crawl-300d-2M-subword.vec \
--freq_path resources/freq_count.crawl-300d-2M-subword.txt \
--multi_hash two \
--maxlen 200 \
--codecs_path resources/ngram_dic.max30.min3 \
--network_type 2 \
--subword_type 4 \
--limit_size 1000000 \
--bucket_size 100000 \
--result_dir ./result \
--hashed_idx \
--unique_false
```
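
The command above corresponds to the SUM-FH configuration. The flag combinations for the model variants are: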
|Model |network_type |subword_type |hashed_idx |
|---|---|---|---|
|SUM-F |2 |0 |✘ |
|SUM-H |2 |0 |✓ |
|KVQ-H |3 |0 |✓ |
|SUM-FH |2 |4 |✓ |
|KVQ-FH |3 |4 |✓ |
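
For example, to train the plain SUM-F variant, set `--subword_type 0` and omit `--hashed_idx`. A sketch reusing the paths from the command above (dropping the hashing-related flags `--multi_hash` and `--bucket_size` is an assumption, since they should only matter for the hashed variants):

```
$ python src/train.py \
--gpu 0 \
--ref_vec_path crawl-300d-2M-subword.vec \
--freq_path resources/freq_count.crawl-300d-2M-subword.txt \
--maxlen 200 \
--codecs_path resources/ngram_dic.max30.min3 \
--network_type 2 \
--subword_type 0 \
--limit_size 1000000 \
--result_dir ./result \
--unique_false
```

### How to estimate (OOV) word vectors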
For estimating OOV word vectors:
```
$ python src/inference.py \
--gpu 0 \
--model_path result/sum/20190625_00_57_18/model_epoch_300 \
--codecs_path resources/ngram_dic.max30.min3 \
--oov_word_path resources/oov_words.txt
```
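
The file passed to `--oov_word_path` is presumably a plain text file listing one out-of-vocabulary word per line (an assumption about the format; the file name and words below are purely illustrative):

```
$ printf 'uncopyrightable\ntransfluent\n' > my_oov_words.txt
```

A file like this can then be supplied via `--oov_word_path` in the command above.

For reconstructing original word embeddings: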
```
$ python src/save_embedding.py \
--gpu 0 \
--inference \
--model_path result/sum/20190625_00_57_18/model_epoch_300
```
## Preprocessing of setting files
- See [preprocessing page](https://github.com/losyer/compact_reconstruction/tree/master/src/preprocess)

## Resources
- See [resource page](https://github.com/losyer/compact_reconstruction/tree/master/resources)