https://github.com/ziyin-dl/word-embedding-dimensionality-selection
On the Dimensionality of Word Embedding
- Host: GitHub
- URL: https://github.com/ziyin-dl/word-embedding-dimensionality-selection
- Owner: ziyin-dl
- License: mit
- Created: 2018-11-16T19:47:11.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2020-06-17T18:35:10.000Z (over 4 years ago)
- Last Synced: 2024-08-03T16:09:03.628Z (5 months ago)
- Language: Python
- Homepage: https://nips.cc/Conferences/2018/Schedule?showEvent=12567
- Size: 29.2 MB
- Stars: 329
- Watchers: 12
- Forks: 44
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
# Word Embedding Dimensionality Selection
This repo implements the dimensionality selection procedure for word embeddings. The procedure is proposed
in the following papers, based on the notion of Pairwise Inner Product (PIP) loss. No longer pick 300 as your word embedding dimensionality!

- Paper:
  * Conference Version: https://nips.cc/Conferences/2018/Schedule?showEvent=12567
  * arXiv: https://arxiv.org/abs/1812.04224
- Slides: https://www.dropbox.com/s/9tix9l4h39k4agn/main.pdf?dl=0
- Video of NeurIPS talk: https://www.facebook.com/nipsfoundation/videos/vb.375737692517476/745243882514297/?type=2&theater

```
@inproceedings{yin2018dimensionality,
title={On the Dimensionality of Word Embedding},
author={Yin, Zi and Shen, Yuanyuan},
booktitle={Advances in Neural Information Processing Systems},
year={2018}
}
```
and
```
@article{yin2018pairwise,
title={Understand Functionality and Dimensionality of Vector Embeddings: the Distributional Hypothesis, the Pairwise Inner Product Loss and Its Bias-Variance Trade-off},
author={Yin, Zi},
journal={arXiv preprint arXiv:1803.00502},
year={2018}
}
```

Currently, we implement the dimensionality selection procedure for the following algorithms:
- Word2Vec (skip-gram)
- GloVe
- Latent Semantic Analysis (LSA)

## How to use the tool
The tool provides an optimal dimensionality for an algorithm on a corpus. For example, you can use it to
obtain the dimensionality for your Word2Vec embedding on the Text8 corpus.
You need to have the following:
- A corpus (`--file [path to corpus]`)
- A config file (yaml) for algorithm-specific parameters (`--config_file [path to config file]`)
- Name of algorithm (`--algorithm [algorithm_name]`)

Run from the root directory as a package, e.g.:
`python -m main --file data/text8.zip --config_file config/word2vec_sample_config.yml --algorithm word2vec`
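
To make "optimal dimensionality" concrete, here is a minimal NumPy sketch of the PIP loss criterion on a synthetic matrix. It is an illustration under stated assumptions, not the code this tool runs: the synthetic spectrum, the noise level, the exponent `alpha = 0.5`, and the helper names `pip_loss` and `embedding_from_svd` are all chosen for the example. The PIP loss between two embeddings is the Frobenius norm of the difference between their pairwise inner product matrices, and the sketch sweeps the dimensionality `k` to find the value that minimizes it against a noise-free reference.

```python
import numpy as np


def pip_loss(e1, e2):
    """Frobenius norm of the difference between the two PIP matrices E @ E.T."""
    return np.linalg.norm(e1 @ e1.T - e2 @ e2.T, ord="fro")


def embedding_from_svd(u, s, k, alpha=0.5):
    """Rank-k embedding U[:, :k] * s[:k]**alpha (alpha is an assumed exponent)."""
    return u[:, :k] * s[:k] ** alpha


rng = np.random.default_rng(0)
n = 60  # toy vocabulary size

# Synthetic "signal" matrix with a decaying spectrum (a stand-in for, e.g., a PMI matrix).
u, _ = np.linalg.qr(rng.standard_normal((n, n)))
v, _ = np.linalg.qr(rng.standard_normal((n, n)))
spectrum = 10.0 / np.arange(1, n + 1)
signal = (u * spectrum) @ v.T

# What one would actually estimate from a finite corpus: signal plus noise.
noisy = signal + 0.1 * rng.standard_normal((n, n))

# Noise-free, full-dimensional reference embedding.
oracle = embedding_from_svd(u, spectrum, n)

# Embeddings of every dimensionality k obtained from the noisy matrix.
u_hat, s_hat, _ = np.linalg.svd(noisy)
losses = [
    pip_loss(embedding_from_svd(u_hat, s_hat, k), oracle) for k in range(1, n + 1)
]
best_k = int(np.argmin(losses)) + 1
print("dimensionality with the smallest PIP loss:", best_k)

# PIP loss ignores rotations of the embedding space: E and E @ Q have the same PIP matrix.
e = embedding_from_svd(u_hat, s_hat, 10)
q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
rotation_gap = pip_loss(e, e @ q)
```

Too small a `k` discards signal directions, while too large a `k` fits noise, so the loss curve reflects a bias-variance trade-off. `rotation_gap` is zero up to floating-point error, which is why PIP loss can compare embeddings that differ only by a rotation.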
## Implement your own
You can extend the implementation if you have another embedding algorithm that is based on matrix factorization.
All you need to do is implement your matrix estimator as a subclass of `SignalMatrix`.
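
As a rough illustration of what a matrix estimator could look like, the sketch below builds a positive PMI (PPMI) co-occurrence matrix from a token list. This README does not describe the interface of `SignalMatrix`, so the base class `SignalMatrixBase`, the method name `construct_matrix`, and the `window` parameter are hypothetical stand-ins; check the repository's own `SignalMatrix` class for the methods a real subclass must provide.

```python
import numpy as np


class SignalMatrixBase:
    """Hypothetical stand-in for the repo's SignalMatrix; the real interface may differ."""

    def construct_matrix(self, tokens):
        raise NotImplementedError


class PPMIMatrix(SignalMatrixBase):
    """Positive PMI matrix from symmetric-window co-occurrence counts."""

    def __init__(self, window=2):
        self.window = window

    def construct_matrix(self, tokens):
        vocab = sorted(set(tokens))
        index = {word: i for i, word in enumerate(vocab)}
        counts = np.zeros((len(vocab), len(vocab)))

        # Count each (center, context) pair within the window on both sides.
        for i, word in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[word], index[tokens[j]]] += 1

        total = counts.sum()
        row = counts.sum(axis=1, keepdims=True)
        col = counts.sum(axis=0, keepdims=True)

        # PMI = log(p(w, c) / (p(w) p(c))); unseen pairs give -inf and are set to 0.
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(counts * total / (row * col))
        pmi[~np.isfinite(pmi)] = 0.0
        return np.maximum(pmi, 0.0), vocab


tokens = "the cat sat on the mat the dog sat on the rug".split()
matrix, vocab = PPMIMatrix(window=2).construct_matrix(tokens)

# Factorize the estimated matrix; a k-dimensional embedding uses the top-k components.
u, s, _ = np.linalg.svd(matrix)
embedding = u[:, :3] * s[:3] ** 0.5
```

Any estimator of this kind yields a matrix whose singular value decomposition gives the embedding, which is the setting the dimensionality selection procedure assumes.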