https://github.com/thunlp-mt/ubilexemd
An Unsupervised Bilingual Lexicon Inducer From Non-Parallel Data by Earth Mover's Distance Minimization
https://github.com/thunlp-mt/ubilexemd
bilingual-lexicon-extraction bilingual-word-embedding
Last synced: 11 months ago
JSON representation
An Unsupervised Bilingual Lexicon Inducer From Non-Parallel Data by Earth Mover's Distance Minimization
- Host: GitHub
- URL: https://github.com/thunlp-mt/ubilexemd
- Owner: THUNLP-MT
- License: apache-2.0
- Created: 2018-12-03T12:12:07.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-12-03T12:13:17.000Z (over 7 years ago)
- Last Synced: 2025-04-14T22:52:38.693Z (about 1 year ago)
- Topics: bilingual-lexicon-extraction, bilingual-word-embedding
- Language: Python
- Homepage:
- Size: 3.85 MB
- Stars: 5
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# UBiLexEMD: An Unsupervised Bilingual Lexicon Inducer From Non-Parallel Data by Earth Mover's Distance Minimization #
This software can produce a bilingual lexicon from non-parallel data without any cross-lingual supervision. It does so by learning a transformation between source word embeddings and target ones by earth mover's distance minimization. The technique is described in the following paper:
> Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. Earth Mover's Distance Minimization for Unsupervised Bilingual Lexicon Induction. In Proceedings of EMNLP, 2017.
## Runtime Environment ##
This software has been tested in the following environment, but should work in a compatible one.
- 64-bit Linux
- Python 2.7 (for WGAN code in the `src` folder)
- Theano
- Lasagne ([bleeding-edge version](http://lasagne.readthedocs.io/en/latest/user/installation.html#bleeding-edge-version) needed as of April, 2017)
- scikit-learn
- Python 3.4 (for the code in the `scripts` folder)
- Matlab R2010b (for EMDOT code)
## Usage ##
### Preparation ###
1\. Specify the variables in the `config` file. For example, if `config` contains the following lines:
config=zh-en
lang1=zh
lang2=en
then the data should be located in `data/zh-en` with file extensions `zh` and `en`. Prepare the `matlab.config` file accordingly.
2\. Prepare the following data in the folder specified in Step 1:
- word2vec.zh/en: Word embeddings, which can be obtained by running word2vec on monolingual data.
- vocab-freq.zh/en: Space-separated word-frequency pairs.
Besides, prepare vocab.zh/en, vec.zh/en, count.zh/en from the above data.
### WGAN ###
Execute `./runWGAN.sh` to obtain a file named `W` that stores the transformation matrix.
### EMDOT ###
1\. Copy the `W` file produced by WGAN into `data/zh-en` (the folder specified in `config`). One such file is provided in this release.
2\. Launch Matlab. In the console, execute:
loadData
SinkhornWithTransformationInitFromData(X_s, X_t, weight_s, weight_t, length(weight_s), length(weight_t));
exit
3\. Process the transport scheme to obtain the translations.
`./processFlow.sh 10`
4\. In `data/zh-en`, result.10 will contain translations of vocab.zh. For each source word, there will be multiple translations after the tab character, separated by space.
## Known Issue ##
It is recommended that the bleeding-edge version of Lasagne works with the latest development version of Theano. It has been tested on Lasagne version 0.2.dev1 and Theano version 0.9.0dev4.dev-RELEASE, but NaN may appear on Theano version 0.8.2.