https://github.com/thunlp-mt/bilex
A Bilingual Lexicon Inducer From Non-Parallel Data
bilingual-lexicon-extraction bilingual-word-embedding
- Host: GitHub
- URL: https://github.com/thunlp-mt/bilex
- Owner: THUNLP-MT
- License: apache-2.0
- Created: 2018-12-03T11:59:10.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-12-03T12:00:00.000Z (over 7 years ago)
- Last Synced: 2025-03-28T10:54:20.438Z (about 1 year ago)
- Topics: bilingual-lexicon-extraction, bilingual-word-embedding
- Language: C
- Homepage:
- Size: 21.5 KB
- Stars: 5
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# BiLex: A Bilingual Lexicon Inducer From Non-Parallel Data #
This software learns a bilingual lexicon from non-parallel data with the help of a small seed lexicon. The technique is described in the following paper:
> Meng Zhang, Haoruo Peng, Yang Liu, Huanbo Luan, and Maosong Sun. Bilingual Lexicon Induction From Non-Parallel Data With Minimal Supervision. In Proceedings of AAAI, 2017.
## Runtime Environment ##
This software has been tested in the following environment and should also work in compatible ones.
- 64-bit Linux
- Python 3.4
- GCC 4.9.4
## Usage ##
1\. Compile the code.
`./compile.sh`
2\. Specify the variables in the `config` file. For example, if `config` contains the following lines:

    config=zh-en
    lang1=zh
    lang2=en

then the data should be located in `data/zh-en` with file extensions `zh` and `en`.
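
The mapping from `config` to the data location can be sketched in a few lines of Python. This is only an illustration, not part of BiLex: it assumes `config` holds plain `key=value` lines as in the example above, and the helper names `parse_config` and `data_layout` are hypothetical.

```python
from pathlib import Path


def parse_config(text):
    """Parse `key=value` lines into a dict, skipping blanks and `#` comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def data_layout(settings, root="data"):
    """Return the data folder and the two file extensions implied by the settings."""
    folder = Path(root) / settings["config"]
    return folder, "." + settings["lang1"], "." + settings["lang2"]


example = "config=zh-en\nlang1=zh\nlang2=en\n"
folder, ext1, ext2 = data_layout(parse_config(example))
print(folder.as_posix(), ext1, ext2)  # data/zh-en .zh .en
```

Listing `folder` for files ending in `ext1` and `ext2` before training is a quick way to check that the data is where `run.sh` is expected to look.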
3\. Prepare data according to Step 2. Toy non-parallel data is provided, along with a Chinese-English seed lexicon with 100 word translation pairs. If your seed lexicon has more than 10000 entries, you need to modify the code by redefining `MAX_LEXICON_SIZE`.
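
Before training, it may help to check whether a seed lexicon stays within that limit. The sketch below is not part of BiLex. It assumes the seed lexicon stores one translation pair per non-empty line, which this README does not state, and the names `count_entries` and `exceeds_limit` are hypothetical.

```python
DEFAULT_MAX_LEXICON_SIZE = 10000  # limit quoted in this README


def count_entries(lines):
    """Count non-empty lines (assumption: one translation pair per line)."""
    return sum(1 for line in lines if line.strip())


def exceeds_limit(lines, limit=DEFAULT_MAX_LEXICON_SIZE):
    """Return True if the lexicon has more entries than the limit allows."""
    return count_entries(lines) > limit


# Made-up entries, used only to exercise the functions.
toy_lexicon = ["source1 target1\n", "\n", "source2 target2\n"]
print(count_entries(toy_lexicon), exceeds_limit(toy_lexicon))  # 2 False
```

To check a real file, pass an open file object, which yields its lines, in place of `toy_lexicon`.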
4\. Train and obtain the bilingual lexicon.
`./run.sh`
5\. The following files will be generated in `data/zh-en` (the folder specified in `config`):
- `word-vec.zh/en`: Bilingual word embeddings in a human-readable format. `vocab.zh/en` and `vec.zh/en` are extracted from these files.
- `vocab.zh/en`: Vocabularies.
- `vec.zh/en`: Bilingual word embeddings.
- `result`: Translations of the words in `vocab.zh`. Each source word is followed by a tab character and at most 10 translations, each in the format `<translation>:<similarity>`, separated by spaces and sorted in decreasing order of cosine similarity. `</s>` is the sentence marker; its translations should be ignored.
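
A reader for the `result` file might look like the sketch below. It is not part of BiLex and rests on two assumptions: each item after the tab is a translation and its cosine similarity joined by a colon, and the sentence marker is `</s>`, as in word2vec. The function names and sample lines are made up for illustration.

```python
SENTENCE_MARKER = "</s>"  # assumption: word2vec-style sentence marker


def parse_result_line(line):
    """Split one line into the source word and a list of (translation, score)."""
    source, _, rest = line.rstrip("\n").partition("\t")
    candidates = []
    for item in rest.split():
        # Split on the last colon so a translation containing ':' stays intact.
        word, sep, score = item.rpartition(":")
        if not sep:
            continue  # skip items that do not match the assumed format
        candidates.append((word, float(score)))
    return source, candidates


def load_lexicon(lines, marker=SENTENCE_MARKER):
    """Map each source word to its candidates, ignoring the sentence marker."""
    lexicon = {}
    for line in lines:
        if not line.strip():
            continue
        source, candidates = parse_result_line(line)
        if source == marker:
            continue
        lexicon[source] = candidates
    return lexicon


# Made-up lines in the assumed format, not real BiLex output.
sample = ["猫\tcat:0.81 kitten:0.64\n", "</s>\t</s>:0.99\n"]
lexicon = load_lexicon(sample)
print(lexicon["猫"][0][0])  # cat
```

Because the candidates are already sorted by decreasing similarity, the first entry of each list is the top-ranked translation.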