https://github.com/thunlp-mt/bilex

A Bilingual Lexicon Inducer From Non-Parallel Data
bilingual-lexicon-extraction bilingual-word-embedding

Host: GitHub
URL: https://github.com/thunlp-mt/bilex
Owner: THUNLP-MT
License: apache-2.0
Created: 2018-12-03T11:59:10.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2018-12-03T12:00:00.000Z (over 7 years ago)
Last Synced: 2025-03-28T10:54:20.438Z (over 1 year ago)
Topics: bilingual-lexicon-extraction, bilingual-word-embedding
Language: C
Homepage:
Size: 21.5 KB
Stars: 5
Watchers: 3
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

# BiLex: A Bilingual Lexicon Inducer From Non-Parallel Data #

This software learns a bilingual lexicon from non-parallel data with the help of a small seed lexicon. The technique is described in the following paper:

> Meng Zhang, Haoruo Peng, Yang Liu, Huanbo Luan, and Maosong Sun. Bilingual Lexicon Induction From Non-Parallel Data With Minimal Supervision. In Proceedings of AAAI, 2017.

## Runtime Environment ##

This software has been tested in the following environment and should also work in compatible ones.

- 64-bit Linux
- Python 3.4
- GCC 4.9.4

## Usage ##

1\. Compile the code.

`./compile.sh`

2\. Specify the variables in the `config` file. For example, if `config` contains the following lines:

    config=zh-en
    lang1=zh
    lang2=en

then the data should be located in `data/zh-en` with file extensions `zh` and `en`.
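As a quick sanity check before running anything, the mapping from `config` to the data location can be sketched in Python. This is only an illustration: `parse_config` and `expected_layout` are hypothetical helper names, not part of BiLex, and the sketch assumes `config` holds plain `key=value` lines like the three shown above.

```python
import os


def parse_config(text):
    """Parse key=value lines into a dict, skipping blank and comment lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def expected_layout(settings, root="data"):
    """Return the data folder and the two file extensions implied by the settings."""
    folder = os.path.join(root, settings["config"])
    return folder, (settings["lang1"], settings["lang2"])


# The three example lines from Step 2.
example = "config=zh-en\nlang1=zh\nlang2=en\n"
folder, extensions = expected_layout(parse_config(example))
print(folder, extensions)
```

With the example values, the folder is `zh-en` under `data` and the extensions are `zh` and `en`, matching the description above.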

3\. Prepare data according to Step 2. Toy non-parallel data is provided, along with a Chinese-English seed lexicon with 100 word translation pairs. If your seed lexicon has more than 10000 entries, you need to modify the code by redefining `MAX_LEXICON_SIZE`.
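To find out whether a seed lexicon exceeds the 10000-entry limit before training, a small check like the following may help. It is a sketch under one assumption: each non-empty line of the seed lexicon holds one translation pair. The seed lexicon's file name and exact line format are not specified here, so the example works on a list of lines rather than a path, and the sample pairs are made up.

```python
# Limit stated above; this Python constant only mirrors that number and is
# not read from the C source.
MAX_LEXICON_SIZE = 10000


def count_entries(lines):
    """Count non-empty lines, assuming one translation pair per line."""
    return sum(1 for line in lines if line.strip())


def needs_larger_limit(lines, limit=MAX_LEXICON_SIZE):
    """Return True when the lexicon has more entries than the limit."""
    return count_entries(lines) > limit


# Made-up sample lines; the real separator between the two words may differ.
toy_lexicon = ["你好 hello\n", "\n", "世界 world\n"]
print(count_entries(toy_lexicon), needs_larger_limit(toy_lexicon))
```

If `needs_larger_limit` returns `True` for your lexicon, `MAX_LEXICON_SIZE` needs to be redefined in the code and the software recompiled.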

4\. Train and obtain the bilingual lexicon.

`./run.sh`

5\. The following files will be generated in `data/zh-en` (the folder specified in `config`):

- word-vec.zh/en: Bilingual word embeddings in a human-readable format. vocab.zh/en and vec.zh/en are extracted from these files.
- vocab.zh/en: Vocabularies.
- vec.zh/en: Bilingual word embeddings.
- result: Translations of vocab.zh. Each line contains a source word, a tab character, and then at most 10 translations, each in the format `translation:similarity`, separated by spaces and sorted in decreasing order of cosine similarity. `</s>` is the sentence marker; its translations should be ignored.
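For downstream use, the `result` file can be read with a few lines of Python. The sketch below is not part of BiLex. It assumes each translation entry is a target word and its cosine similarity joined by a colon, and that the sentence marker is written `</s>`. `parse_result_line` is a hypothetical helper, and the sample line and scores are made up.

```python
def parse_result_line(line, sentence_marker="</s>"):
    """Split one line into (source_word, [(translation, similarity), ...]).

    Returns None for lines without a tab and for the sentence marker entry.
    """
    source, tab, rest = line.rstrip("\n").partition("\t")
    if not tab or source == sentence_marker:
        return None
    candidates = []
    for item in rest.split():
        # Split on the last colon so a translation containing ':' stays intact.
        word, colon, score = item.rpartition(":")
        if not colon:
            continue
        candidates.append((word, float(score)))
    return source, candidates


# Made-up sample line in the assumed format.
sample = "你好\thello:0.83 hi:0.71\n"
parsed = parse_result_line(sample)
print(parsed)
```

Because the translations are already sorted by decreasing similarity, the first candidate of each parsed line is the top-ranked translation.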
