https://github.com/raphaelmerx/dictionary-generation

Generate a bilingual dictionary from a parallel corpus using mGiza

Host: GitHub
URL: https://github.com/raphaelmerx/dictionary-generation
Owner: raphaelmerx
Created: 2023-07-24T05:43:11.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2023-07-24T05:46:08.000Z (almost 2 years ago)
Last Synced: 2025-01-24T14:16:57.648Z (6 months ago)
Language: Python
Size: 2.93 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Generating a bilingual dictionary from a parallel corpus using mGiza

## Installation

1. Install [Snakemake](https://snakemake.readthedocs.io/), for example with Homebrew on macOS: `brew install snakemake`
2. Initialize the git submodules (mosesdecoder and mgiza): `git submodule update --init --recursive`
3. Compile mgiza:
```bash
cd mgiza/mgizapp
cmake .
make
```

## Usage

1. In `config.yaml`, define the config variables `LANG1` and `LANG2`, and list the parallel corpus file prefixes in `TRAIN_PREFIXES`. For example:
```yaml
# config.yaml
LANG1: "en"
LANG2: "tpi"

TRAIN_PREFIXES:
- "bible" # assuming you have bible.en and bible.tpi files in this directory
```
2. Run Snakemake: `snakemake --cores 2`
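
Word alignment expects the two sides of each corpus to line up sentence by sentence, so a quick check of the input files can save a failed run. The following is a minimal sketch, not part of this repository: `check_parallel_corpus` is a hypothetical helper, and it assumes each prefix maps to `<prefix>.<LANG1>` and `<prefix>.<LANG2>` in one directory, as in the `bible.en` / `bible.tpi` example above.

```python
from pathlib import Path


def check_parallel_corpus(prefixes, lang1, lang2, directory="."):
    """Return a list of problems found; an empty list means the files look usable.

    Hypothetical helper: assumes files are named <prefix>.<lang>.
    """
    problems = []
    for prefix in prefixes:
        counts = {}
        for lang in (lang1, lang2):
            path = Path(directory) / f"{prefix}.{lang}"
            if not path.is_file():
                problems.append(f"missing file: {path}")
                continue
            # Count lines; each line is expected to hold one sentence.
            with path.open(encoding="utf-8") as handle:
                counts[lang] = sum(1 for _ in handle)
        # Only compare line counts when both sides were found.
        if len(counts) == 2 and counts[lang1] != counts[lang2]:
            problems.append(
                f"{prefix}: {counts[lang1]} lines in {lang1}, "
                f"{counts[lang2]} lines in {lang2}"
            )
    return problems


if __name__ == "__main__":
    # Values mirror the example config.yaml above.
    for problem in check_parallel_corpus(["bible"], "en", "tpi"):
        print(problem)
```

If the script prints nothing, every listed prefix has both language files with matching line counts, and `snakemake` can be run as in step 2.
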

The run outputs a file `lang1-lang2.dic` with one word pair per line, e.g.
```
en tpi
disease sik
sick sik
illness sik
tuberculosis tb
tb tb
he em
his em
him em
a wanpela
```
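
Since several `LANG1` words can map to the same `LANG2` word (as with `sik` above), it can be handy to group the pairs when reading the dictionary back. The sketch below is an illustration under assumptions, not code from this repository: `load_dictionary` is a hypothetical name, the first line is assumed to be a header naming the two languages, and every other line is assumed to hold exactly two whitespace-separated tokens.

```python
from collections import defaultdict


def load_dictionary(lines):
    """Parse `.dic` lines into (header, {lang2_word: [lang1_words]}).

    Assumes the first non-empty line is a header such as "en tpi".
    """
    rows = [line.split() for line in lines if line.strip()]
    header, entries = tuple(rows[0]), rows[1:]
    by_target = defaultdict(list)
    for row in entries:
        if len(row) != 2:
            # Skip lines that do not match the assumed two-column layout.
            continue
        source_word, target_word = row
        by_target[target_word].append(source_word)
    return header, dict(by_target)


# A few lines taken from the example output above.
sample = """en tpi
disease sik
sick sik
illness sik
he em
his em
"""
header, by_target = load_dictionary(sample.splitlines())
print(header)            # ('en', 'tpi')
print(by_target["sik"])  # ['disease', 'sick', 'illness']
```

Because the function only iterates over lines, an open file handle for the generated `.dic` file can be passed in place of `sample.splitlines()`.
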
