https://github.com/raphaelmerx/dictionary-generation
Generate a bilingual dictionary from a parallel corpus using mGiza
- Host: GitHub
- URL: https://github.com/raphaelmerx/dictionary-generation
- Owner: raphaelmerx
- Created: 2023-07-24T05:43:11.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-07-24T05:46:08.000Z (over 1 year ago)
- Last Synced: 2023-08-04T09:23:01.117Z (over 1 year ago)
- Language: Python
- Size: 2.93 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Getting a dictionary from a parallel corpus using mGiza
## Installation
1. Install [Snakemake](https://snakemake.readthedocs.io/): `brew install snakemake`
2. Install git submodules (mosesdecoder and mgiza): `git submodule update --init --recursive`
3. Compile mgiza:
```bash
cd mgiza/mgizapp
cmake .
make
```
## Usage
1. In `config.yaml`, define the config variables `LANG1` and `LANG2`, and list the parallel corpus file prefixes in `TRAIN_PREFIXES`. E.g.:
```yaml
# config.yaml
LANG1: "en"
LANG2: "tpi"
TRAIN_PREFIXES:
  - "bible"  # assuming you have bible.en and bible.tpi files in this directory
```
2. Run Snakemake: `snakemake --cores 2`

It will output a file `lang1-lang2.dic`, e.g.:
```
en tpi
disease sik
sick sik
illness sik
tuberculosis tb
tb tb
he em
his em
him em
a wanpela
```
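
To use the generated dictionary from Python, something like the following sketch may help. It is not part of this repository: the function name `parse_dictionary` is hypothetical, and it assumes the file looks like the sample above, i.e. a first line naming the two languages followed by one whitespace-separated word pair per line. Lines that do not split into exactly two tokens (for example multi-word entries, if the real output contains any) are skipped.

```python
from collections import defaultdict


def parse_dictionary(text):
    """Parse two-column dictionary text.

    Returns (languages, entries), where languages is the header line as a
    tuple and entries maps each LANG1 word to a list of LANG2 words.
    Assumes whitespace-separated columns with a header on the first line.
    """
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if not lines:
        return (), {}

    header, rows = lines[0], lines[1:]
    entries = defaultdict(list)
    for row in rows:
        if len(row) != 2:
            continue  # assumption: only simple word pairs are kept
        source, target = row
        if target not in entries[source]:
            entries[source].append(target)
    return tuple(header), dict(entries)


if __name__ == "__main__":
    sample = "en tpi\ndisease sik\nsick sik\nhe em\n"
    languages, entries = parse_dictionary(sample)
    print(languages)
    print(entries)
```

To load a real file, read it first (for example with `pathlib.Path(...).read_text(encoding="utf-8")`) and pass the resulting string to `parse_dictionary`.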