{"id":15899630,"url":"https://github.com/stefan-it/nmt-mk-en","last_synced_at":"2025-08-14T10:33:56.392Z","repository":{"id":110974097,"uuid":"97343243","full_name":"stefan-it/nmt-mk-en","owner":"stefan-it","description":"Neural Machine Translation system for Macedonian to English","archived":false,"fork":false,"pushed_at":"2018-04-21T11:37:30.000Z","size":22840,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-28T07:56:15.639Z","etag":null,"topics":["fairseq","macedonian","neural-machine-translation","transformer"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stefan-it.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-07-15T21:28:03.000Z","updated_at":"2021-01-07T16:41:56.000Z","dependencies_parsed_at":null,"dependency_job_id":"132f3980-74d3-4a82-8e86-24422df50bd3","html_url":"https://github.com/stefan-it/nmt-mk-en","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-it%2Fnmt-mk-en","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-it%2Fnmt-mk-en/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-it%2Fnmt-mk-en/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-it%2Fnmt-mk-en/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stefan
-it","download_url":"https://codeload.github.com/stefan-it/nmt-mk-en/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229821822,"owners_count":18129428,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fairseq","macedonian","neural-machine-translation","transformer"],"created_at":"2024-10-06T10:22:06.321Z","updated_at":"2024-12-15T13:08:44.300Z","avatar_url":"https://github.com/stefan-it.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Neural Machine Translation system for Macedonian to English\n\nThis repository contains all data and documentation for building a neural\nmachine translation system for Macedonian to English. This work was done during\nthe M.Sc. course (summer term) [Machine Translation](http://cis.lmu.de/~fraser/mt_2017/)\nheld by [Prof. Dr. 
Alex Fraser](http://cis.lmu.de/~fraser/).\n\n# Dataset\n\nThe [*SETimes corpus*](http://nlp.ffzg.hr/resources/corpora/setimes/) contains\n207,777 parallel sentences for the Macedonian and English language pair.\n\nFor all experiments the corpus was split into training, development and\ntest sets:\n\n| Data set    | Sentences | Download\n| ----------- | --------- | -----------------------------------------------------------------------------------------------------------------------------------------\n| Training    | 205,777   | via [GitHub](https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.train.tgz) or located in `data/setimes.mk-en.train.tgz`\n| Development |   1,000   | via [GitHub](https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.dev.tgz) or located in `data/setimes.mk-en.dev.tgz`\n| Test        |   1,000   | via [GitHub](https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.test.tgz) or located in `data/setimes.mk-en.test.tgz`\n\n# *fairseq* - Facebook AI Research Sequence-to-Sequence Toolkit\n\nThe first NMT system for Macedonian to English is built with [*fairseq*](https://github.com/facebookresearch/fairseq).\nWe trained three systems with different architectures:\n\n* Standard Bi-LSTM\n* CNN as encoder, LSTM as decoder\n* Fully convolutional\n\n## Preprocessing\n\nAll necessary scripts can be found in the `scripts` folder of this repository.\n\nIn the first step, we need to download and extract the parallel *SETimes* corpus\nfor Macedonian to English:\n\n```bash\nwget http://nlp.ffzg.hr/data/corpora/setimes/setimes.en-mk.txt.tgz\ntar -xf setimes.en-mk.txt.tgz\n```\n\nThe `data_preparation.sh` script performs the following steps on the corpus:\n\n* download of the *MOSES* tokenizer script; tokenization of the whole corpus\n* download of the *BPE* scripts; learning and applying *BPE* on the corpus\n\n```bash\n./data_preparation setimes.en-mk.mk.txt setimes.en-mk.en.txt\n```\n\nAfter that the corpus is 
split into training, development and test sets:\n\n```bash\n./split_dataset corpus.clean.bpe.32000.mk corpus.clean.bpe.32000.en\n```\n\nThe following folder structure needs to be created:\n\n```bash\nmkdir {train,dev,test}\n\nmv dev.* dev\nmv train.* train\nmv test.* test\n\nmkdir model-data\n```\n\nAfter that the `fairseq` tool can be invoked to preprocess the corpus:\n\n```bash\nfairseq preprocess -sourcelang mk -targetlang en -trainpref train/train \\\n                   -validpref dev/dev -testpref test/test -thresholdsrc 3 \\\n                   -thresholdtgt 3 -destdir model-data\n```\n\n## Training\n\nAfter the preprocessing steps the three models can be trained.\n\n### Standard Bi-LSTM\n\nWith the following command the Bi-LSTM model can be trained:\n\n```bash\nfairseq train -sourcelang mk -targetlang en -datadir model-data -model blstm \\\n              -nhid 512 -dropout 0.2 -dropout_hid 0 -optim adam -lr 0.0003125 \\\n              -savedir model-blstm\n```\n\n### CNN as encoder, LSTM as decoder\n\nWith the following command the CNN as encoder, LSTM as decoder model can be\ntrained:\n\n```bash\nfairseq train -sourcelang mk -targetlang en -datadir model-data -model conv \\\n              -nenclayer 6 -dropout 0.2 -dropout_hid 0 -savedir model-conv\n```\n\n### Fully convolutional\n\nWith the following command the fully convolutional model can be trained:\n\n```bash\nfairseq train -sourcelang mk -targetlang en -datadir model-data -model fconv \\\n              -nenclayer 4 -nlayer 3 -dropout 0.2 -optim nag -lr 0.25 \\\n              -clip 0.1 -momentum 0.99 -timeavg -bptt 0 -savedir model-fconv\n```\n\n## Decoding\n\n### Standard Bi-LSTM\n\nWith the following command the Bi-LSTM model can decode the test set:\n\n```bash\nfairseq generate -sourcelang mk -targetlang en \\\n                 -path model-blstm/model_best.th7 -datadir model-data -beam 10 \\\n                 -nbest 1 -dataset test \u003e model-blstm/system.output\n```\n\n### CNN as encoder, LSTM as 
decoder\n\nWith the following command the CNN as encoder, LSTM as decoder model can\ndecode the test set:\n\n```bash\nfairseq generate -sourcelang mk -targetlang en -path model-conv/model_best.th7 \\\n                 -datadir model-data -beam 10 -nbest 1 \\\n                 -dataset test \u003e model-conv/system.output\n```\n\n### Fully convolutional\n\nWith the following command the fully convolutional model can decode the test set:\n\n```bash\nfairseq generate -sourcelang mk -targetlang en -path model-fconv/model_best.th7 \\\n                 -datadir model-data -beam 10 -nbest 1 \\\n                 -dataset test \u003e model-fconv/system.output\n```\n\n## Calculating the BLEU-score\n\nWith the helper script `fairseq_bleu.sh` the BLEU-score of all models can be\ncalculated easily. The script expects the system output file as a\ncommand-line argument:\n\n```bash\n./fairseq_bleu.sh model-blstm/system.output\n```\n\n## Results\n\nWe used two numbers of *BPE* merge operations: 16,000 and 32,000. Here are\nthe results on the final test set:\n\n| Model                        | *BPE* merge operations  | BLEU-Score\n| ---------------------------- | ----------------------- | ----------\n| Bi-LSTM                      | 32,000                  | 46.84\n| Bi-LSTM                      | 16,000                  | 47.57\n| CNN encoder, LSTM decoder    | 32,000                  | 19.83\n| CNN encoder, LSTM decoder    | 16,000                  | 9.59\n| Fully convolutional          | 32,000                  | 48.81\n| Fully convolutional          | 16,000                  | **49.03**\n\nThe best BLEU-score was obtained with the fully convolutional model with\n16,000 merge operations.\n\n# *tensor2tensor* - Transformer\n\nThe second NMT system for Macedonian to English is built with the [*tensor2tensor*](https://github.com/tensorflow/tensor2tensor)\nlibrary. 
We trained two systems: one subword-based system and one\ncharacter-based NMT system.\n\n**Notice**: The problem description for this task is found in `translate_enmk.py`\nin the repository root. This problem was once directly included and available\nin *tensor2tensor*. But I decided to replace the integrated *tensor2tensor*\nproblem for Macedonian to English with a more challenging one. To replicate\nall experiments in this repository, the `translate_enmk.py` problem is now a\nuser-defined problem and must be included in the following way:\n\n```bash\ncp translate_enmk.py /tmp\necho \"from . import translate_enmk\" \u003e /tmp/__init__.py\n```\n\nTo use this problem, the `--t2t_usr_dir` command-line option must point to the\nappropriate folder (in this example `/tmp`). For more information about\nuser-defined problems, see the official\n[documentation](https://github.com/tensorflow/tensor2tensor#adding-your-own-components).\n\n## Training (Transformer base)\n\nThe following training steps were tested with *tensor2tensor* version *1.5.1*.\n\nFirst, we create the initial directory structure:\n\n```bash\nmkdir -p t2t_data t2t_datagen t2t_train t2t_output\n```\n\nIn the next step, the training and development datasets are downloaded and\nprepared:\n\n```bash\nt2t-datagen --data_dir=t2t_data --tmp_dir=t2t_datagen/ \\\n  --problem=translate_enmk_setimes32k --t2t_usr_dir /tmp\n```\n\nThen the training step can be started:\n\n```bash\nt2t-trainer --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \\\n  --model=transformer --hparams_set=transformer_base --output_dir=t2t_output \\\n  --t2t_usr_dir /tmp\n```\n\nThe number of GPUs used for training can be specified with the `--worker_gpu`\noption.\n\n## Decoding\n\nIn the next step, the test dataset is downloaded and extracted:\n\n```bash\nwget \"https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.test.tgz\"\ntar -xzf setimes.mk-en.test.tgz\n```\n\nThen the decoding step for the test dataset can be 
started:\n\n```bash\nt2t-decoder --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \\\n  --model=transformer --decode_hparams=\"beam_size=4,alpha=0.6\" \\\n  --decode_from_file=test.mk --decode_to_file=system.output \\\n  --hparams_set=transformer_big --output_dir=t2t_output/ \\\n  --t2t_usr_dir /tmp\n```\n\n## Calculating the BLEU-score\n\nThe BLEU-score can be calculated with the built-in `t2t-bleu` tool:\n\n```bash\nt2t-bleu --translation=system.output --reference=test.en\n```\n\n## Results\n\nThe following results were achieved using the Transformer model. A\ncharacter-based model was also trained and measured. A big Transformer model\nwas also trained using *tensor2tensor* version *1.2.9* (the latest version has\na bug, see [this](https://github.com/tensorflow/tensor2tensor/issues/529) issue).\n\n| Model                        | BLEU-Score\n| ---------------------------- | ----------\n| Transformer                  | **54.00** (uncased)\n| Transformer (big)            | 43.74 (uncased)\n| Transformer (char-based)     | 37.43 (uncased)\n\n# Further work\n\nWe want to train a char-based NMT system with the [dl4mt-c2c](https://github.com/nyu-dl/dl4mt-c2c)\nlibrary in the near future.\n\n# Acknowledgments\n\nWe would like to thank the *Leibniz-Rechenzentrum der Bayerischen Akademie der\nWissenschaften* ([LRZ](https://www.lrz.de/english/)) for giving us access to the\nNVIDIA *DGX-1* supercomputer.\n\n# Presentations\n\n* Short presentation at the [Deep Learning Workshop @ LRZ](https://www.lrz.de/services/compute/courses/2017-09-14_hdlw1s17/),\n  can be found [here](short-presentation/stefan_schweter_dlw17.pdf).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefan-it%2Fnmt-mk-en","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstefan-it%2Fnmt-mk-en","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefan-it%2Fnmt-mk-en/lists"}