{"id":28729763,"url":"https://github.com/didi/wmt2021_triangular_mt","last_synced_at":"2026-03-05T23:42:24.099Z","repository":{"id":73891823,"uuid":"238674669","full_name":"didi/wmt2021_triangular_mt","owner":"didi","description":"The baseline model code for WMT 2021 Triangular MT","archived":false,"fork":false,"pushed_at":"2021-04-07T18:52:01.000Z","size":723,"stargazers_count":13,"open_issues_count":0,"forks_count":2,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-06-15T17:11:29.738Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://www.statmt.org/wmt21/triangular-mt-task.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/didi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-02-06T11:36:09.000Z","updated_at":"2023-11-28T09:59:39.000Z","dependencies_parsed_at":"2023-03-03T08:15:42.698Z","dependency_job_id":null,"html_url":"https://github.com/didi/wmt2021_triangular_mt","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/didi/wmt2021_triangular_mt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/didi%2Fwmt2021_triangular_mt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/didi%2Fwmt2021_triangular_mt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/didi%2Fwmt2021_triangular_mt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/didi%2Fwmt2021_triangular_mt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/d
idi","download_url":"https://codeload.github.com/didi/wmt2021_triangular_mt/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/didi%2Fwmt2021_triangular_mt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30156180,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T22:39:40.138Z","status":"ssl_error","status_checked_at":"2026-03-05T22:39:24.771Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-15T17:10:59.066Z","updated_at":"2026-03-05T23:42:24.076Z","avatar_url":"https://github.com/didi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Baseline code for WMT 2021 Triangular MT\n\nUpdated on 04/07/2021. \n\nThe baseline code for the shared task [`Triangular MT: Using English to improve Russian-to-Chinese machine translation`](http://www.statmt.org/wmt21/triangular-mt-task.html). \n\n## NOTE\n\nAll scripts should be run from the root folder:\n\n```\nbash **.sh\nbash scripts/**.sh\npython scripts/**.py\n```\n\n## Requirements\n\nA Linux machine with a GPU and CUDA \u003e= 10.0 installed.\n\n## Setup\n\n1. Install `miniconda` on your machine.\n2. 
Run `setup_env.sh` in interactive mode:\n    ```bash\n       bash -i setup_env.sh\n    ```\n    Note: If you are using a server outside of China, we recommend deleting the two Tsinghua mirrors in `environment.yml` (lines `3-4`) and `setup_env.sh` (line `9`) for better speed.\n    \n## Registration\n\nTo participate, please register for the shared task on Codalab.\n[`Link to Codalab website`](https://competitions.codalab.org/competitions/30446). \n\n### Detailed Configuration\n\nWe will use the toolkit [`tensor2tensor`](https://github.com/tensorflow/tensor2tensor) to train a Transformer-based NMT system. \n`config/run.ru_zh.big.single_gpu.json` lists all the configurations. \n\n```json\n{\n    \"version\": \"ru_zh.big.single_gpu\",\n    \"processed\": \"ru_zh\",\n    \"hparams\": \"transformer_big_single_gpu\",\n    \"model\": \"transformer\",\n    \"problem\": \"machine_translation\",\n    \"n_gpu\": 1,\n    \"eval_early_stopping_steps\": 14500,\n    \"eval_steps\": 10000,\n    \"local_eval_frequency\": 1000,\n    \"keep_checkpoint_max\": 10,\n    \"beam_size\": [\n        4\n    ],\n    \"alpha\": [\n        1.0\n    ]\n}\n```\n\nThe hyperparameter set is `transformer_big_single_gpu`. \nWe will use only `1` GPU. \nThe model will evaluate the dev loss and save a checkpoint every `1000` steps. \nIf the dev loss doesn't decrease for `14500` steps, training will stop. \nWhen decoding the test set, we will use beam size `4` and an alpha value of `1.0`. \nThe larger the alpha value, the longer the generated translation will be.\n\n`processed` indicates the version of the processed files. 
Here is `config/processed.ru_zh.json`:\n\n```json\n{\n    \"version\": \"ru_zh\",\n    \"train\": \"train.ru_zh\",\n    \"dev\": \"dev.ru_zh\",\n    \"tests\": [\n        \"dev.ru_zh\"\n    ],\n    \"bpe\": true,\n    \"vocab_size\": 30000\n}\n``` \nIt indicates that the training folder is `data/raw/train.ru_zh`, the dev folder is `data/raw/dev.ru_zh`, and the test folder is `data/raw/dev.ru_zh`, i.e. we use the dev set as the test set. \nThe preprocessing pipeline will use byte-pair encoding (BPE), and the number of merge operations is `30000`. \n\n## Train and Decode\n\n\nTo train a Russian-to-Chinese NMT system: \n\n```\nconda activate mt_baseline\nbash pipeline.sh config/run.ru_zh.big.single_gpu.json 1 4\n```\n\n`1` is the start step and `4` is the end step.\n\n- step 1: prepare data\n- step 2: generate tf records\n- step 3: train\n- step 4: decode_test: decode the test set with all combinations of (beam, alpha)\n\nAfter step 4, all the decoded results will be in folder `data/run/ru_zh.big.single_gpu_tmp/decode`:\n* `decode.b4_a1.0.test0.txt`: the decoded BPE subwords using beam size 4 and alpha value 1.0.\n* `decode.b4_a1.0.test0.tok`: the decoded tokens after merging the BPE subwords into whole words.\n* `decode.b4_a1.0.test0.char`: the decoded UTF-8 characters of `decode.b4_a1.0.test0.tok` after removing spaces.\n* `bleu.b4_a1.0.test0.tok`: the token-level BLEU score.\n* `bleu.b4_a1.0.test0.char`: the character-level BLEU score. \n\nThe reference files are in folder `data/run/ru_zh.big.single_gpu_tmp/decode`.\n\n#### Note \n\nWe have released the dev set on Codalab. You can submit your system outputs on Codalab to get the BLEU score on the released dev set. 
You can also download the dev set by registering for the competition on [Codalab](https://competitions.codalab.org/competitions/30446#participate).\n\n## Independent Evaluation Script\n\nFolder `eval` contains the evaluation scripts to calculate the character-level BLEU score:\n\n```\ncd eval\npython bleu.py hyp.txt ref.txt\n```\nHere `hyp.txt` and `ref.txt` can be either normal Chinese (i.e. without spaces between characters) or character-split Chinese.\n\nSee `example.sh` for detailed examples. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdidi%2Fwmt2021_triangular_mt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdidi%2Fwmt2021_triangular_mt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdidi%2Fwmt2021_triangular_mt/lists"}