{"id":19398856,"url":"https://github.com/kyubyong/cjk_trans","last_synced_at":"2026-03-02T12:03:20.163Z","repository":{"id":85026556,"uuid":"196690781","full_name":"Kyubyong/cjk_trans","owner":"Kyubyong","description":"Pre-trained Machine Translation Models of Korean from/to ECJ","archived":false,"fork":false,"pushed_at":"2019-07-15T07:53:32.000Z","size":10,"stargazers_count":29,"open_issues_count":3,"forks_count":2,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-11-18T12:06:33.353Z","etag":null,"topics":["fairseq","machine-translation","pretrained-models","translation"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Kyubyong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2019-07-13T07:12:18.000Z","updated_at":"2024-05-06T06:07:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"07b47b9c-93b5-4d5d-b1dd-3299b464b0a3","html_url":"https://github.com/Kyubyong/cjk_trans","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Kyubyong/cjk_trans","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kyubyong%2Fcjk_trans","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kyubyong%2Fcjk_trans/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kyubyong%2Fcjk_trans/releases","manifests_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/Kyubyong%2Fcjk_trans/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Kyubyong","download_url":"https://codeload.github.com/Kyubyong/cjk_trans/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kyubyong%2Fcjk_trans/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30001652,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-02T11:09:27.951Z","status":"ssl_error","status_checked_at":"2026-03-02T11:08:53.255Z","response_time":60,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fairseq","machine-translation","pretrained-models","translation"],"created_at":"2024-11-10T11:07:30.331Z","updated_at":"2026-03-02T12:03:20.156Z","avatar_url":"https://github.com/Kyubyong.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pre-trained Machine Translation Models of Korean from/to ECJ\n\nPre-trained models are beautiful. They save your time, energy and/or money. 
\nYou can obtain several pre-trained machine translation models for mostly European languages [here](https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md).\nIn this project, I add six other models: Korean \u003c-\u003e English, Chinese, and Japanese, as I failed to find publicly available\n ones.\nNot surprisingly, the biggest challenge in training NMT models for those language pairs is the lack of large parallel corpora.\nI decided to use both public data ([OpenSubtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php)) and private data to overcome the difficulties.\nOverall, their performance may not be so impressive, but you can keep training them with your own data if necessary.\n\n## Requirements\n* python \u003e=3.6\n* pytorch \u003e=1.0\n* [Fairseq](https://github.com/pytorch/fairseq)\n\n\n## Data\n|Language Pair | # Training sents (public + private) | # Test sents (private) |\n|--|--|--|\n|ko-en | 1,845,445 (1,391,190 + 454,255) | 1,050 | \n|ko-zh | 672,450 (485,843 + 186,607) | 1,417 |\n| ko-ja | 2,788,003 (302,063 + 2,485,940) | 1,174 |\n\n## Model\n* [Transformer Base](https://arxiv.org/abs/1706.03762)\n\n## Vocabulary and tokenization\n* Click the links to download the pretrained models and vocabulary files.\n\n|Language | # Vocab. 
| Tokenization |\n|--|--|--|\n|[ko](https://www.dropbox.com/s/hn2osffn1onycxa/wiki.ko.model?dl=0) | [8k](https://www.dropbox.com/s/98vmysovz8hpv6x/wiki.ko.dict?dl=0) |  BPE with sentencepiece | \n|[en](https://www.dropbox.com/s/5xoh2sjic1jalbw/gutenberg.model?dl=0) | [32k](https://www.dropbox.com/s/trcrvhd9vs2iwwa/gutenberg.dict?dl=0) | BPE with sentencepiece |\n| zh | [32k](https://www.dropbox.com/s/x56g5aqjy7pll51/opensubtitles.zh.dict?dl=0) | character |\n| [ja](https://www.dropbox.com/s/37xs58y9hvx9f6f/wiki.ja.model?dl=0) | [8k](https://www.dropbox.com/s/wqk5ba9m2dfbujg/wiki.ja.dict?dl=0) | BPE with sentencepiece |\n\n\n## Pre-trained models and their performance\n\n|  Pre-trained model | BLEU on test set* | \n|--|--|\n|  [ko -\u003e en](https://www.dropbox.com/s/cmvkxxk1zr2cmnf/ko-en.zip?dl=0) | 16.7 |\n|  [en -\u003e ko](https://www.dropbox.com/s/t8l9lk61rwiica5/en-ko.zip?dl=0) | 24.2 |\n| [ko -\u003e zh](https://www.dropbox.com/s/wp2d05403f5r9xq/ko-zh.zip?dl=0) | 17.13 | \n|[zh -\u003e ko](https://www.dropbox.com/s/qe1q4uslmvkyoa2/zh-ko.zip?dl=0) | 23.78 |\n| [ko -\u003e ja](https://www.dropbox.com/s/r00uu48815jx1j1/ko-ja.zip?dl=0) |40.7 |\n|[ja -\u003e ko](https://www.dropbox.com/s/4fs14yvdn0tq24u/ja-ko.zip?dl=0)| 34.6 |\n\n* Evaluation is based on tokenization with tools such as [Mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/) (ko), [NLTK punct](https://www.nltk.org/api/nltk.tokenize.html) (en), [pkuseg](https://github.com/lancopku/pkuseg-python) (zh), and [MeCab](https://github.com/SamuraiT/mecab-python3) (ja).\n\n## Finetuning Examples\n\n```\necho \"ko -\u003e en\"\npython -m torch.distributed.launch  --nproc_per_node 8 FAIRSEQ/train.py    ko-en-bin --arch transformer       --optimizer adam --lr 0.0005 --label-smoothing 0.1 --dropout 0.3       --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt       --weight-decay 0.0001 --criterion label_smoothed_cross_entropy       --max-epoch 80 --warmup-updates 4000 --warmup-init-lr '1e-07' 
   --adam-betas '(0.9, 0.98)'   --save-dir train/ko-en/ckpt  --save-interval 1 --restore-file checkpoint77.pt\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkyubyong%2Fcjk_trans","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkyubyong%2Fcjk_trans","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkyubyong%2Fcjk_trans/lists"}