{"id":13595590,"url":"https://github.com/rsennrich/subword-nmt","last_synced_at":"2025-05-14T05:10:48.048Z","repository":{"id":39352240,"uuid":"41733139","full_name":"rsennrich/subword-nmt","owner":"rsennrich","description":"Unsupervised Word Segmentation for Neural Machine Translation and Text Generation","archived":false,"fork":false,"pushed_at":"2024-08-07T14:21:16.000Z","size":253,"stargazers_count":2203,"open_issues_count":3,"forks_count":464,"subscribers_count":55,"default_branch":"master","last_synced_at":"2024-12-06T21:37:51.316Z","etag":null,"topics":["bpe","machine-translation","neural-machine-translation","nmt","segmentation","subword-units"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rsennrich.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-09-01T10:50:32.000Z","updated_at":"2024-12-05T11:00:12.000Z","dependencies_parsed_at":"2024-11-26T01:15:43.535Z","dependency_job_id":null,"html_url":"https://github.com/rsennrich/subword-nmt","commit_stats":{"total_commits":114,"total_committers":19,"mean_commits":6.0,"dds":0.3157894736842105,"last_synced_commit":"810ee1487a753870ebf90d91ccdb789158268d9f"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsennrich%2Fsubword-nmt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsennrich%2Fsubword-nmt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
rsennrich%2Fsubword-nmt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsennrich%2Fsubword-nmt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rsennrich","download_url":"https://codeload.github.com/rsennrich/subword-nmt/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254076850,"owners_count":22010611,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bpe","machine-translation","neural-machine-translation","nmt","segmentation","subword-units"],"created_at":"2024-08-01T16:01:53.014Z","updated_at":"2025-05-14T05:10:47.876Z","avatar_url":"https://github.com/rsennrich.png","language":"Python","funding_links":[],"categories":["Python","Natural Language Processing","🔹 **BPE (Byte Pair Encoding) Implementations**","Vorverarbeitungstools"],"sub_categories":["Conversation \u0026 Translation","Tokenisierung"],"readme":"Subword Neural Machine Translation\n==================================\n\nThis repository contains preprocessing scripts to segment text into subword\nunits. 
The primary purpose is to facilitate the reproduction of our experiments\non Neural Machine Translation with subword units (see below for reference).\n\nINSTALLATION\n------------\n\ninstall via pip (from PyPI):\n\n    pip install subword-nmt\n\ninstall via pip (from Github):\n\n    pip install https://github.com/rsennrich/subword-nmt/archive/master.zip\n\nalternatively, clone this repository; the scripts are executable stand-alone.\n\n\nUSAGE INSTRUCTIONS\n------------------\n\nCheck the individual files for usage instructions.\n\nTo apply byte pair encoding to word segmentation, invoke these commands:\n\n    subword-nmt learn-bpe -s {num_operations} \u003c {train_file} \u003e {codes_file}\n    subword-nmt apply-bpe -c {codes_file} \u003c {test_file} \u003e {out_file}\n\nTo segment rare words into character n-grams, do the following:\n\n    subword-nmt get-vocab --train_file {train_file} --vocab_file {vocab_file}\n    subword-nmt segment-char-ngrams --vocab {vocab_file} -n {order} --shortlist {size} \u003c {test_file} \u003e {out_file}\n\nThe original segmentation can be restored with a simple replacement:\n\n    sed -r 's/(@@ )|(@@ ?$)//g'\n\nIf you cloned the repository and did not install a package, you can also run the individual commands as scripts:\n\n    ./subword_nmt/learn_bpe.py -s {num_operations} \u003c {train_file} \u003e {codes_file}\n\nBEST PRACTICE ADVICE FOR BYTE PAIR ENCODING IN NMT\n--------------------------------------------------\n\nWe found that for languages that share an alphabet, learning BPE on the\nconcatenation of the (two or more) involved languages increases the consistency\nof segmentation, and reduces the problem of inserting/deleting characters when\ncopying/transliterating names.\n\nHowever, this introduces undesirable edge cases in that a word may be segmented\nin a way that has only been observed in the other language, and is thus unknown\nat test time. 
To prevent this, `apply_bpe.py` accepts a `--vocabulary` and a\n`--vocabulary-threshold` option so that the script will only produce symbols\nwhich also appear in the vocabulary (with at least some frequency).\n\nTo use this functionality, we recommend the following recipe (assuming L1 and L2\nare the two languages):\n\nLearn byte pair encoding on the concatenation of the training text, and get the resulting vocabulary for each:\n\n    cat {train_file}.L1 {train_file}.L2 | subword-nmt learn-bpe -s {num_operations} -o {codes_file}\n    subword-nmt apply-bpe -c {codes_file} \u003c {train_file}.L1 | subword-nmt get-vocab \u003e {vocab_file}.L1\n    subword-nmt apply-bpe -c {codes_file} \u003c {train_file}.L2 | subword-nmt get-vocab \u003e {vocab_file}.L2\n\nmore conveniently, you can do the same with this command:\n\n    subword-nmt learn-joint-bpe-and-vocab --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2\n\nre-apply byte pair encoding with vocabulary filter:\n\n    subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 \u003c {train_file}.L1 \u003e {train_file}.BPE.L1\n    subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 \u003c {train_file}.L2 \u003e {train_file}.BPE.L2\n\nas a last step, extract the vocabulary to be used by the neural network. 
Example with Nematus:\n\n    nematus/data/build_dictionary.py {train_file}.BPE.L1 {train_file}.BPE.L2\n\n[you may want to take the union of all vocabularies to support multilingual systems]\n\nfor test/dev data, re-use the same options for consistency:\n\n    subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 \u003c {test_file}.L1 \u003e {test_file}.BPE.L1\n\nADVANCED FEATURES\n-----------------\n\nOn top of the basic BPE implementation, this repository supports:\n\n- BPE dropout (Provilkov, Emelianenko and Voita, 2019): https://arxiv.org/abs/1910.13267\n  use the argument `--dropout 0.1` for `subword-nmt apply-bpe` to randomly drop out possible merges.\n  Doing this on the training corpus can improve quality of the final system; at test time, use BPE without dropout.\n  In order to obtain reproducible results, argument `--seed` can be used to set the random seed.\n\n  **Note:** In the original paper, the authors used BPE-Dropout on each new batch separately. 
To approximate this, you can copy the training corpus several times to obtain multiple segmentations of the same sentence.\n\n- support for glossaries:\n  use the argument `--glossaries` for `subword-nmt apply-bpe` to provide a list of subwords and/or regular expressions\n  that should always be passed to the output without subword segmentation\n\n```\necho \"I am flying to \u003ccountry\u003eSwitzerland\u003c/country\u003e at noon .\" | subword-nmt apply-bpe --codes subword_nmt/tests/data/bpe.ref\nI am fl@@ y@@ ing to \u003c@@ coun@@ tr@@ y@@ \u003e@@ S@@ w@@ it@@ z@@ er@@ l@@ and@@ \u003c@@ /@@ coun@@ tr@@ y@@ \u003e at no@@ on .\n\necho \"I am flying to \u003ccountry\u003eSwitzerland\u003c/country\u003e at noon .\" | subword-nmt apply-bpe --codes subword_nmt/tests/data/bpe.ref --glossaries \"\u003ccountry\u003e\\w*\u003c/country\u003e\" \"fly\"\nI am fly@@ ing to \u003ccountry\u003eSwitzerland\u003c/country\u003e at no@@ on .\n```\n\n- byte-level BPE: while BPE uses characters as basic units in (Sennrich et al., 2016),\n  [Radford et al., 2019](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)\n  use bytes as basic units. 
This can be enabled with the argument `--bytes` for `subword-nmt learn-bpe`.\n  When applying BPE with `subword-nmt apply-bpe`, no argument is necessary: whether characters or bytes are the basic units is stored in the first line of the BPE file.\n\nPUBLICATIONS\n------------\n\nThe segmentation methods are described in:\n\n```bibtex\n@inproceedings{sennrich-etal-2016-neural,\n    title = \"Neural Machine Translation of Rare Words with Subword Units\",\n    author = \"Sennrich, Rico  and\n      Haddow, Barry  and\n      Birch, Alexandra\",\n    booktitle = \"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)\",\n    month = aug,\n    year = \"2016\",\n    address = \"Berlin, Germany\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/P16-1162\",\n    doi = \"10.18653/v1/P16-1162\",\n    pages = \"1715--1725\",\n}\n```\n\nThe best practice advice is described in:\n\n```bibtex\n@inproceedings{sennrich-etal-2017-university,\n    title = \"The University of {E}dinburgh{'}s Neural {MT} Systems for {WMT}17\",\n    author = \"Sennrich, Rico  and\n      Birch, Alexandra  and\n      Currey, Anna  and\n      Germann, Ulrich  and\n      Haddow, Barry  and\n      Heafield, Kenneth  and\n      Miceli Barone, Antonio Valerio  and\n      Williams, Philip\",\n    booktitle = \"Proceedings of the Second Conference on Machine Translation\",\n    month = sep,\n    year = \"2017\",\n    address = \"Copenhagen, Denmark\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/W17-4739\",\n    doi = \"10.18653/v1/W17-4739\",\n    pages = \"389--399\",\n}\n```\n\nHOW IMPLEMENTATION DIFFERS FROM Sennrich et al. (2016)\n------------------------------------------------------\n\nThis repository implements the subword segmentation as described in Sennrich et al. 
(2016),\nbut since version 0.2, there is one core difference related to end-of-word tokens.\n\nIn Sennrich et al. (2016), the end-of-word token `\u003c/w\u003e` is initially represented as a separate token, which can be merged with other subwords over time:\n\n```\nu n d \u003c/w\u003e\nf u n d \u003c/w\u003e\n```\n\nSince 0.2, end-of-word tokens are initially concatenated with the word-final character:\n\n```\nu n d\u003c/w\u003e\nf u n d\u003c/w\u003e\n```\n\nThe new representation ensures that when BPE codes are learned from the above examples and then applied to new text, it is clear that a subword unit `und` is unambiguously word-final, and `un` is unambiguously word-internal, preventing the production of up to two different subword units from each BPE merge operation.\n\n`apply_bpe.py` is backward-compatible and continues to accept old-style BPE files. New-style BPE files are identified by having the following first line: `#version: 0.2`\n\nACKNOWLEDGMENTS\n---------------\nThis project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R\u0026D Institute Poland, and from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frsennrich%2Fsubword-nmt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frsennrich%2Fsubword-nmt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frsennrich%2Fsubword-nmt/lists"}