{"id":18119075,"url":"https://github.com/zjaume/escape-unk","last_synced_at":"2026-02-19T05:32:01.096Z","repository":{"id":60360673,"uuid":"542137259","full_name":"ZJaume/escape-unk","owner":"ZJaume","description":"Escape unknown symbols in SentecePiece vocabularies","archived":false,"fork":false,"pushed_at":"2025-02-12T09:29:34.000Z","size":22,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-15T01:39:00.938Z","etag":null,"topics":["escaping","natural-language-processing","neural-machine-translation","sentencepiece"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ZJaume.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-09-27T14:45:44.000Z","updated_at":"2025-02-12T09:28:16.000Z","dependencies_parsed_at":"2025-02-13T08:47:59.337Z","dependency_job_id":null,"html_url":"https://github.com/ZJaume/escape-unk","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/ZJaume/escape-unk","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fescape-unk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fescape-unk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fescape-unk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fescape-unk/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/ZJaume","download_url":"https://codeload.github.com/ZJaume/escape-unk/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fescape-unk/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29604552,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-19T05:11:50.834Z","status":"ssl_error","status_checked_at":"2026-02-19T05:11:38.921Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["escaping","natural-language-processing","neural-machine-translation","sentencepiece"],"created_at":"2024-11-01T05:14:41.653Z","updated_at":"2026-02-19T05:32:00.248Z","avatar_url":"https://github.com/ZJaume.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# escape-unk\nEscape unknown symbols in SentencePiece vocabularies.\nThis is particularly useful for the [MarianNMT](https://github.com/marian-nmt/marian) toolkit, which does not support replacing unknown tokens with the most attended word in the source (see [here](https://github.com/marian-nmt/marian-dev/issues/732), thanks to @emjotde for the idea).\n\n**IMPORTANT NOTE**: this solution is far from ideal, as the model, especially if it has not been trained with escaped chars, may fail to copy the escaped unknown characters. 
Ideally, you should train your SentencePiece vocabulary with the `--byte_fallback` option. This is just a workaround for scenarios where the model does not have byte fallback or cannot be retrained.\n\n## Install\nJust install it from PyPI:\n```\npip install escape-unk\n```\n\n## Background\nThere are some scenarios where your machine translation model has to translate sentences containing characters unknown to the SentencePiece vocabulary.\nNeural models usually start to hallucinate, produce garbage, or simply fail to translate when an unknown character appears in the input.\nIn the cases where those characters simply need to be copied, escaping them to their hexadecimal representation can be useful if the model manages to copy the escaped symbols.\n\nEscaping Chinese characters with an English-German vocabulary looks like this:\n```bash\necho \"Beijing (Chinese: 北京) is the capital of the People's Republic of China\" | escape-unk -m vocab.deen.spm\n```\n```\nBeijing (Chinese: [[e58c97e4baac]]) is the capital of the People's Republic of China\n```\n\nor escaping emojis:\n```bash\necho \"I ❤️ you\" | escape-unk -m vocab.deen.spm\n```\n```\nI [[e29da4efb88f]] you\n```\n\nSo instead of:\n```bash\necho \"Beijing (Chinese: 北京) is the capital of the People's Republic of China\" | marian-decoder -c model.config.yml\n```\n```\nPeking (chinesisch: ) ist die Hauptstadt der Volksrepublik China\n```\n\nwe will have:\n```bash\necho \"Beijing (Chinese: 北京) is the capital of the People's Republic of China\" | escape-unk -m vocab.deen.spm | marian-decoder -c model.config.yml\n```\n```\nBeijing (chinesisch: [[e58c97e4baac]]) ist die Hauptstadt der Volksrepublik China\n```\n\nand the full pipeline with `unescape-unk`:\n```bash\necho \"Beijing ...\" | escape-unk -m vocab.deen.spm | marian-decoder -c config.yml | unescape-unk\n```\n```\nBeijing (chinesisch: 北京) ist die Hauptstadt der Volksrepublik China\n```\n\n**WARNING**: if an escaped sequence is not correctly copied by the translator 
and generates an invalid sequence,\nthe character is omitted and replaced by an empty string.\nIf you want it to fail when this happens, use the `--strict`/`-s` mode of the `unescape-unk` command.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjaume%2Fescape-unk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzjaume%2Fescape-unk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjaume%2Fescape-unk/lists"}