{"id":32116831,"url":"https://github.com/hplt-project/sacremoses","last_synced_at":"2026-02-20T00:30:54.530Z","repository":{"id":31878794,"uuid":"130306621","full_name":"hplt-project/sacremoses","owner":"hplt-project","description":"Python port of Moses tokenizer, truecaser and normalizer","archived":false,"fork":false,"pushed_at":"2026-02-06T10:10:50.000Z","size":750,"stargazers_count":495,"open_issues_count":32,"forks_count":60,"subscribers_count":11,"default_branch":"master","last_synced_at":"2026-02-06T18:11:57.400Z","etag":null,"topics":["machine-translation","nlp","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hplt-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2018-04-20T03:59:25.000Z","updated_at":"2026-02-06T10:10:55.000Z","dependencies_parsed_at":"2024-06-18T12:29:45.149Z","dependency_job_id":"6b9e0158-0dfc-4e89-887d-b4b3006dcd67","html_url":"https://github.com/hplt-project/sacremoses","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/hplt-project/sacremoses","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hplt-project%2Fsacremoses","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hplt-project%2Fsacremoses/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hplt-project%2Fsacremoses/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hplt-project%2Fsacremoses/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hplt-project","download_url":"https://codeload.github.com/hplt-project/sacremoses/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hplt-project%2Fsacremoses/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29637408,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-19T22:32:43.237Z","status":"ssl_error","status_checked_at":"2026-02-19T22:32:38.330Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-translation","nlp","tokenizer"],"created_at":"2025-10-20T16:14:33.521Z","updated_at":"2026-02-20T00:30:54.525Z","avatar_url":"https://github.com/hplt-project.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Sacremoses\n\n# License\n\n[MIT License](LICENSE).\n\n# Install\n\n```\npip install -U sacremoses\n```\n\nNOTE: Sacremoses only supports Python 3 now (`sacremoses\u003e=0.0.41`). 
If you're using Python 2, the last possible version is `sacremoses==0.0.40`.\n\n# Usage (Python)\n\n## Tokenizer and Detokenizer\n\n```python\n\u003e\u003e\u003e from sacremoses import MosesTokenizer, MosesDetokenizer\n\u003e\u003e\u003e mt = MosesTokenizer(lang='en')\n\u003e\u003e\u003e text = 'This, is a sentence with weird\\xbb symbols\\u2026 appearing everywhere\\xbf'\n\u003e\u003e\u003e expected_tokenized = 'This , is a sentence with weird \\xbb symbols \\u2026 appearing everywhere \\xbf'\n\u003e\u003e\u003e tokenized_text = mt.tokenize(text, return_str=True)\n\u003e\u003e\u003e tokenized_text == expected_tokenized\nTrue\n\n\n\u003e\u003e\u003e mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')\n\u003e\u003e\u003e sent = \"This ain't funny. It's actually hillarious, yet double Ls. | [] \u003c \u003e [ ] \u0026 You're gonna shake it off? Don't?\"\n\u003e\u003e\u003e expected_tokens = ['This', 'ain', '\u0026apos;t', 'funny', '.', 'It', '\u0026apos;s', 'actually', 'hillarious', ',', 'yet', 'double', 'Ls', '.', '\u0026#124;', '\u0026#91;', '\u0026#93;', '\u0026lt;', '\u0026gt;', '\u0026#91;', '\u0026#93;', '\u0026amp;', 'You', '\u0026apos;re', 'gonna', 'shake', 'it', 'off', '?', 'Don', '\u0026apos;t', '?']\n\u003e\u003e\u003e expected_detokens = \"This ain't funny. It's actually hillarious, yet double Ls. | [] \u003c \u003e [] \u0026 You're gonna shake it off? 
Don't?\"\n\u003e\u003e\u003e mt.tokenize(sent) == expected_tokens\nTrue\n\u003e\u003e\u003e md.detokenize(expected_tokens) == expected_detokens\nTrue\n```\n\n\n## Truecaser\n\n```python\n\u003e\u003e\u003e from sacremoses import MosesTruecaser, MosesTokenizer\n\n# Train a new truecaser from a 'big.txt' file.\n\u003e\u003e\u003e mtr = MosesTruecaser()\n\u003e\u003e\u003e mtok = MosesTokenizer(lang='en')\n\n# Save the truecase model to 'big.truecasemodel' using `save_to`\n\u003e\u003e\u003e tokenized_docs = [mtok.tokenize(line) for line in open('big.txt')]\n\u003e\u003e\u003e mtr.train(tokenized_docs, save_to='big.truecasemodel')\n\n# Save the truecase model to 'big.truecasemodel' after training\n# (just in case you forgot to use `save_to`)\n\u003e\u003e\u003e mtr = MosesTruecaser()\n\u003e\u003e\u003e mtr.train_from_file('big.txt')\n\u003e\u003e\u003e mtr.save_model('big.truecasemodel')\n\n# Truecase a string after training a model.\n\u003e\u003e\u003e mtr = MosesTruecaser()\n\u003e\u003e\u003e mtr.train_from_file('big.txt')\n\u003e\u003e\u003e mtr.truecase(\"THE ADVENTURES OF SHERLOCK HOLMES\")\n['the', 'adventures', 'of', 'Sherlock', 'Holmes']\n\n# Load a model and truecase a string using the trained model.\n\u003e\u003e\u003e mtr = MosesTruecaser('big.truecasemodel')\n\u003e\u003e\u003e mtr.truecase(\"THE ADVENTURES OF SHERLOCK HOLMES\")\n['the', 'adventures', 'of', 'Sherlock', 'Holmes']\n\u003e\u003e\u003e mtr.truecase(\"THE ADVENTURES OF SHERLOCK HOLMES\", use_known=True)\n['the', 'ADVENTURES', 'OF', 'SHERLOCK', 'HOLMES']\n\u003e\u003e\u003e mtr.truecase(\"THE ADVENTURES OF SHERLOCK HOLMES\", return_str=True)\n'the adventures of Sherlock Holmes'\n```\n\n## Normalizer\n\n```python\n\u003e\u003e\u003e from sacremoses import MosesPunctNormalizer\n\u003e\u003e\u003e mpn = MosesPunctNormalizer()\n\u003e\u003e\u003e mpn.normalize('THIS EBOOK IS OTHERWISE PROVIDED TO YOU \"AS-IS.\"')\n'THIS EBOOK IS OTHERWISE PROVIDED TO YOU \"AS-IS.\"'\n```\n\n# Usage (CLI)\n\nSince version `0.0.42`, the pipeline feature 
for the CLI has been available, so there\nare global options that must be set before calling the commands:\n\n - language\n - processes\n - encoding\n - quiet\n\n```shell\n$ pip install -U 'sacremoses\u003e=0.1'\n\n$ sacremoses --help\nUsage: sacremoses [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...\n\nOptions:\n  -l, --language TEXT      Use language specific rules when tokenizing\n  -j, --processes INTEGER  No. of processes.\n  -e, --encoding TEXT      Specify encoding of file.\n  -q, --quiet              Disable progress bar.\n  --version                Show the version and exit.\n  -h, --help               Show this message and exit.\n\nCommands:\n  detokenize\n  detruecase\n  normalize\n  tokenize\n  train-truecase\n  truecase\n```\n\n## Pipeline\n\nAn example that chains the following commands:\n\n - `normalize` with `-c` option to remove control characters.\n - `tokenize` with `-a` option for aggressive dash split rules.\n - `truecase` with `-a` option to indicate that the model is for ASR\n   - if `big.truemodel` exists, load the model with the `-m` option,\n   - otherwise train a model and save it with the `-m` option to the `big.truemodel` file.\n - redirect the output to the `big.txt.norm.tok.true` file.\n\n```shell\ncat big.txt | sacremoses -l en -j 4 \\\n    normalize -c tokenize -a truecase -a -m big.truemodel \\\n    \u003e big.txt.norm.tok.true\n```\n\n## Tokenizer\n\n```shell\n$ sacremoses tokenize --help\nUsage: sacremoses tokenize [OPTIONS]\n\nOptions:\n  -a, --aggressive-dash-splits   Triggers dash split rules.\n  -x, --xml-escape               Escape special characters for XML.\n  -p, --protected-patterns TEXT  Specify file with patters to be protected in\n                                 tokenisation.\n  -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,\n                                 add prefixes to the default ones from the\n                                 specified language.\n  -h, --help                     Show this 
message and exit.\n\n\n $ sacremoses -l en -j 4 tokenize  \u003c big.txt \u003e big.txt.tok\n100%|██████████████████████████████████| 128457/128457 [00:05\u003c00:00, 24363.39it/s]\n\n $ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns\n $ sacremoses -l en -j 4 tokenize -p basic-protected-patterns \u003c big.txt \u003e big.txt.tok\n100%|██████████████████████████████████| 128457/128457 [00:05\u003c00:00, 22183.94it/s]\n```\n\n## Detokenizer\n\n```shell\n$ sacremoses detokenize --help\nUsage: sacremoses detokenize [OPTIONS]\n\nOptions:\n  -x, --xml-unescape  Unescape special characters for XML.\n  -h, --help          Show this message and exit.\n\n $ sacremoses -l en -j 4 detokenize \u003c big.txt.tok \u003e big.txt.tok.detok\n100%|██████████████████████████████████| 128457/128457 [00:16\u003c00:00, 7931.26it/s]\n```\n\n## Truecase\n\n```shell\n$ sacremoses truecase --help\nUsage: sacremoses truecase [OPTIONS]\n\nOptions:\n  -m, --modelfile TEXT            Filename to save/load the modelfile.\n                                  [required]\n  -a, --is-asr                    A flag to indicate that model is for ASR.\n  -p, --possibly-use-first-token  Use the first token as part of truecase\n                                  training.\n  -h, --help                      Show this message and exit.\n\n$ sacremoses -j 4 truecase -m big.model \u003c big.txt.tok \u003e big.txt.tok.true\n100%|██████████████████████████████████| 128457/128457 [00:09\u003c00:00, 14257.27it/s]\n```\n\n## Detruecase\n\n```shell\n$ sacremoses detruecase --help\nUsage: sacremoses detruecase [OPTIONS]\n\nOptions:\n  -j, --processes INTEGER  No. 
of processes.\n  -a, --is-headline        Whether the file are headlines.\n  -e, --encoding TEXT      Specify encoding of file.\n  -h, --help               Show this message and exit.\n\n$ sacremoses -j 4 detruecase  \u003c big.txt.tok.true \u003e big.txt.tok.true.detrue\n100%|█████████████████████████████████| 128457/128457 [00:04\u003c00:00, 26945.16it/s]\n```\n\n## Normalize\n\n```shell\n$ sacremoses normalize --help\nUsage: sacremoses normalize [OPTIONS]\n\nOptions:\n  -q, --normalize-quote-commas  Normalize quotations and commas.\n  -d, --normalize-numbers       Normalize number.\n  -p, --replace-unicode-puncts  Replace unicode punctuations BEFORE\n                                normalization.\n  -c, --remove-control-chars    Remove control characters AFTER normalization.\n  -h, --help                    Show this message and exit.\n\n$ sacremoses -j 4 normalize \u003c big.txt \u003e big.txt.norm\n100%|██████████████████████████████████| 128457/128457 [00:09\u003c00:00, 13096.23it/s]\n```\n\n# Acknowledgements\n\nThis project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhplt-project%2Fsacremoses","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhplt-project%2Fsacremoses","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhplt-project%2Fsacremoses/lists"}