{"id":19479570,"url":"https://github.com/zouharvi/tokenization-scorer","last_synced_at":"2025-10-06T03:12:10.039Z","repository":{"id":158823786,"uuid":"634255616","full_name":"zouharvi/tokenization-scorer","owner":"zouharvi","description":"Simple-to-use scoring function for arbitrarily tokenized texts.","archived":false,"fork":false,"pushed_at":"2025-02-19T12:17:59.000Z","size":43,"stargazers_count":46,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-25T07:57:54.170Z","etag":null,"topics":["bpe","segmentation","subword","tokenization"],"latest_commit_sha":null,"homepage":"https://aclanthology.org/2023.acl-long.284/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zouharvi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-29T14:44:20.000Z","updated_at":"2025-09-10T11:48:49.000Z","dependencies_parsed_at":"2024-03-08T17:45:29.253Z","dependency_job_id":"34003762-490d-47a0-b9ba-b972eab86513","html_url":"https://github.com/zouharvi/tokenization-scorer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zouharvi/tokenization-scorer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zouharvi%2Ftokenization-scorer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zouharvi%2Ftokenization-scorer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zouharvi%2Ftokenization-scorer/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/zouharvi%2Ftokenization-scorer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zouharvi","download_url":"https://codeload.github.com/zouharvi/tokenization-scorer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zouharvi%2Ftokenization-scorer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278551596,"owners_count":26005408,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-06T02:00:05.630Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bpe","segmentation","subword","tokenization"],"created_at":"2024-11-10T19:56:02.452Z","updated_at":"2025-10-06T03:12:10.008Z","avatar_url":"https://github.com/zouharvi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tokenization-scorer \u0026nbsp;\u0026nbsp;\u0026nbsp;\n\n[![PyPI Version](https://img.shields.io/pypi/v/tokenization-scorer.svg)](https://pypi.python.org/pypi/tokenization-scorer)\n\u0026nbsp;\n[![test tokenization-scorer](https://github.com/zouharvi/tokenization-scorer/actions/workflows/test.yml/badge.svg)](https://github.com/zouharvi/tokenization-scorer/actions/workflows/test.yml)\n\u0026nbsp;\n[![Paper](https://img.shields.io/badge/📜%20paper-481.svg)](https://aclanthology.org/2023.acl-long.284/)\n\nSimple 
package for evaluating text tokenizations.\nThe input is a text (a list of files or stdin) and the output is a single number.\nThe higher the number, the better the tokenization.\nThe intended workflow is to try multiple tokenizations and select the one with the highest number.\n\nIt can be used from the command line:\n\n```bash\npip3 install tokenization-scorer\n\ntokenization-scorer -i en-de.tokenized_with_unigramlm.{en,de}\n# \u003e 0.4826\n\ntokenization-scorer -i en-de.tokenized_with_wordpiece.{en,de}\n# \u003e 0.5047\n```\n\nor within Python:\n\n```python\nimport tokenization_scorer\ntext1 = \"pick @@ed pick @@l @@ed pick @@les\"\ntokenization_scorer.score(text1, metric=\"renyi\", power=2.5)\n# \u003e 0.8031528501359657\n\ntext2 = \"pick @@e @@d pick @@l @@e @@d pick @@l @@e @@s\"\ntokenization_scorer.score(text2, metric=\"renyi\", power=2.5)\n# \u003e 0.9105681923824472\n```\n\nUse `tokenization-scorer -h` to get an overview of supported metrics.\nThis package is a by-product of the paper [Tokenization and the Noiseless Channel](https://aclanthology.org/2023.acl-long.284/), which has [code here](https://github.com/zouharvi/tokenization-principle).\n\n```\n@inproceedings{tokenization_noiseless, \n    title={Tokenization and the Noiseless Channel},\n    author={Zouhar, Vilém and Meister, Clara and Gastaldi, Juan Luis and Sachan, Mrinmaya and Cotterell, Ryan},\n    booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},\n    year={2023},\n    url={https://aclanthology.org/2023.acl-long.284/},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzouharvi%2Ftokenization-scorer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzouharvi%2Ftokenization-scorer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzouharvi%2Ftokenization-scorer/lists"}