{"id":48891341,"url":"https://github.com/jonasknobloch/tokenizers-mbpe","last_synced_at":"2026-04-16T08:04:58.210Z","repository":{"id":210446115,"uuid":"722695762","full_name":"jonasknobloch/tokenizers-mbpe","owner":"jonasknobloch","description":"Morphologically biased byte-pair encoding pre-tokenization","archived":false,"fork":false,"pushed_at":"2024-11-11T10:56:00.000Z","size":139,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-24T12:08:01.735Z","etag":null,"topics":["byte-pair-encoding","morphological-analysis","morphology","nlp","segmentation","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jonasknobloch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-23T18:10:15.000Z","updated_at":"2026-01-06T22:30:01.000Z","dependencies_parsed_at":"2024-04-08T21:27:59.479Z","dependency_job_id":"30bee29d-6cb0-4fd9-875c-4090ceaa0253","html_url":"https://github.com/jonasknobloch/tokenizers-mbpe","commit_stats":{"total_commits":46,"total_committers":1,"mean_commits":46.0,"dds":0.0,"last_synced_commit":"08a526f89923d540709033910491704c7de2dedf"},"previous_names":["jonasknobloch/morphy","jonasknobloch/mbpe"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jonasknobloch/tokenizers-mbpe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasknobloch%2Ftokenizers-mbpe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/jonasknobloch%2Ftokenizers-mbpe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasknobloch%2Ftokenizers-mbpe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasknobloch%2Ftokenizers-mbpe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jonasknobloch","download_url":"https://codeload.github.com/jonasknobloch/tokenizers-mbpe/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasknobloch%2Ftokenizers-mbpe/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31876860,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-16T07:36:03.521Z","status":"ssl_error","status_checked_at":"2026-04-16T07:35:53.576Z","response_time":69,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["byte-pair-encoding","morphological-analysis","morphology","nlp","segmentation","tokenizer"],"created_at":"2026-04-16T08:04:14.523Z","updated_at":"2026-04-16T08:04:58.202Z","avatar_url":"https://github.com/jonasknobloch.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Morphologically Biased Byte-Pair Encoding\n\nmBPE acts as an extension to the [huggingface/tokenizers](https://github.com/huggingface/tokenizers) library and is\ndesigned 
to enhance segmentations produced by the byte-pair encoding tokenization algorithm[^1]. Byte-pair encoding has\nbeen shown to poorly approximate morphological boundaries[^2], which is especially problematic for morphologically rich\nlanguages. By incorporating morphological knowledge into the pre-tokenization process, we aim to improve the quality of\nproduced segmentations through an induced bias towards morphologically motivated sub-word boundaries.\n\n[^1]: [Neural Machine Translation of Rare Words with Subword Units](https://doi.org/10.48550/arXiv.1508.07909)\n\n[^2]: [Byte Pair Encoding is Suboptimal for Language Model Pretraining](https://doi.org/10.48550/arXiv.2004.03720)\n\nPre-trained tokenizers and models are available on [Hugging Face](https://huggingface.co/jonasknobloch).\n\n* [gpt2_cx-en_00000-00000_50k](https://huggingface.co/jonasknobloch/gpt2_cx-cs_00000-00019_50k)\n* [gpt2+ts_cx-en_00000-00000_50k](https://huggingface.co/jonasknobloch/gpt2-ts_cx-en_00000-00009_50k)\n* [gpt2+morf_u0-30-50-x_cx-en_00000-00000_50k](https://huggingface.co/jonasknobloch/gpt2-morf_u0-30-50-x_cx-en_00000-00009_50k)\n* [gpt2+morf_s0-30-x-2_cx-en_00000-00000_50k](https://huggingface.co/jonasknobloch/gpt2-morf_s0-30-x-2_cx-en_00000-00009_50k)\n\n## Pre-Tokenizers\n\n### External\n\nThe external pre-tokenizer enables the integration of custom pre-tokenization algorithms via a socket connection.\nTokenization parallelism should be disabled by setting `TOKENIZERS_PARALLELISM=false`. Note that disabling parallelism\nwill slow down tokenization significantly. See [jonasknobloch/unimorph](https://github.com/jonasknobloch/unimorph)\nfor a reference server implementation.\n\n### Tree-Split\n\nThe tree-split pre-tokenizer introduces additional boundaries by clustering inflected word forms retrieved from\n[UniMorph](https://unimorph.github.io)[^3] dictionaries. Form clusters are aligned by constructing a suffix tree for each\ncluster. 
New boundaries are then introduced by traversing the trees and placing a boundary at every node with multiple children.\n\n[^3]: [UniMorph 4.0: Universal Morphology](https://doi.org/10.48550/arXiv.2205.03608)\n\n### Morfessor\n\nThe Morfessor pre-tokenizer introduces additional boundaries retrieved using an arbitrary\n[Morfessor](http://morpho.aalto.fi/projects/morpho/morfessor2.shtml)[^4][^5] model. Trained Morfessor models need to be\nconverted using the provided protobuf definition and conversion script.\n\n[^4]: [Unsupervised Discovery of Morphemes](https://doi.org/10.48550/arXiv.cs/0205057)\n\n[^5]: [Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline](https://urn.fi/URN:ISBN:978-952-60-5501-5)\n\n## Intrinsic Metrics\n\n### Tokenizer Fertility\n\n| tokenizer                                  | compounds | fertility |\n|--------------------------------------------|-----------|-----------|\n| gpt2_cx-en_00000-00000_50k                 | 4992469   | **1.32**  |\n| gpt2+ts_cx-en_00000-00000_50k              | 4923123   | 1.40      |\n| gpt2+morf_u0-30-50-x_cx-en_00000-00000_50k | 3630703   | 1.42      |\n| gpt2+morf_s0-30-x-2_cx-en_00000-00000_50k  | 99191     | 1.69      |\n\n### Boundary Precision and Recall\n\n| tokenizer                                  | P        | R        | F1       |\n|--------------------------------------------|----------|----------|----------|\n| gpt2_cx-en_00000-00000_50k                 | 0.33     | 0.56     | 0.42     |\n| gpt2+ts_cx-en_00000-00000_50k              | 0.40     | 0.58     | 0.47     |\n| gpt2+morf_u0-30-50-x_cx-en_00000-00000_50k | 0.45     | **0.61** | 0.52     |\n| gpt2+morf_s0-30-x-2_cx-en_00000-00000_50k  | **0.56** | 0.59     | **0.57** 
|\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonasknobloch%2Ftokenizers-mbpe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjonasknobloch%2Ftokenizers-mbpe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonasknobloch%2Ftokenizers-mbpe/lists"}