{"id":26826951,"url":"https://github.com/gautierdag/bpeasy","last_synced_at":"2025-04-06T12:07:12.796Z","repository":{"id":212655379,"uuid":"716786243","full_name":"gautierdag/bpeasy","owner":"gautierdag","description":"Fast bare-bones BPE for modern tokenizer training","archived":false,"fork":false,"pushed_at":"2025-04-02T09:25:01.000Z","size":1473,"stargazers_count":152,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-02T09:48:34.066Z","etag":null,"topics":["bpe","tokenization","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gautierdag.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-09T21:47:01.000Z","updated_at":"2025-04-02T09:24:35.000Z","dependencies_parsed_at":"2024-08-29T18:09:19.553Z","dependency_job_id":null,"html_url":"https://github.com/gautierdag/bpeasy","commit_stats":null,"previous_names":["gautierdag/bpeasy"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gautierdag%2Fbpeasy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gautierdag%2Fbpeasy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gautierdag%2Fbpeasy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gautierdag%2Fbpeasy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gautierdag","download_url":"https://cod
eload.github.com/gautierdag/bpeasy/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247478317,"owners_count":20945266,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bpe","tokenization","tokenizer"],"created_at":"2025-03-30T11:31:43.098Z","updated_at":"2025-04-06T12:07:12.773Z","avatar_url":"https://github.com/gautierdag.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# bpeasy\n\n[![codecov](https://codecov.io/gh/gautierdag/bpeasy/branch/main/graph/badge.svg?token=NWHDJ22L8I)](https://codecov.io/gh/gautierdag/bpeasy) [![tests](https://github.com/gautierdag/bpeasy/actions/workflows/test.yml/badge.svg)](https://github.com/gautierdag/bpeasy/actions/workflows/test.yml) [![image](https://img.shields.io/pypi/l/bpeasy.svg)](https://pypi.python.org/pypi/bpeasy) [![image](https://img.shields.io/pypi/pyversions/bpeasy.svg)](https://pypi.python.org/pypi/bpeasy) [![PyPI version](https://badge.fury.io/py/bpeasy.svg)](https://badge.fury.io/py/bpeasy)\n\n## Overview\n\n`bpeasy` is a Python package that provides a tokenizer trainer, implementing an efficient version of Byte Pair Encoding (BPE) in 400 lines of Rust. The implementation largely follows the Huggingface `tokenizers` library, but makes opinionated decisions to simplify tokenizer training, specifically to:\n\n1. Treat text data at the byte level first: all text is converted to bytes before training rather than using a character-level approach (as in Huggingface).\n2. Always use a regex-based split pre-tokenizer. 
This is a customisable regex that is applied to the text before training. This regex decides where to split the text and limits what kind of tokens are possible. This is technically possible in Huggingface but is not well documented. We also use the `fancy-regex` crate, which supports a richer set of regex features than the `regex` crate used in Huggingface.\n3. Use `int64` types for counting to allow for training on much larger datasets without the risk of overflow.\n\n**You can think of `bpeasy` as the `tiktoken` training code that never was.**\n\nSee the [benchmarks](/benchmarks/README.md) section for a comparison with the Huggingface library.\n\n## Installation\n\nInstall the package using pip:\n\n```bash\npip install bpeasy\n```\n\n## Training\n\nThe training function is designed to be bare-bones and returns the trained tokenizer vocab as a dictionary of bytes to integers. This allows for maximum flexibility in how you use the tokenizer. For example, you can then port the vocab to tiktoken or Huggingface tokenizers (see below).\n\n```python\nimport bpeasy\n\n# should be an iterator over str\n# (jsonl_content_iterator and args stand in for your own data loader and settings)\niterator = jsonl_content_iterator(args)\n# example regex from GPT-4\nregex_pattern = r\"\"\"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+\"\"\"\n\n# returns the vocab (dict[bytes, int])\nvocab = bpeasy.train_bpe(\n    iterator,\n    regex_pattern,\n    args.max_sentencepiece_length, # max length of tokens\n    args.vocab_size, # max size of vocab\n)\n```\n\nAlternatively, you can train using the basic tokenizer class provided:\n\n```python\nfrom bpeasy.tokenizer import BPEasyTokenizer\n\ntokenizer = BPEasyTokenizer.train(\n    iterator, # iterator over str\n    vocab_size=vocab_size,\n    max_token_length=max_token_length,\n    regex_pattern=regex_pattern,\n    special_tokens=[\"\u003cs\u003e\", \"\u003cpad\u003e\", \"\u003c/s\u003e\"],\n    
fill_to_nearest_multiple_of_eight=True,\n    name=\"bpeasy\",\n)\n```\n\n### Encoding/Decoding\n\nTo test your tokenizer, you can use the `BPEasyTokenizer` class, a wrapper around `tiktoken.Encoding` that simplifies the handling of vocabularies, special tokens, and regex patterns for tokenization.\n\n```python\nfrom bpeasy.tokenizer import BPEasyTokenizer\n\nyour_special_tokens = [\"\u003cs\u003e\", \"\u003cpad\u003e\", \"\u003c/s\u003e\"]\n\ntokenizer = BPEasyTokenizer(\n    vocab=vocab,\n    regex_pattern=regex_pattern,\n    special_tokens=your_special_tokens,\n    fill_to_nearest_multiple_of_eight=True, # pad vocab to multiple of 8\n    name=\"bpeasy\" # optional name for the tokenizer\n)\n\ntest = \"hello_world\"\n\n# encode and decode use the tiktoken functions\nencoded = tokenizer.encode(test)\ndecoded = tokenizer.decode(encoded)\n# decoded: \"hello_world\"\n```\n\nYou can also use `tiktoken` directly, but you would need to handle the special tokens and regex pattern yourself:\n\n```python\nimport bpeasy\nimport tiktoken\n\nvocab = bpeasy.train_bpe(...)\nspecial_tokens = [\"\u003cs\u003e\", \"\u003cpad\u003e\", \"\u003c/s\u003e\"]\n\n# Sort the vocab by rank\nsorted_vocab = sorted(vocab.items(), key=lambda x: x[1])\n\n# add special tokens, giving each the next free rank\nspecial_token_ranks = {}\nfor special_token in special_tokens:\n    special_token_ranks[special_token] = len(sorted_vocab)\n    sorted_vocab.append((special_token.encode(\"utf-8\"), len(sorted_vocab)))\n\nfull_vocab = dict(sorted_vocab)\n\nencoder = tiktoken.Encoding(\n    name=\"bpeasy\",\n    pat_str=regex_pattern,\n    mergeable_ranks=full_vocab,\n    special_tokens=special_token_ranks,\n)\n```\n\n### Save/Load tokenizer from file\n\nWe provide basic utility functions to save the tokenizer to a JSON file and load it back.\n\n```python\ntokenizer.save(\"path_to_file.json\")\n\ntokenizer = BPEasyTokenizer.from_file(\"path_to_file.json\")\n```\n\n### Export to HuggingFace format\n\nWe also 
support exporting the tokenizer to the HuggingFace format, which can then be used directly with the HuggingFace `transformers` library.\n\n```python\nfrom bpeasy.tokenizer import BPEasyTokenizer\nfrom transformers import PreTrainedTokenizerFast\n\ntokenizer = BPEasyTokenizer(\n    ...\n)\n\ntokenizer.export_to_huggingface_format(\"hf_tokenizer.json\")\n\nhf_tokenizer = PreTrainedTokenizerFast(tokenizer_file=\"hf_tokenizer.json\")\n```\n\n### Export vocab to `tiktoken` txt format\n\n```python\nimport bpeasy\nfrom bpeasy import save_vocab_to_tiktoken\n\nvocab = bpeasy.train_bpe(...)\n\n# saves the vocab to a tiktoken txt file format\nsave_vocab_to_tiktoken(vocab, \"vocab.txt\", special_tokens=[\"\u003cs\u003e\", \"\u003cpad\u003e\", \"\u003c/s\u003e\"])\n```\n\nIf you want to use the `tiktoken` txt format, you will still need to handle the regex and special tokens yourself, as shown above.\n\n## Contributing\n\nContributions are welcome! Please open an issue if you have any suggestions or improvements.\n\n## License\n\nThis project is licensed under the MIT License.\n\n## Citation\n\nIf you use `bpeasy` in your research, please cite it as follows:\n\n```bibtex\n@software{bpeasy,\n  author = {Gautier Dagan},\n  title = {bpeasy},\n  year = {2024},\n  url = {https://github.com/gautierdag/bpeasy},\n  repository = {https://github.com/gautierdag/bpeasy},\n  author-email = {gautier.dagan@ed.ac.uk},\n  affiliation = {University of Edinburgh},\n  orcid = {https://orcid.org/0000-0002-1867-4201}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgautierdag%2Fbpeasy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgautierdag%2Fbpeasy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgautierdag%2Fbpeasy/lists"}