{"id":27710373,"url":"https://github.com/bminixhofer/tokenkit","last_synced_at":"2025-05-15T16:18:25.346Z","repository":{"id":285648591,"uuid":"954668871","full_name":"bminixhofer/tokenkit","owner":"bminixhofer","description":"A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.","archived":false,"fork":false,"pushed_at":"2025-05-14T13:04:52.000Z","size":503,"stargazers_count":20,"open_issues_count":1,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-05-14T13:32:03.271Z","etag":null,"topics":["distillation","jax","llms","machine-learning","tokenization","tokenizer-transfer","transfer-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bminixhofer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-25T12:42:55.000Z","updated_at":"2025-05-12T06:49:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"937760bf-851d-484c-93bf-f2d1ddc931d7","html_url":"https://github.com/bminixhofer/tokenkit","commit_stats":null,"previous_names":["bminixhofer/tokenkit"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bminixhofer%2Ftokenkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bminixhofer%2Ftokenkit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bminixhofer%2Ftokenkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/bminixhofer%2Ftokenkit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bminixhofer","download_url":"https://codeload.github.com/bminixhofer/tokenkit/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254374530,"owners_count":22060614,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distillation","jax","llms","machine-learning","tokenization","tokenizer-transfer","transfer-learning"],"created_at":"2025-04-26T16:02:26.795Z","updated_at":"2025-05-15T16:18:25.330Z","avatar_url":"https://github.com/bminixhofer.png","language":"Python","readme":"\u003ch1 align=\"center\"\u003etokenkit🔁\u003c/h1\u003e\n\u003ch3 align=\"center\"\u003eTokenization Transfer for LLMs\u003c/h3\u003e\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"https://github.com/user-attachments/assets/976478c8-8994-4780-8d77-b429ec707932\" width=\"600\"\u003e\n\u003c/div\u003e\n\n`tokenkit` is a toolkit implementing advanced methods to transfer *models* and *model knowledge* across tokenizers.\n\n## News\n\n- __2025-04-23__: A new guide on [implementing cross-tokenizer distillation via ALM from scratch in PyTorch](./docs/pytorch_alm_from_scratch.ipynb)! 🔥\n- __2025-04-22__: New [Llama3-2-3B-IT-Byte](https://huggingface.co/benjamin/Llama3-2-3B-IT-Byte) and [Gemma2-2B-IT-Byte](https://huggingface.co/benjamin/Gemma2-2B-IT-Byte) checkpoints with native `transformers` support (plus, documentation on how to train them). 
Also, new guides for [running tokenizer transfer](./docs/tokenizer_transfer.md) and [byteification](./docs/byteification.md)!\n- __2025-04-02__: The initial release of `tokenkit` with support for cross-tokenizer distillation via ALM and Zero-Shot Tokenizer Transfer via FVT!\n\n## Contents\n- [Why Transfer Across Tokenizers?](#why-transfer-across-tokenizers)\n- [Installation](#installation)\n- [Quickstart](#quickstart)\n- [Guides](#guides)\n    - [Tokenizer Transfer via tokenkit](./docs/tokenizer_transfer.md)\n    - [Byteification: A Unified Interface to Tokenizers](./docs/byteification.md)\n    - [Implementing ALM From Scratch in PyTorch](./docs/pytorch_alm_from_scratch.ipynb) (new! 🔥)\n- [Features](#features)\n    - [Cross-Tokenizer Distillation](#cross-tokenizer-distillation)\n    - [Zero-Shot Tokenizer Transfer](#zero-shot-tokenizer-transfer)\n    - [Token-Level Ensembling \u0026 Evaluating Transferred Models](#token-level-ensembling--evaluating-transferred-models)\n- [Citation](#citation)\n- [Acknowledgments](#acknowledgments)\n\n## Why Transfer Across Tokenizers?\n\nLLMs are bound to the tokenizer they were pretrained with. This limits their adaptability, reusability and modularity. Tokenizer transfer can lift this limitation. 
For example:\n- If we want to reuse an LLM trained primarily on English in another language, we might want to update its tokenizer to one that is more suitable for the new language.\n- If we want to combine (e.g., token-level ensemble) two LLMs, we need to transfer them to a common tokenizer.\n- If we want to experiment with better tokenization schemes (e.g., byte-level tokenization), we might want to transfer an existing LLM to this tokenizer instead of expensively training a new one from scratch.\n- If we want to transfer knowledge from a large teacher model to a smaller student model (which uses another tokenizer), we might want to use *cross-tokenizer distillation* to directly transfer the teacher's knowledge to the student without first transferring the teacher to the student's tokenizer.\n\nThis library aims to let you accomplish all of this.\n\n## Installation\n\n`tokenkit` is primarily implemented in Jax, using PyTorch for data loading (so your PyTorch installation does not need to support an accelerator). Recommended installation:\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eTPU\u003c/strong\u003e\u003c/summary\u003e\n\n```bash\n# Clone the repository\ngit clone https://github.com/bminixhofer/tokenkit\ncd tokenkit\n\n# Create a new virtual environment\n# Currently requires Python \u003c=3.10, but we are working on this: https://github.com/bminixhofer/tokenkit/issues/4\npython -m venv tokenkit_env\n. 
tokenkit_env/bin/activate\n\n# Install torch \u0026 jax 0.5.0\npip install torch \"jax[tpu]==0.5.0\" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html\n\n# Currently, tokenkit relies on a fork of `lm_eval`\npip install git+https://github.com/bminixhofer/lm-evaluation-harness\n\n# Install the library and the remaining dependencies\npip install -r requirements.txt\npip install -e .\n# You can ignore warnings from the command below, see https://github.com/bminixhofer/tokenkit/issues/4\npip install paxml==1.4.0 praxis==1.4.0 --no-deps\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eGPU\u003c/strong\u003e\u003c/summary\u003e\n\n```bash\n# Clone the repository\ngit clone https://github.com/bminixhofer/tokenkit\ncd tokenkit\n\n# Create a new virtual environment\n# Currently requires Python \u003c=3.10, but we are working on this: https://github.com/bminixhofer/tokenkit/issues/4\npython -m venv tokenkit_env\n. tokenkit_env/bin/activate\n\n# Install torch \u0026 jax 0.5.0\n# You may need to substitute cuda12 with the version of CUDA you are using:\npip install torch \"jax[cuda12]==0.5.0\"\n\n# Currently, tokenkit relies on a fork of `lm_eval`\npip install git+https://github.com/bminixhofer/lm-evaluation-harness\n\n# Install the library and the remaining dependencies\npip install -r requirements.txt\npip install -e .\n# You can ignore warnings from the command below, see https://github.com/bminixhofer/tokenkit/issues/4\npip install paxml==1.4.0 praxis==1.4.0 --no-deps\n```\n\u003c/details\u003e\n\n## Quickstart\n\nAfter installing the library, you can play around with the scripts in `examples/` to get started immediately. 
For example:\n\n```bash\nbash examples/llama3_to_byte_tokenizer_gpu.sh\n```\n\nIf you're interested in reproducing or improving on a public model that has been trained via ALM, you can also take a look at the `tokenkit` command used to train that model, for example [in the Training section of the Llama3-2-3B-IT-Byte model card](https://huggingface.co/benjamin/Llama3-2-3B-IT-Byte#training).\n\n## Guides\n\n- [Tokenizer Transfer via tokenkit](./docs/tokenizer_transfer.md) (start here!)\n- [Byteification: A Unified Interface to Tokenizers](./docs/byteification.md)\n- [Implementing ALM From Scratch in PyTorch](./docs/pytorch_alm_from_scratch.ipynb) (interactive notebook)\n\n## Features\n\n### Cross-Tokenizer Distillation\n\n`tokenkit` supports [Approximate Likelihood Matching (ALM)](https://arxiv.org/abs/2503.20083) for cross-tokenizer distillation. ALM usually performs best, but we have also implemented the following baselines:\n\n- [Dual Space Knowledge Distillation (DSKD)](https://arxiv.org/abs/2406.17328)\n- [Universal Logit Distillation (ULD)](https://arxiv.org/abs/2402.12030)\n- [Minimum Edit Distance Logit Alignment (MinED)](https://arxiv.org/abs/2401.10491)\n\nYou can run cross-tokenizer distillation using the [`scripts/cross_tokenizer_distill.py`](scripts/cross_tokenizer_distill.py) script. See [`examples`](examples) for examples of transferring to different subword tokenizers and to byte-level tokenization.\n\n### Zero-Shot Tokenizer Transfer\n\n`tokenkit` supports Zero-Shot Tokenizer Transfer (ZeTT) via [Fast Vocabulary Transfer (FVT)](https://aclanthology.org/2022.emnlp-industry.41). Zero-Shot Tokenizer Transfer is usually used to obtain a good initialization for additional training, but it can in some cases also be useful on its own. 
See our [ZeTT paper](https://arxiv.org/abs/2405.07883) for more details.\n\nYou can run Zero-Shot Tokenizer Transfer using the [`scripts/zett.py`](scripts/zett.py) script.\n\n**🚧 We are working on implementing more ZeTT methods (including hypernetwork training introduced [here](https://arxiv.org/abs/2405.07883)).**\n\n### Token-Level Ensembling \u0026 Evaluating Transferred Models\n\n`tokenkit` supports autoregressive generation \u0026 loglikelihood scoring evaluation by implementing a Jax backend for the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness). Alongside generating from single models, you can also generate from *token-level ensembles* of models. There are some predefined ensembles in [`configs/models`](configs/models). For example, this evaluates a token-level ensemble of Llama and Qwen on MMLU:\n\n```bash\npython3 scripts/eval_lockstep.py \\\n  models=llama_qwen \\\n  eval.tasks=[mmlu]\n```\n\nTo evaluate pretrained byte-level models, you'll need to pass embeddings with which to expand the input ids (i.e., to use as n-gram embeddings). 
For example:\n\n```bash\npython3 scripts/eval.py \\\n  model.pretrained_model_name_or_path=\\'benjamin/Gemma2-2B-IT-Byte\\' \\\n  model.tokenizer_name=\\'google/gemma-2-2b-it:source=Gemma2:conversion=byte\\' \\\n  expand_model.pretrained_model_name_or_path=\\'benjamin/gemma-2-2b-it-flax\\' \\\n  expand_model.tokenizer_name=\\'google/gemma-2-2b-it:source=Gemma2\\' \\\n  eval.tasks=[mmlu]\n```\n\nTo evaluate any other model (e.g., subword-to-subword transferred models), use something like the following:\n\n```bash\npython3 scripts/eval.py \\\n  model.pretrained_model_name_or_path=\\'benjamin/Gemma2-2B-IT-with-Qwen2-Tokenizer\\' \\\n  model.tokenizer_name=\\'benjamin/Gemma2-2B-IT-with-Qwen2-Tokenizer:source=Gemma2:conversion=prebyteified\\' \\\n  eval.tasks=[mmlu]\n```\n\n## Citation\n\nTo refer to this repository or to cite Approximate Likelihood Matching, please use this citation:\n\n```\n@article{alm,\n  title={Cross-Tokenizer Distillation via Approximate Likelihood Matching},\n  author={Minixhofer, Benjamin and Vuli{\\'c}, Ivan and Ponti, Edoardo Maria},\n  journal={arXiv preprint arXiv:2503.20083},\n  year={2025}\n}\n```\n\nPlease use this citation for Zero-Shot Tokenizer Transfer:\n\n```\n@inproceedings{zett,\n  title={Zero-Shot Tokenizer Transfer},\n  author={Benjamin Minixhofer and Edoardo Ponti and Ivan Vuli{\\'c}},\n  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},\n  year={2024},\n  url={https://openreview.net/forum?id=RwBObRsIzC}\n}\n```\n\n## Acknowledgments\n\nConstituent projects (ALM, ZeTT) were supported by a Royal Society University Research Fellowship ‘Inclusive and Sustainable Language Technology for a Truly Multilingual World’ (no 221137; 2022-) awarded to Ivan Vulić, by the Google Cloud Research Credits program with the award GCP329647813, and by Cloud TPUs from Google’s TPU Research Cloud (TRC). The name `tokenkit` and the README layout were inspired by [mergekit](https://github.com/arcee-ai/mergekit). 
[big_vision](https://github.com/google-research/big_vision) was extremely useful as a high-quality reference JAX training codebase.\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbminixhofer%2Ftokenkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbminixhofer%2Ftokenkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbminixhofer%2Ftokenkit/lists"}