{"id":20925311,"url":"https://github.com/apehex/tokun","last_synced_at":"2025-05-13T17:32:59.685Z","repository":{"id":238123822,"uuid":"794945150","full_name":"apehex/tokun","owner":"apehex","description":"Tokun to can tokens","archived":false,"fork":false,"pushed_at":"2024-11-14T07:52:43.000Z","size":3793520,"stargazers_count":15,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-14T08:30:34.187Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apehex.png","metadata":{"files":{"readme":".github/README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-02T09:21:28.000Z","updated_at":"2024-11-14T07:52:46.000Z","dependencies_parsed_at":"2024-11-14T08:34:52.274Z","dependency_job_id":null,"html_url":"https://github.com/apehex/tokun","commit_stats":null,"previous_names":["apehex/tokun"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apehex%2Ftokun","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apehex%2Ftokun/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apehex%2Ftokun/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apehex%2Ftokun/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apehex","download_url":"https://codeload.github.com/apehex/tokun/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind
":"github","repositories_count":225247646,"owners_count":17444080,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-18T20:30:59.215Z","updated_at":"2025-05-13T17:32:59.678Z","avatar_url":"https://github.com/apehex.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tokun\n\n\u003cimg src=\"header.png\" alt=\"Neural tokenization\" title=\"Source: Image by Author and generated with MidJourney\" width=\"100%\" style=\"margin: auto;\"/\u003e\n\n\u003e **!** this project is largely obsolete, replaced by the layer [TokunEmbedding](https://github.com/apehex/mlable?tab=readme-ov-file#TokunEmbedding)\n\nThe patching technique used in image / video model can be used on text as explained in [this article](https://huggingface.co/blog/apehex/this-title-is-already-tokenized).\n\nIn short, this method reduces 2D spatial data into a 1D sequence fit for transformer architectures.\n\nConversely, text data can be treated as 2D as follows:\n\n- a scalar tensor of `B` strings is encoded using UTF-32-BE: `(B,) =\u003e (B, 4S)`\n- the bytes are grouped by chunks of `N`: `(B, 4S) =\u003e (B, 4S/N, N)`\n- the bytes are embeded independently: `(B, 4S/N, N) =\u003e (B, 4S/N, N, E)`\n- the embeddings are merged N by N: `(B, 4S/N, N, E) =\u003e (B, 4S/N, NE)`\n\n`S` is the limit length for the string inputs and the factor 4 is the number of bytes per character.\n\nThe merged byte embeddings form actual \"token\" embeddings, while keeping the information on composition.\nHence the name \"composite emheddings\".\n\nThere is no more 
need for a VAE or any model to learn token or sentence embeddings.\n\n## Overview\n\n\u003e `to-kun` took tokens to t-can\n\nCurrent tokenizers have notorious issues that are bringing all the LLMs down.\n\n`tokun` is a model specialized in text embedding.\nIt is **lossless** while providing **high input compression**.\n\n`tokun` produces vectors of dimension 256 equivalent to 64 UTF-32-BE bytes.\nI.e., each embedding can be thought of as a *token of length 16 characters*.\n\nBut these vectors are more than basic IDs: they keep meaningful information on their constituent parts.\n\n## Features\n\nThe model produces vector embeddings that can be directly ingested by another model.\n\nRegular tokens are unrelated IDs, while `tokun` has the following properties:\n\n- **international**: `tokun` performs evenly on the whole Unicode space\n- **compression**: the sequence length is divided by 16\n- **embeddings**: the output vectors have a dimension of only 256\n- **lossless**: embeddings store all the information up to the byte level\n- **built-ins**: Unicode has built-in special tokens, no need for `\u003c|im_start|\u003e`\n- **meaningful**: embeddings are natively related to each other based on their parts\n\n## Installation\n\nIn all cases, the model requires the code from the package `tokun`:\n\n```shell\npip install tokun\n```\n\n### From Hugging Face\n\nLog in to Hugging Face:\n\n```shell\nhuggingface-cli login\n```\n\nDownload the repository:\n\n```python\nimport huggingface_hub as hh\n\napi = hh.HfApi()\napi.snapshot_download(repo_id='apehex/tokun', local_dir='tokun/')\n```\n\nImport the tokenizer and model:\n\n```python\nimport tokun.huggingface\n\ntokenizer = tokun.huggingface.ByteTokenizer()\nmodel = hh.from_pretrained_keras('tokun/variants/16x4/')\n```\n\n### With Base Tensorflow / Keras\n\nYou can directly load the weights [from the repository](../models/).\n\nFor the most performant variant of the model, `16x4`:\n\n```python\nimport tensorflow as tf\nimport tokun.model\nimport tokun.pipeline\nimport 
urllib.request\n\nurllib.request.urlretrieve('https://github.com/apehex/tokun/raw/main/models/16x4/1/7.7.keras', 'model.keras')\nmodel = tf.keras.models.load_model('model.keras', compile=False)\n```\n\n## Usage\n\nSince it is small (between 1 and 2M parameters depending on the variant), the model can also be [trained on Google Colab][notebook-file-tokun-train].\n\nWe will be encoding and decoding the following sample:\n\n```python\n__s = \"\"\"Une unité lexicale ou token lexical ou plus simplement token est un couple composé d'un nom et d'une valeur optionnelle (e.g. 135677).\"\"\"\n```\n\n### With Hugging Face\n\nThe sequence dimension is fixed at 512 because exporting the Keras model requires specifying the input shape.\nSo the sample is padded to `16 * 512` characters or `64 * 512` bytes.\n\n```python\n# encode with UTF-32\n__x = tokenizer.batch_encode_plus(batch_text_or_text_pairs=[__s], padding='max_length', max_length=64 * 512, add_special_tokens=False)\n__x = tf.convert_to_tensor(__x['input_ids'])\n# tokenize\n__e = model.layers[1](__x) # encoder\n# these embeddings would be the input of an LLM\n__o = llm(__e) # replace with your LLM\n# detokenize\n__p = model.layers[2](__o) # decoder\n# interpret probabilities as byte indexes\n__y = tokun.pipeline.postprocess(__p)\n```\n\n```python\nprint(len(__s))\n# 252\nprint(__x.shape) # 16 * 512 characters = 64 * 512 bytes\n# (1, 32768)\nprint(__e.shape) # 512 embeddings\n# (1, 512, 256)\nprint(__p.shape) # back to x shape, with 256 probabilities per byte\n# (1, 32768, 256)\n```\n\n\u003e Note: the base Tensorflow implementation operates on any sequence dimension (see below)\n\n### With Base Tensorflow / Keras\n\n```python\n__x = tokun.pipeline.preprocess(text=__s, groups=[4, 16], expand=[1], flatten=True)\n__e = model._encoder(__x) # final embedding = input for another model\n# these embeddings would be the input of an LLM\n__o = llm(__e) # replace with your LLM\n# detokenize\n__p = model._decoder(__o)\n# interpret probabilities as byte indexes\n__y = 
tokun.pipeline.postprocess(__p)\n```\n\nThe original implementation does not fix the sequence dimension:\n\n```python\nprint(len(__s))\n# 252\nprint(__x.shape) # 4 * 252 = 1008 padded to 1024 bytes\n# (1, 1024)\nprint(__e.shape) # ceil(252 / 16) = 1024 / 64 = 16\n# (1, 16, 256)\nprint(__p.shape) # back to x shape, with 256 probabilities per byte\n# (1, 1024, 256)\n```\n\n## Training and evaluation data\n\n`tokun` was **trained on random sequences** of UTF-32-BE bytes, so that it covers the first 4 planes of Unicode.\n\nValidation was also performed on the 7 languages of [MLQA][github-mlqa] to make sure the model keeps its accuracy on regular text.\n\n## Resources\n\n### Notebooks\n\nFinal model:\n\n- train: [file][notebook-file-tokun-train] / [Colab][notebook-colab-tokun-train]\n- demo: [file][notebook-file-tokun-demo] / [Colab][notebook-colab-tokun-demo]\n\nOlder / simpler model iterations:\n\n- `tokun-1`: [file][notebook-file-tokun-1] / [Colab][notebook-colab-tokun-1]\n- `tokun-4`: [file][notebook-file-tokun-4] / [Colab][notebook-colab-tokun-4]\n- `tokun-16`: [file][notebook-file-tokun-16] / [Colab][notebook-colab-tokun-16]\n\n### Articles\n\nMain article:\n\n- on [GitHub][article-file-tokun]\n- on [Hugging Face][article-hugging-face]\n\nNotes on each iteration:\n\n- `tokun-1`: [GitHub][article-file-tokun-1]\n- `tokun-4`: [GitHub][article-file-tokun-4]\n- `tokun-16`: [GitHub][article-file-tokun-16]\n\n## TODO\n\nSee [TODO](TODO.md).\n\n## Credits\n\nThis project was inspired by a video from Andrej Karpathy, [\"Let's build the GPT tokenizer\"][youtube-karpathy-tokenizer].\n\n## License\n\nLicensed under the [AGPLv3](LICENSE.md).\n\n[article-file-tokun]: ../articles/tokun.md\n[article-file-tokun-1]: ../articles/tokun.1.md\n[article-file-tokun-4]: ../articles/tokun.4.md\n[article-file-tokun-16]: ../articles/tokun.16.md\n[article-hugging-face]: https://huggingface.co/blog/apehex/tokenization-is-a-dead-weight\n[article-notion-tokun-1]: 
https://apehex.notion.site/Tokun-1-e03c438a39fe49fcb2ce303eb63b2e73\n[article-notion-tokun-4]: https://apehex.notion.site/Tokun-4-c8b4a3bd1270485a908287869553e9f2\n[article-notion-tokun-16]: https://apehex.notion.site/Tokun-16-ecf35d5207ab401d85d3aa21d0b09538\n\n[notebook-colab-tokun-1]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.1.ipynb\n[notebook-colab-tokun-4]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.4.ipynb\n[notebook-colab-tokun-16]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.16.ipynb\n[notebook-colab-tokun-demo]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.demo.ipynb\n[notebook-colab-tokun-train]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.train.ipynb\n[notebook-file-tokun-1]: ../notebooks/tokun.1.ipynb\n[notebook-file-tokun-4]: ../notebooks/tokun.4.ipynb\n[notebook-file-tokun-16]: ../notebooks/tokun.16.ipynb\n[notebook-file-tokun-demo]: ../notebooks/tokun.demo.ipynb\n[notebook-file-tokun-train]: ../notebooks/tokun.train.ipynb\n[notebook-hf-tokun-demo]: ../notebooks/tokun.demo.ipynb\n[notebook-hf-tokun-train]: ../notebooks/tokun.train.ipynb\n[notebook-kaggle-tokun-demo]: ../notebooks/tokun.demo.ipynb\n[notebook-kaggle-tokun-train]: ../notebooks/tokun.train.ipynb\n\n[youtube-karpathy-tokenizer]: https://www.youtube.com/watch?v=zduSFxRajkE\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapehex%2Ftokun","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapehex%2Ftokun","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapehex%2Ftokun/lists"}