{"id":25284593,"url":"https://github.com/cahya-wirawan/rwkv-tokenizer","last_synced_at":"2025-04-09T07:06:53.141Z","repository":{"id":241780944,"uuid":"807635813","full_name":"cahya-wirawan/rwkv-tokenizer","owner":"cahya-wirawan","description":"A fast RWKV Tokenizer written in Rust","archived":false,"fork":false,"pushed_at":"2025-03-31T18:47:50.000Z","size":1995,"stargazers_count":44,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-07T18:02:45.093Z","etag":null,"topics":["huggingface","llm","rwkv","tiktoken","tokenizer","trie"],"latest_commit_sha":null,"homepage":"https://github.com/cahya-wirawan/rwkv-tokenizer","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cahya-wirawan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-29T13:33:31.000Z","updated_at":"2025-03-31T18:47:53.000Z","dependencies_parsed_at":"2024-09-05T13:06:33.673Z","dependency_job_id":"4aacad50-0668-4040-9608-2afa07d349e6","html_url":"https://github.com/cahya-wirawan/rwkv-tokenizer","commit_stats":null,"previous_names":["cahya-wirawan/rwkv-tokenizer"],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cahya-wirawan%2Frwkv-tokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cahya-wirawan%2Frwkv-tokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cahya-wirawan%2Frwkv-tokenizer/releases","manifests_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cahya-wirawan%2Frwkv-tokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cahya-wirawan","download_url":"https://codeload.github.com/cahya-wirawan/rwkv-tokenizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247704571,"owners_count":20982298,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["huggingface","llm","rwkv","tiktoken","tokenizer","trie"],"created_at":"2025-02-12T20:52:05.084Z","updated_at":"2025-04-09T07:06:53.134Z","avatar_url":"https://github.com/cahya-wirawan.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RWKV Tokenizer\n\n\n[![GitHub Actions Status](https://github.com/cahya-wirawan/rwkv-tokenizer/actions/workflows/CI.yml/badge.svg)](https://github.com/cahya-wirawan/rwkv-tokenizer/actions/)\n[![Pypi.org Version](https://img.shields.io/pypi/v/pyrwkv-tokenizer.svg)](https://pypi.org/project/pyrwkv-tokenizer/)\n[![Pypi.org Downloads](https://img.shields.io/pypi/dd/pyrwkv-tokenizer)](https://pypi.org/project/pyrwkv-tokenizer/)\n[![Crates.io Version](https://img.shields.io/crates/v/rwkv-tokenizer.svg)](https://crates.io/crates/rwkv-tokenizer)\n[![Crates.io Downloads](https://img.shields.io/crates/d/rwkv-tokenizer.svg)](https://crates.io/crates/rwkv-tokenizer)\n[![License: Apache 2.0](https://img.shields.io/badge/license-Apache_2.0-blue.svg)](https://github.com/cahya-wirawan/rwkv-tokenizer/blob/main/LICENSE.txt)\n\n\nA fast RWKV Tokenizer written in Rust that 
supports the World Tokenizer used by the \n[RWKV](https://github.com/BlinkDL/RWKV-LM) v5 and v6 models.\n\n## Installation\nInstall the rwkv-tokenizer Python module:\n```\n$ pip install pyrwkv-tokenizer\n```\n## Usage\n```\n\u003e\u003e\u003e import pyrwkv_tokenizer\n\u003e\u003e\u003e tokenizer = pyrwkv_tokenizer.RWKVTokenizer()\n\u003e\u003e\u003e tokenizer.encode(\"Today is a beautiful day. 今天是美好的一天。\")\n[33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080]\n\u003e\u003e\u003e tokenizer.decode([33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080])\n'Today is a beautiful day. 今天是美好的一天。'\n\u003e\u003e\u003e tokenizer.encode_batch([\"Today is a beautiful day.\", \" 今天是美好的一天。\"])\n[[33520, 4600, 332, 59219, 21509, 47], [33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080]]\n```\n\n## Performance and Validity Test\n\nWe compared the encoding results of the Rust RWKV Tokenizer and the original tokenizer using\nthe English Wikipedia and Chinese poetry datasets. The results are identical. The Rust RWKV Tokenizer also \npasses [the original tokenizer's unit test](https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py). \nThe following steps describe how to run the unit test:\n```\n$ pip install pytest pyrwkv-tokenizer\n$ git clone https://github.com/cahya-wirawan/rwkv-tokenizer.git\n$ cd rwkv-tokenizer\n$ pytest\n```\n\nWe did a performance comparison on [the simple English Wikipedia dataset 20220301.simple](https://huggingface.co/datasets/legacy-datasets/wikipedia)* among the following tokenizers:\n- The original RWKV tokenizer (BlinkDL)\n- Huggingface implementation of RWKV tokenizer\n- Huggingface LLama tokenizer\n- Huggingface Mistral tokenizer\n- Bert tokenizer\n- OpenAI Tiktoken\n- The Rust RWKV tokenizer\n\nThe comparison is done using this [Jupyter notebook](tools/rwkv_tokenizers.ipynb) on an M2 Mac mini. 
The Rust RWKV \ntokenizer is around 17x faster than the original tokenizer and 9.6x faster than OpenAI Tiktoken.\n\n![performance-comparison](data/performance-comparison.png)\n\nWe updated the Rust RWKV world tokenizer to support batch encoding with multithreading. We ran the same comparison\n[script](tools/test_tiktoken-huggingface-rwkv.py)  from the [Huggingface Tokenizers](https://github.com/huggingface/tokenizers)\nwith the additional rwkv tokenizer. The result shows that the rwkv world tokenizer is significantly faster than \nthe Tiktoken and Huggingface tokenizers in all numbers of threads and document sizes (on average, its speed is ten times faster).\n\n![performance-comparison](data/performance-comparison-multithreading.png) \n\n*The simple English Wikipedia dataset can be downloaded as jsonl file from\nhttps://huggingface.co/datasets/cahya/simple-wikipedia/resolve/main/simple-wikipedia.jsonl?download=true\n\n## Tools using this tokenizer\n\nWe also created the [json2bin](https://github.com/cahya-wirawan/json2bin) application to convert datasets from JSONL format \ninto binidx format, a data format used for training RWKV models. 
It uses multithreading to scale up the performance and \ncan convert a dataset more than 70 times faster (around 360 MB/s) than the original \n[json2binidx_tool](https://github.com/Abel2076/json2binidx_tool) written in Python.\n\n## Changelog\n- Version 0.9.1\n  - Added UTF-8 error handling to the decoder\n- Version 0.9.0\n  - Added multithreading for the function encode_batch()\n  - Added batch/multithreading comparison\n- Version 0.3.0\n  - Fixed the issue where some characters were not encoded correctly\n\n*This tokenizer is my very first Rust program, so it might still have many bugs and silly code :-)*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcahya-wirawan%2Frwkv-tokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcahya-wirawan%2Frwkv-tokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcahya-wirawan%2Frwkv-tokenizer/lists"}