{"id":15032434,"url":"https://github.com/guillaume-be/rust-tokenizers","last_synced_at":"2025-05-15T20:00:35.652Z","repository":{"id":36949034,"uuid":"220668276","full_name":"guillaume-be/rust-tokenizers","owner":"guillaume-be","description":"Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models","archived":false,"fork":false,"pushed_at":"2023-10-01T08:37:10.000Z","size":1173,"stargazers_count":313,"open_issues_count":6,"forks_count":29,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-05-10T07:43:54.214Z","etag":null,"topics":["deep-learning","rust-lang","tokenizer","transformer"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/guillaume-be.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-11-09T16:14:11.000Z","updated_at":"2025-05-06T10:53:00.000Z","dependencies_parsed_at":"2023-02-15T23:16:00.645Z","dependency_job_id":"2a723e81-c8a7-4134-b8f5-91083c41054e","html_url":"https://github.com/guillaume-be/rust-tokenizers","commit_stats":{"total_commits":368,"total_committers":8,"mean_commits":46.0,"dds":0.09239130434782605,"last_synced_commit":"cf88ec4f9df03d1fad5198fb741dc48f57850da3"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guillaume-be%2Frust-tokenizers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guillaume-be%2Frust-tokenizers/tags","releases_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guillaume-be%2Frust-tokenizers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guillaume-be%2Frust-tokenizers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/guillaume-be","download_url":"https://codeload.github.com/guillaume-be/rust-tokenizers/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254414454,"owners_count":22067261,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","rust-lang","tokenizer","transformer"],"created_at":"2024-09-24T20:18:23.563Z","updated_at":"2025-05-15T20:00:35.054Z","avatar_url":"https://github.com/guillaume-be.png","language":"Rust","funding_links":[],"categories":["🔹 **WordPiece Tokenizer Implementations**"],"sub_categories":[],"readme":"# rust-tokenizers\n\n[![Build Status](https://github.com/guillaume-be/rust-tokenizers/workflows/Build/badge.svg?event=push)](https://github.com/guillaume-be/rust-tokenizers/actions)\n[![Latest version](https://img.shields.io/crates/v/rust_tokenizers.svg)](https://crates.io/crates/rust_tokenizers)\n![License](https://img.shields.io/crates/l/rust_tokenizers.svg)\n\nRust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models.\nThese tokenizers are used in the [rust-bert](https://github.com/guillaume-be/rust-bert) crate.\nA broad range of tokenizers for state-of-the-art transformers architectures is included, 
namely:\n- Sentence Piece (unigram model)\n- Sentence Piece (BPE model)\n- BERT\n- ALBERT\n- DistilBERT\n- RoBERTa\n- GPT\n- GPT2\n- ProphetNet\n- CTRL\n- Pegasus\n- MBart50\n- M2M100\n- DeBERTa\n- DeBERTa (v2)\n\nThe WordPiece-based tokenizers include both single-threaded and multi-threaded processing. The Byte-Pair-Encoding tokenizers favor the use of a shared cache and are only available as single-threaded tokenizers.\nUsing the tokenizers requires manually downloading the files they need (vocabulary or merge files). These can be found in the [Transformers library](https://github.com/huggingface/transformers).\n\nThe SentencePiece model loads the same `.model` proto files as the [C++ library](https://github.com/google/sentencepiece).\n\n# Usage example (Rust)\n\n```rust\nuse std::path::PathBuf;\n\nuse rust_tokenizers::adapters::Example;\nuse rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};\nuse rust_tokenizers::vocab::{BertVocab, Vocab};\n\nlet lowercase: bool = true;\nlet strip_accents: bool = true;\nlet vocab_path: PathBuf = PathBuf::from(\"path/to/vocab\");\nlet vocab: BertVocab = BertVocab::from_file(\u0026vocab_path)?;\nlet test_sentence: Example = Example::new_from_string(\"This is a sample sentence to be tokenized\");\nlet bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab, lowercase, strip_accents);\n\nprintln!(\"{:?}\", bert_tokenizer.encode(\u0026test_sentence.sentence_1,\n                                       None,\n                                       128,\n                                       \u0026TruncationStrategy::LongestFirst,\n                                       0));\n```\n\n\n# Python bindings set-up\n\nRust-tokenizer requires a Rust nightly build in order to use the Python API. Building from source involves the following steps:\n\n1. Install Rust and use the nightly toolchain.\n2. Run `python setup.py install` in the `/python-bindings` directory. This will compile the Rust library and install the Python API.\n3. 
Example usage is available in the `/tests` folder, including benchmark and integration tests.\n\nThe library is fully unit tested at the Rust level.\n\n# Usage example (Python)\n\n```python\nimport torch\nfrom rust_transformers import PyBertTokenizer\nfrom transformers import BertForSequenceClassification\n\nrust_tokenizer = PyBertTokenizer('bert-base-uncased-vocab.txt')\nmodel = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=False).cuda()\nmodel = model.eval()\n\nsentence = '''For instance, on the planet Earth, man had always assumed that he was more intelligent than dolphins because \n              he had achieved so much—the wheel, New York, wars and so on—whilst all the dolphins had ever done was muck \n              about in the water having a good time. But conversely, the dolphins had always believed that they were far \n              more intelligent than man—for precisely the same reasons.'''\n\nfeatures = rust_tokenizer.encode(sentence, max_len=128, truncation_strategy='only_first', stride=0)\ninput_ids = torch.tensor([f.token_ids for f in features], dtype=torch.long).cuda()\n\nwith torch.no_grad():\n    output = model(input_ids)[0].cpu().numpy()\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguillaume-be%2Frust-tokenizers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fguillaume-be%2Frust-tokenizers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguillaume-be%2Frust-tokenizers/lists"}