{"id":31060788,"url":"https://github.com/mecanik/modern-text-tokenizer","last_synced_at":"2025-09-15T10:43:52.603Z","repository":{"id":308310489,"uuid":"1032355050","full_name":"Mecanik/Modern-Text-Tokenizer","owner":"Mecanik","description":"Modern UTF-8 aware C++ tokenizer with vocabulary support, ideal for NLP and transformer models. Header-only and zero-dependency.","archived":false,"fork":false,"pushed_at":"2025-08-07T06:43:54.000Z","size":43,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-29T07:57:43.993Z","etag":null,"topics":["ai","artificial-intelligence","bert","deep-learning","distilbert","header-only","high-performance","machine-learning","modern-cpp","natural-language-processing","nlp","preprocessing","text-analysis","text-encoding","text-processing","text-tokenization","tokenizer","transformer","vocabulary"],"latest_commit_sha":null,"homepage":"https://mecanik.dev/en/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Mecanik.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":"Mecanik"}},"created_at":"2025-08-05T07:27:55.000Z","updated_at":"2025-08-07T06:43:57.000Z","dependencies_parsed_at":"2025-08-05T09:44:42.276Z","dependency_job_id":null,"html_url":"https://github.com/Mecanik/Modern-Text-Tokenizer","commit_stats":null,"previous_names":["mecanik/modern-text-tokenizer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Mecanik/Modern-Text-Tokenizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mecanik%2FModern-Text-Tokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mecanik%2FModern-Text-Tokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mecanik%2FModern-Text-Tokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mecanik%2FModern-Text-Tokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Mecanik","download_url":"https://codeload.github.com/Mecanik/Modern-Text-Tokenizer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mecanik%2FModern-Text-Tokenizer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275245659,"owners_count":25430803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-15T02:00:09.272Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","bert","deep-learning","distilbert","header-only","high-performance","machine-learning","modern-cpp","natural-language-processing","nlp","preprocessing","text-analysis","text-encoding","text-processing","text-tokenization","tokenizer","transformer","vocabulary"],"created_at":"2025-09-15T10:43:50.222Z","updated_at":"2025-09-15T10:43:52.590Z","avatar_url":"https://github.com/Mecanik.png","language":"C++","funding_links":["https://github.com/sponsors/Mecanik"],"categories":[],"sub_categories":[],"readme":"# Modern C++ Text Tokenizer for NLP and Machine Learning\n\nA high-performance, header-only C++17/20 text tokenizer for NLP and machine learning. Supports UTF-8, vocabulary encoding, and special tokens like [CLS], [SEP]. Ideal for BERT, DistilBERT, and transformer models. No dependencies!\n\nUnlike HuggingFace Tokenizers (Python) or ICU, this is a lightweight C++ alternative with no dependencies.\n\nLooking to build a custom tokenizer vocabulary? Use [Tiny BPE Trainer](https://github.com/Mecanik/Tiny-BPE-Trainer) - a fast, header-only Byte Pair Encoding (BPE) trainer in modern C++.  \n\n[![CI](https://github.com/Mecanik/Modern-Text-Tokenizer/actions/workflows/ci.yaml/badge.svg)](https://github.com/Mecanik/Modern-Text-Tokenizer/actions/workflows/ci.yaml)\n[![License: MIT](https://img.shields.io/github/license/Mecanik/Modern-Text-Tokenizer.svg)](https://github.com/Mecanik/Modern-Text-Tokenizer/blob/main/LICENSE)\n[![C++ Standard](https://img.shields.io/badge/C%2B%2B-17%20%7C%2020-blue)](#)\n![Header-Only](https://img.shields.io/badge/Header--only-✔️-green)\n![No Dependencies](https://img.shields.io/badge/Dependencies-None-brightgreen)\n[![Last Commit](https://img.shields.io/github/last-commit/Mecanik/Modern-Text-Tokenizer)](https://github.com/Mecanik/Modern-Text-Tokenizer/commits/main)\n\n## Features\n\n- **Fast**: Zero-copy processing with `std::string_view`\n- **UTF-8 Ready**: Proper handling of Unicode without heavy dependencies\n- **Configurable**: Fluent API for customizing tokenization behavior\n- **Header-Only**: Single file, easy to integrate\n- **ASCII Optimized**: Smart handling of ASCII vs UTF-8 characters\n- **Modern C++**: Uses C++17/20 features for clean, efficient code\n- **Vocabulary Support**: Load/save vocabularies, encode/decode to token IDs\n- **Special Tokens**: Support for [CLS], [SEP], [PAD], [UNK] tokens\n- **ML Ready**: Sequence encoding for transformer models\n\n## Requirements\n\n- **C++17/20** compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)\n- **No external dependencies** - uses only standard library\n\n## Quick Start\n\n```cpp\n#include \"Modern-Text-Tokenizer.hpp\"\nusing namespace MecanikDev;\n\n// Simple tokenization\nauto tokens = TextTokenizer::simple_split(\"Hello, world!\");\n\n// Advanced configuration with vocabulary\nTextTokenizer tokenizer;\n\n// Load vocabulary file\ntokenizer.load_vocab(\"vocab.txt\");\n\nauto token_ids = tokenizer.encode(\"Hello, world!\");\n\nstd::string decoded = tokenizer.decode(token_ids);\n```\n\n## API Reference\n\n### Basic Usage\n\n```cpp\n// Static method for simple whitespace splitting\nstd::vector\u003cstd::string\u003e tokens = TextTokenizer::simple_split(text);\n\n// Full configurability\nTextTokenizer tokenizer;\nstd::vector\u003cstd::string\u003e tokens = tokenizer.tokenize(text);\n```\n\n### Configuration Methods\n\nAll configuration methods return `TextTokenizer\u0026` for method chaining:\n\n```cpp\nusing namespace MecanikDev;\n\nTextTokenizer tokenizer;\ntokenizer\n    .set_lowercase(true)           // Convert to lowercase\n    .set_keep_punctuation(true)    // Keep punctuation as separate tokens\n    .set_split_on_punctuation(true) // Split on punctuation marks\n    .add_delimiter(',')            // Add custom delimiter\n    .add_delimiters(\".,!?\")        // Add multiple delimiters\n    .set_special_tokens(\"[UNK]\", \"[PAD]\", \"[CLS]\", \"[SEP]\"); // Configure special tokens\n```\n\n### Vocabulary Methods\n\n```cpp\n// Load vocabulary from file\ntokenizer.load_vocab(\"vocab.txt\");\n\n// Build vocabulary from training texts\nstd::vector\u003cstd::string\u003e training_texts = {\"Hello world\", \"Machine learning\", ...};\ntokenizer.build_vocab_from_text(training_texts, 2, 30000); // min_freq=2, max_size=30000\n\n// Save vocabulary\ntokenizer.save_vocab(\"my_vocab.txt\");\n\n// Encoding and decoding\nauto token_ids = tokenizer.encode(\"Hello world\");\nstd::string text = tokenizer.decode(token_ids);\n\n// Sequence encoding for ML models\nauto sequence_ids = tokenizer.encode_sequence(\"Hello world\", 512, true); // max_len=512, add_special_tokens=true\n```\n\n### Utility Methods\n\n```cpp\n// Count tokens without storing them (memory efficient)\nsize_t count = tokenizer.count_tokens(text);\n\n// Vocabulary information\nsize_t vocab_size = tokenizer.vocab_size();\nbool has_vocab = tokenizer.has_vocab();\n\n// Special token IDs\nint unk_id = tokenizer.get_unk_id();\nint pad_id = tokenizer.get_pad_id();\nint cls_id = tokenizer.get_cls_id();\nint sep_id = tokenizer.get_sep_id();\n```\n\n## Examples\n\n### Basic Text Processing\n\n```cpp\nusing namespace MecanikDev;\n\nstd::string text = \"Natural language processing is amazing!\";\n\n// [\"Natural\", \"language\", \"processing\", \"is\", \"amazing!\"]\nauto tokens = TextTokenizer::simple_split(text);\n```\n\n### Building and Using Vocabulary\n\n```cpp\n// Create tokenizer and build vocabulary from training data\nTextTokenizer tokenizer;\nstd::vector\u003cstd::string\u003e training_texts = {\n    \"The quick brown fox jumps\",\n    \"Machine learning is fascinating\",\n    \"Natural language processing rocks\"\n};\n\ntokenizer\n    .set_lowercase(true)\n    .set_split_on_punctuation(true)\n    .build_vocab_from_text(training_texts, 1, 1000);\n\n// Save vocabulary for later use\ntokenizer.save_vocab(\"my_vocab.txt\");\n\n// Encode text to token IDs\nauto ids = tokenizer.encode(\"Machine learning rocks!\");\n// Example: [1, 156, 234, 445, 2] where 1=[CLS], 2=[SEP], etc.\n\n// Decode back to text\nstd::string decoded = tokenizer.decode(ids);\n```\n\n### ML Model Integration\n\n```cpp\n// Load pre-trained vocabulary\nTextTokenizer tokenizer;\ntokenizer.load_vocab(\"bert_vocab.txt\");\n\n// Prepare sequence for BERT-style model\nauto input_ids = tokenizer.encode_sequence(\n    \"Hello world! How are you?\", \n    128,    // max_length\n    true    // add_special_tokens ([CLS] and [SEP])\n);\n\n// Result: [101, 7592, 2088, 999, 2129, 2024, 2017, 1029, 102, ...]\n//         [CLS] Hello world !   How   are  you  ?   [SEP] ...\n```\n\n### Preprocessing for ML\n\n```cpp\nusing namespace MecanikDev;\n\nTextTokenizer preprocessor;\npreprocessor\n    .set_lowercase(true)\n    .set_split_on_punctuation(true);\n\n// [\"hello\", \"world\"]\nauto tokens = preprocessor.tokenize(\"Hello, World!\");\n```\n\n### Keeping Punctuation for Analysis\n\n```cpp\nTextTokenizer analyzer;\nanalyzer\n    .set_keep_punctuation(true)\n    .set_split_on_punctuation(true);\n\n// [\"What\", \"?\", \"!\", \"Really\", \"?\"]\nauto tokens = analyzer.tokenize(\"What?! Really?\");\n```\n\n### Custom Delimiters\n\n```cpp\nTextTokenizer csv_tokenizer;\ncsv_tokenizer.add_delimiters(\",;|\");\n\n// [\"name\", \"age\", \"city\", \"country\"]\nauto fields = csv_tokenizer.tokenize(\"name,age;city|country\");\n```\n\n### Unicode Support\n\n```cpp\nusing namespace MecanikDev;\n\nstd::string multilingual = \"Hello 世界 🌍 مرحبا\";\nauto tokens = TextTokenizer::simple_split(multilingual);\n// [\"Hello\", \"世界\", \"🌍\", \"مرحبا\"]\n\n// Lowercase preserves non-ASCII characters\nauto lower_tokens = TextTokenizer()\n    .set_lowercase(true)\n    .tokenize(\"Hello 世界\");\n// [\"hello\", \"世界\"] - Chinese characters preserved\n```\n\n### Loading DistilBERT Vocabulary\n\n```bash\n# Download the DistilBERT vocabulary\ncurl -o vocab.txt https://huggingface.co/distilbert/distilbert-base-uncased/raw/main/vocab.txt\n\n# Or using wget\nwget https://huggingface.co/distilbert/distilbert-base-uncased/raw/main/vocab.txt\n```\n\n```cpp\nusing namespace MecanikDev;\n\n// Load DistilBERT vocabulary\nTextTokenizer tokenizer;\nif (tokenizer.load_vocab(\"vocab.txt\")) {\n    std::cout \u003c\u003c \"Loaded \" \u003c\u003c tokenizer.vocab_size() \u003c\u003c \" tokens\" \u003c\u003c std::endl;\n    \n    // Configure for DistilBERT-style tokenization\n    tokenizer\n        .set_lowercase(true)           // DistilBERT uses lowercase\n        .set_split_on_punctuation(true)\n        .set_keep_punctuation(true);\n    \n    // Test encoding\n    auto token_ids = tokenizer.encode(\"Hello, world!\");\n    // Result: [7592, 1010, 2088, 999] (example IDs)\n    \n    // Encode with special tokens for ML\n    auto sequence = tokenizer.encode_sequence(\"Hello, world!\", 512, true);\n    // Result: [101, 7592, 1010, 2088, 999, 102] ([CLS] + tokens + [SEP])\n    \n    // Decode back\n    std::string text = tokenizer.decode(token_ids);\n    // Result: \"hello , world !\"\n}\n```\n\n## Architecture\n\n### Design Principles\n\n1. **Zero Dependencies**: No ICU, Boost, or other heavy libraries\n2. **UTF-8 Safe**: Detects UTF-8 boundaries without corrupting multibyte sequences\n3. **ASCII Optimized**: Fast path for ASCII operations (case conversion, punctuation)\n4. **Memory Efficient**: Minimal allocations during tokenization\n5. **Configurable**: Fluent interface for different use cases\n\n### Performance Characteristics\n\n- **Time Complexity**: O(n) where n is input length\n- **Space Complexity**: O(t) where t is number of tokens\n- **UTF-8 Handling**: O(1) character boundary detection\n- **Memory**: Uses `string_view` for zero-copy input processing\n\n## Performance\n\nBenchmark results on a typical text corpus:\n\n```\nPerformance test with 174000 characters\n\nResults:\n  Tokenization: 2159 μs (22000 tokens)\n  Encoding:     1900 μs\n  Decoding:     430 μs\n  Total time:   4.49 ms\n  Throughput:   36.97 MB/s\n```\n\n*Benchmark on AMD Ryzen 9 5900X, compiled with -O3.*\n\n## Building\n\n### Single File Integration\n\nSimply include the header:\n\n```cpp\n#include \"Modern-Text-Tokenizer.hpp\"\n```\n\n### CMake Integration\n\n```cmake\n# Add to your CMakeLists.txt\nadd_executable(your_app main.cpp Modern-Text-Tokenizer.hpp)\ntarget_compile_features(your_app PRIVATE cxx_std_17)\n```\n\n### Compilation Example\n\n```bash\ng++ -std=c++17 -O3 -o tokenizer_demo main.cpp\nclang++ -std=c++17 -O3 -o tokenizer_demo main.cpp\n```\n\n## Testing\n\nThe included demo shows various tokenization scenarios:\n\n```bash\n./tokenizer_demo\n```\n\nExpected output includes:\n- Basic tokenization examples\n- Unicode handling demonstration\n- Performance benchmarks\n- Configuration examples\n\n## Roadmap\n\n### Planned Features\n\n- [ ] **Regex Support**: Pattern-based tokenization\n- [ ] **Streaming API**: Process large files without loading into memory\n- [ ] **Parallel Processing**: Multi-threaded batch tokenization\n- [ ] **Custom Normalizers**: User-defined text preprocessing\n- [ ] **Subword Tokenization**: BPE/WordPiece support\n- [ ] **Benchmark Suite**: Comprehensive performance testing\n\n### Future Considerations\n\n- [ ] **C++20 Features**: Ranges, concepts, and modules\n- [ ] **SIMD Optimization**: Vectorized string processing\n- [ ] **Memory Mapping**: For huge file processing\n- [ ] **Language Detection**: Automatic handling of different scripts\n\n## Contributing\n\nContributions welcome! Areas of interest:\n\n1. **Performance Optimization**: SIMD, better algorithms\n2. **Unicode Enhancement**: Better normalization without ICU\n3. **Testing**: More edge cases and benchmarks\n4. **Documentation**: Examples and tutorials\n\n## License\n\nMIT License - see LICENSE file for details.\n\n## Acknowledgments\n\n- Inspired by modern tokenization libraries like HuggingFace Tokenizers\n- UTF-8 handling techniques from various C++ Unicode resources\n- Performance optimizations learned from high-performance text processing\n\n---\n\n**⭐ Star this repo if you find it useful!**\n\nBuilt with ❤️ for the C++ and NLP community\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmecanik%2Fmodern-text-tokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmecanik%2Fmodern-text-tokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmecanik%2Fmodern-text-tokenizer/lists"}