{"id":31060790,"url":"https://github.com/mecanik/tiny-bpe-trainer","last_synced_at":"2025-09-15T10:43:58.544Z","repository":{"id":308670965,"uuid":"1033645017","full_name":"Mecanik/Tiny-BPE-Trainer","owner":"Mecanik","description":"Lightweight, header-only Byte Pair Encoding (BPE) trainer in modern C++17. Produces HuggingFace-compatible vocabularies for transformers and integrates with Modern Text Tokenizer.","archived":false,"fork":false,"pushed_at":"2025-08-08T04:34:39.000Z","size":34,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-29T07:57:44.013Z","etag":null,"topics":["bpe","byte-pair-encoding","c17","deep-learning","header-only","huggingface","machine-learning","modern-cpp","natural-language-processing","nlp","no-dependencies","text-processing","tokenization","tokenizer","transformers","vocabulary"],"latest_commit_sha":null,"homepage":"https://mecanik.dev/en/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Mecanik.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":"Mecanik"}},"created_at":"2025-08-07T06:16:12.000Z","updated_at":"2025-08-08T04:34:42.000Z","dependencies_parsed_at":"2025-08-07T08:42:33.967Z","dependency_job_id":null,"html_url":"https://github.com/Mecanik/Tiny-BPE-Trainer","commit_stats":null,"previous_names":["mecanik/tiny-bpe-trainer"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/Mecanik/Tiny-BPE-Trainer","repository_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/Mecanik%2FTiny-BPE-Trainer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mecanik%2FTiny-BPE-Trainer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mecanik%2FTiny-BPE-Trainer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mecanik%2FTiny-BPE-Trainer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Mecanik","download_url":"https://codeload.github.com/Mecanik/Tiny-BPE-Trainer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mecanik%2FTiny-BPE-Trainer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275245665,"owners_count":25430803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-15T02:00:09.272Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bpe","byte-pair-encoding","c17","deep-learning","header-only","huggingface","machine-learning","modern-cpp","natural-language-processing","nlp","no-dependencies","text-processing","tokenization","tokenizer","transformers","vocabulary"],"created_at":"2025-09-15T10:43:54.865Z","updated_at":"2025-09-15T10:43:58.528Z","avatar_url":"https://github.com/Mecanik.png","language":"C++","funding_links":["https://github.com/sponsors/Mecanik"],"categories":[],"sub_categories":[],"readme":"#
 Tiny BPE Trainer – A Fast and Lightweight BPE Trainer in C++\n\nA lightweight, header-only **Byte Pair Encoding (BPE)** trainer implemented in modern C++17/20. \n\nTrain your own tokenizer vocabularies compatible with HuggingFace Transformers or use them with [Modern Text Tokenizer](https://github.com/Mecanik/Modern-Text-Tokenizer) for fast, production-ready tokenization in C++.\n\n[![CI](https://github.com/Mecanik/Tiny-BPE-Trainer/actions/workflows/ci.yaml/badge.svg)](https://github.com/Mecanik/Tiny-BPE-Trainer/actions/workflows/ci.yaml)\n[![License: MIT](https://img.shields.io/github/license/Mecanik/Tiny-BPE-Trainer)](https://github.com/Mecanik/Tiny-BPE-Trainer/blob/main/LICENSE)\n[![C++ Standard](https://img.shields.io/badge/C%2B%2B-17%20%7C%2020-blue)](#)\n![Header-Only](https://img.shields.io/badge/Header--only-✔️-green)\n![No Dependencies](https://img.shields.io/badge/Dependencies-None-brightgreen)\n[![Last Commit](https://img.shields.io/github/last-commit/Mecanik/Tiny-BPE-Trainer)](https://github.com/Mecanik/Tiny-BPE-Trainer/commits/main)\n\n## Features\n\n- **Full BPE Algorithm**: Train subword vocabularies from scratch\n- **Header-Only**: Single file, zero external dependencies\n- **High Performance**: Optimized C++ implementation\n- **HuggingFace Compatible**: Outputs `vocab.txt` and `merges.txt` files\n- **Multiple Formats**: Supports plain text and JSONL input\n- **Configurable**: Lowercase, punctuation splitting, normalization\n- **CLI Ready**: Complete command-line interface\n- **UTF-8 Safe**: Proper Unicode character handling\n\n## Requirements\n\n- **C++17/20** compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)\n- **No external dependencies** - uses only standard library\n\n## Quick Start\n\n### Include the Header\n\n```cpp\n#include \"Tiny-BPE-Trainer.hpp\"\nusing namespace MecanikDev;\n```\n\n### Build the CLI\n\n```bash\ng++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp\n# or\nclang++ -std=c++17 -O3 -o Tiny-BPE-Trainer 
Tiny-BPE-Trainer.cpp\n```\n\n### Basic Training\n\n```cpp\n// Initialize trainer\nTinyBPETrainer trainer;\ntrainer\n    .set_lowercase(true)\n    .set_split_punctuation(true)\n    .set_normalize_whitespace(true);\n\n// Train from text file\nif (trainer.train_from_file(\"corpus.txt\", 16000, 2)) {\n    // Save HuggingFace-compatible files\n    trainer.save_vocab(\"vocab.txt\");\n    trainer.save_merges(\"merges.txt\");\n    \n    // Show statistics\n    trainer.print_stats();\n}\n```\n\n### Test Tokenization\n\n```cpp\n// Test the trained tokenizer\nauto tokens = trainer.tokenize_test(\"Hello, world!\");\n// Result: [\"Hello\", \",\", \"world\", \"!\u003c/w\u003e\"]\n```\n\n## Command Line Interface\n\n### Basic Usage\n\n```bash\n# Quick demo\n./Tiny-BPE-Trainer --demo\n\n# Train from text file\n./Tiny-BPE-Trainer -i corpus.txt -v 16000 -o my_tokenizer\n\n# Train from JSONL dataset\n./Tiny-BPE-Trainer -i dataset.jsonl --jsonl -v 32000\n\n# Test tokenization\n./Tiny-BPE-Trainer --test \"Hello, world! 
This is a test.\"\n```\n\n### All Options\n\n```bash\nOptions:\n  -i, --input \u003cfile\u003e      Input text file or JSONL file\n  -o, --output \u003cprefix\u003e   Output file prefix (default: \"tokenizer\")\n  -v, --vocab-size \u003cnum\u003e  Vocabulary size (default: 32000)  \n  -m, --min-freq \u003cnum\u003e    Minimum frequency for merges (default: 2)\n  --jsonl                 Input is JSONL format\n  --text-field \u003cfield\u003e    JSONL text field name (default: \"text\")\n  --no-lowercase          Don't convert to lowercase\n  --no-punct-split        Don't split punctuation\n  --demo                  Run demo with sample data\n  --test \u003ctext\u003e           Test tokenization on given text\n```\n\n## Training Examples\n\n### Small Dataset (1MB)\n```bash\n./Tiny-BPE-Trainer -i small_corpus.txt -v 8000 -m 2 -o small_tokenizer\n# Expected: ~30 seconds, 8K vocabulary\n```\n\n### Medium Dataset (100MB)\n```bash\n./Tiny-BPE-Trainer -i medium_corpus.txt -v 32000 -m 5 -o medium_tokenizer  \n# Expected: ~10 minutes, 32K vocabulary\n```\n\n### Large Dataset (1GB+)\n```bash\n./Tiny-BPE-Trainer -i large_corpus.txt -v 50000 -m 10 -o large_tokenizer\n# Expected: ~1-2 hours, 50K vocabulary\n```\n\n### JSONL Dataset\n```bash\n./Tiny-BPE-Trainer -i dataset.jsonl --jsonl --text-field content -v 32000\n```\n\n### Plain Text\n```\nThe quick brown fox jumps over the lazy dog.\nMachine learning is a subset of artificial intelligence.\nNatural language processing enables computers to understand human language.\n```\n\n### JSONL Format\n```jsonl\n{\"id\": 1, \"text\": \"The quick brown fox jumps over the lazy dog.\"}\n{\"id\": 2, \"text\": \"Machine learning is a subset of artificial intelligence.\"}\n{\"id\": 3, \"text\": \"Natural language processing enables computers.\"}\n```\n\n### Downloading Corpus with Python (HuggingFace Datasets)\n\nWant to train on real world text like **IMDB reviews**, **Wikipedia**, or **news articles**?\n\nYou can use the Python script 
`download_dataset.py` to download datasets from [HuggingFace Datasets Hub](https://huggingface.co/datasets), and export them into plain `.txt` or `.jsonl` format that works directly with Tiny BPE Trainer.\n\nInstall the requirements first:\n\n```bash\npip install datasets pandas pyarrow\n```\n\n#### Save as Plain Text (corpus.txt)\n\n```python\nfrom datasets import load_dataset\n\n# Load dataset (choose from \"imdb\", \"ag_news\", \"wikitext\", etc.)\ndataset = load_dataset(\"imdb\", split=\"train\")\n\nwith open(\"corpus.txt\", \"w\", encoding=\"utf-8\") as f:\n    for example in dataset:\n        text = example.get(\"text\") or example.get(\"content\")\n        f.write(text.replace(\"\\n\", \" \").strip() + \"\\n\")\n```\n\n#### Save as JSONL (corpus.jsonl)\n\n```python\nimport json\nfrom datasets import load_dataset\n\n# Load dataset (choose from \"imdb\", \"ag_news\", \"wikitext\", etc.)\ndataset = load_dataset(\"imdb\", split=\"train\")\n\nwith open(\"corpus.jsonl\", \"w\", encoding=\"utf-8\") as f:\n    for i, example in enumerate(dataset):\n        f.write(json.dumps({\"id\": i, \"text\": example[\"text\"]}) + \"\\n\")\n```\n\n#### Train with Tiny BPE Trainer\n\n```bash\n# Using plain text\n./Tiny-BPE-Trainer -i corpus.txt -v 16000 -m 2 -o imdb_tokenizer\n\n# Using JSONL\n./Tiny-BPE-Trainer -i corpus.jsonl --jsonl -v 16000 -o imdb_tokenizer\n```\n\n## Output Files\n\n### vocab.txt (HuggingFace Compatible)\n```\n\u003c|endoftext|\u003e\n\u003c|unk|\u003e\n\u003c|pad|\u003e  \n\u003c|mask|\u003e\n!\n\"\n#\n...\nthe\nof\nand\ning\u003c/w\u003e\ner\u003c/w\u003e\n...\n```\n\n### merges.txt (BPE Rules)\n```\n#version: 0.2\ni n\nt h\nth e\ne r\n...\n```\n\n## API Reference\n\n### Core Methods\n\n```cpp\nclass TinyBPETrainer {\n    // Configuration\n    TinyBPETrainer\u0026 set_lowercase(bool enable);\n    TinyBPETrainer\u0026 set_split_punctuation(bool enable);  \n    TinyBPETrainer\u0026 set_normalize_whitespace(bool enable);\n    TinyBPETrainer\u0026 
set_special_tokens(eos, unk, pad, mask);\n    \n    // Training\n    bool train_from_file(filepath, vocab_size=32000, min_freq=2);\n    bool train_from_jsonl(filepath, text_field=\"text\", vocab_size=32000, min_freq=2);\n    \n    // Output\n    bool save_vocab(vocab_path);\n    bool save_merges(merges_path);\n    void print_stats();\n    \n    // Testing  \n    std::vector\u003cstd::string\u003e tokenize_test(text);\n};\n```\n\n### Configuration Options\n\n```cpp\nTinyBPETrainer trainer;\n\ntrainer\n    .set_lowercase(true)              // Convert to lowercase\n    .set_split_punctuation(true)      // Split on punctuation  \n    .set_normalize_whitespace(true)   // Normalize whitespace\n    .set_special_tokens(              // Custom special tokens\n        \"\u003c|endoftext|\u003e\", \n        \"\u003c|unk|\u003e\", \n        \"\u003c|pad|\u003e\", \n        \"\u003c|mask|\u003e\"\n    );\n```\n\n## Integration with Tokenizers\n\n### Use with Modern Text Tokenizer\n\n```cpp\n#include \"Modern-Text-Tokenizer.hpp\" // Tokenizer\n#include \"Tiny-BPE-Trainer.hpp\"    // BPE trainer\n\nusing namespace MecanikDev;\n\n// Train BPE vocabulary\nTinyBPETrainer trainer;\ntrainer.train_from_file(\"corpus.txt\", 16000);\ntrainer.save_vocab(\"my_vocab.txt\");\ntrainer.save_merges(\"my_merges.txt\");\n\n// Use with tokenizer \nTextTokenizer tokenizer;\ntokenizer.load_vocab(\"my_vocab.txt\");\nauto token_ids = tokenizer.encode(\"Hello, world!\");\n```\n\n### Use with HuggingFace\n\n```python\n# Python - load in HuggingFace Tokenizers\nfrom tokenizers import Tokenizer\nfrom tokenizers.models import BPE\n\n# Load our trained BPE\ntokenizer = Tokenizer(BPE(\n    vocab=\"my_vocab.txt\", \n    merges=\"my_merges.txt\"\n))\n\ntokens = tokenizer.encode(\"Hello, world!\")\n```\n\n## Performance\n\n```bash\nStarting BPE training...\n   Input: imdb.txt\n   Format: Plain text\n   Vocab size: 32000\n   Min frequency: 2\n   Output prefix: tokenizer\nReading corpus from: imdb.txt\nProcessed 
33157823 characters, 6952632 words\nUnique word forms: 106008\nInitial vocabulary size: 240\nStarting BPE training...\n    ...\nBPE training completed!\n   Final vocabulary size: 32000\n   Total merges: 31760\n   Training time: 1962 seconds\nSaved vocabulary (32000 tokens) to: tokenizer_vocab.txt\nSaved merges (31760 rules) to: tokenizer_merges.txt\n\nTraining completed successfully!\n   Total time: 1966 seconds\n\nTraining Statistics:\n   Characters processed: 33157823\n   Words processed: 6952632\n   Final vocab size: 32000\n   BPE merges: 31760\n   Compression ratio: 0.0010\n```\n\n*Benchmark on AMD Ryzen 9 5900X, compiled with -O3.*\n\n## Algorithm Details\n\n### BPE Training Process\n\n1. **Preprocessing**\n   - Normalize whitespace  \n   - Convert to lowercase (optional)\n   - Split punctuation (optional)\n\n2. **Character Initialization**\n   ```\n   \"hello\" → [\"h\", \"e\", \"l\", \"l\", \"o\", \"\u003c/w\u003e\"]\n   ```\n\n3. **Iterative Merging**\n   ```\n   Most frequent pair: \"l\" + \"l\" → \"ll\"\n   \"hello\" → [\"h\", \"e\", \"ll\", \"o\", \"\u003c/w\u003e\"]\n   ```\n\n4. **Vocabulary Building**\n   - Characters: `h`, `e`, `l`, `o`, `\u003c/w\u003e`\n   - Merges: `ll`, `he`, `ell`, `hello`\n   - Special tokens: `\u003c|unk|\u003e`, `\u003c|pad|\u003e`, etc.\n\n### Key Features\n\n- **Subword Units**: Handles unknown words through decomposition\n- **Frequency-Based**: Most common patterns get merged first  \n- **Deterministic**: Same corpus always produces same vocabulary\n- **Compression**: Reduces vocabulary size vs. 
word-level tokenization\n\n## Troubleshooting\n\n### Common Issues\n\n**\"Training failed\" Error**\n```bash\n# Check file exists and is readable\nls -la corpus.txt\nfile corpus.txt\n\n# Try smaller vocabulary size\n./Tiny-BPE-Trainer -i corpus.txt -v 8000 -m 1\n```\n\n**Slow Training**\n```bash\n# Increase minimum frequency\n./Tiny-BPE-Trainer -i corpus.txt -v 32000 -m 10\n\n# Use smaller corpus for testing\nhead -n 10000 large_corpus.txt \u003e small_test.txt\n```\n\n**Memory Issues**\n```bash\n# Monitor memory usage\ntop -p $(pgrep Tiny-BPE-Trainer)\n\n# Reduce vocabulary size\n./Tiny-BPE-Trainer -i corpus.txt -v 16000\n```\n\n### Performance Tips\n\n1. **Start Small**: Test with small corpus and vocabulary first\n2. **Adjust min_frequency**: Higher values = faster training, smaller vocab\n3. **Preprocessing**: Clean your corpus for better results\n4. **Incremental**: Train smaller models first, then scale up\n\n## Roadmap\n\n### Planned Features\n\n- [ ] **Parallel Training**: Multi-threaded BPE training\n- [ ] **Streaming Mode**: Process huge files without loading into memory  \n- [ ] **Advanced Preprocessing**: Custom regex patterns, language-specific rules\n- [ ] **Evaluation Metrics**: Compression ratio, OOV handling statistics\n- [ ] **Visualization**: Plot vocabulary growth, merge frequency distributions\n- [ ] **Export Formats**: SentencePiece, custom binary formats\n\n### Future Considerations\n\n- [ ] **Tokenizer Integration**: Seamless loading of trained BPE models\n- [ ] **HuggingFace Plugin**: Direct integration with transformers library\n- [ ] **TensorFlow/PyTorch**: C++ ops for training integration\n\n## Contributing\n\nWe welcome contributions! Areas of interest:\n\n1. **Performance**: SIMD optimizations, better algorithms\n2. **Features**: New preprocessing options, export formats\n3. **Testing**: More edge cases, different languages\n4. 
**Documentation**: Tutorials, examples, use cases\n\n## License\n\nMIT License - see LICENSE file for details.\n\n## Acknowledgments\n\n- Inspired by open-source libraries like [SentencePiece](https://github.com/google/sentencepiece) and [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers)\n- Format compatibility modeled after HuggingFace's `vocab.txt` and `merges.txt` outputs\n- Based on [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909) by Sennrich et al., the paper that adapted Byte Pair Encoding to subword tokenization\n- UTF-8 safety and normalization techniques informed by modern C++ text processing resources\n\n## Learn More\n\n- [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909) - Sennrich et al., the paper that introduced BPE for subword tokenization\n- [SentencePiece](https://github.com/google/sentencepiece) - Google's subword tokenizer library (BPE and unigram)\n- [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) - Fast tokenization library\n\n---\n\n**⭐ Star this repo if you find it useful!**\n\nBuilt with ❤️ for the C++ and NLP community","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmecanik%2Ftiny-bpe-trainer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmecanik%2Ftiny-bpe-trainer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmecanik%2Ftiny-bpe-trainer/lists"}