# Modern C++ Text Tokenizer for NLP and Machine Learning

A high-performance, header-only C++17/20 text tokenizer for NLP and machine learning. Supports UTF-8, vocabulary encoding, and special tokens like [CLS], [SEP]. Ideal for BERT, DistilBERT, and transformer models. No dependencies!

Unlike HuggingFace Tokenizers (Python) or ICU, this is a lightweight C++ alternative with no dependencies.

Looking to build a custom tokenizer vocabulary? Use [Tiny BPE Trainer](https://github.com/Mecanik/Tiny-BPE-Trainer) - a fast, header-only Byte Pair Encoding (BPE) trainer in modern C++.

[![CI](https://github.com/Mecanik/Modern-Text-Tokenizer/actions/workflows/ci.yaml/badge.svg)](https://github.com/Mecanik/Modern-Text-Tokenizer/actions/workflows/ci.yaml)
[![License: MIT](https://img.shields.io/github/license/Mecanik/Modern-Text-Tokenizer.svg)](https://github.com/Mecanik/Modern-Text-Tokenizer/blob/main/LICENSE)
[![C++ Standard](https://img.shields.io/badge/C%2B%2B-17%20%7C%2020-blue)](#)
![Header-Only](https://img.shields.io/badge/Header--only-✔️-green)
![No Dependencies](https://img.shields.io/badge/Dependencies-None-brightgreen)
[![Last Commit](https://img.shields.io/github/last-commit/Mecanik/Modern-Text-Tokenizer)](https://github.com/Mecanik/Modern-Text-Tokenizer/commits/main)

## Features

- **Fast**: Zero-copy processing with `std::string_view`
- **UTF-8 Ready**: Proper handling of Unicode without heavy dependencies
- **Configurable**: Fluent API for customizing tokenization behavior
- **Header-Only**: Single file, easy to integrate
- **ASCII Optimized**: Smart handling of ASCII vs UTF-8 characters
- **Modern C++**: Uses C++17/20 features for clean, efficient code
- **Vocabulary Support**: Load/save vocabularies, encode/decode to token IDs
- **Special Tokens**: Support for [CLS], [SEP], [PAD], [UNK] tokens
- **ML Ready**: Sequence encoding for transformer models

## Requirements

- **C++17/20** compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
- **No external dependencies** - uses only standard library

## Quick Start

```cpp
#include "Modern-Text-Tokenizer.hpp"
using namespace MecanikDev;

// Simple tokenization
auto tokens = TextTokenizer::simple_split("Hello, world!");

// Advanced configuration with vocabulary
TextTokenizer tokenizer;

// Load vocabulary file
tokenizer.load_vocab("vocab.txt");

auto token_ids = tokenizer.encode("Hello, world!");

std::string decoded = tokenizer.decode(token_ids);
```

## API Reference

### Basic Usage

```cpp
// Static method for simple whitespace splitting
auto tokens = TextTokenizer::simple_split(text);

// Full configurability
TextTokenizer tokenizer;
auto tokens = tokenizer.tokenize(text);
```

### Configuration Methods

All configuration methods return `TextTokenizer&` for method chaining:

```cpp
using namespace MecanikDev;

TextTokenizer tokenizer;
tokenizer
    .set_lowercase(true)                // Convert to lowercase
    .set_keep_punctuation(true)         // Keep punctuation as separate tokens
    .set_split_on_punctuation(true)     // Split on punctuation marks
    .add_delimiter(',')                 // Add a custom delimiter
    .add_delimiters(".,!?")             // Add multiple delimiters
    .set_special_tokens("[UNK]", "[PAD]", "[CLS]", "[SEP]"); // Configure special tokens
```

### Vocabulary Methods

```cpp
// Load vocabulary from file
tokenizer.load_vocab("vocab.txt");

// Build vocabulary from training texts
std::vector<std::string> training_texts = {"Hello world", "Machine learning", /* ... */};
tokenizer.build_vocab_from_text(training_texts, 2, 30000); // min_freq=2, max_size=30000

// Save vocabulary
tokenizer.save_vocab("my_vocab.txt");

// Encoding and decoding
auto token_ids = tokenizer.encode("Hello world");
std::string text = tokenizer.decode(token_ids);

// Sequence encoding for ML models
auto sequence_ids = tokenizer.encode_sequence("Hello world", 512, true); // max_len=512, add_special_tokens=true
```
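
For batched model input you typically need every sequence padded to the same length, plus an attention mask. A minimal sketch, assuming `encode_sequence()` returns `std::vector<int>` and truncates to `max_len` without padding (if the library already pads internally, the `resize` calls below are no-ops); `encode_batch` is a hypothetical helper, not part of the library:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>
#include "Modern-Text-Tokenizer.hpp"
using namespace MecanikDev;

// Pad each encoded sequence to max_len with [PAD] and build an attention
// mask (1 = real token, 0 = padding), as transformer models expect.
std::pair<std::vector<std::vector<int>>, std::vector<std::vector<int>>>
encode_batch(TextTokenizer& tokenizer,
             const std::vector<std::string>& texts, std::size_t max_len) {
    std::vector<std::vector<int>> input_ids;
    std::vector<std::vector<int>> attention_mask;
    const int pad_id = tokenizer.get_pad_id();

    for (const auto& text : texts) {
        auto ids = tokenizer.encode_sequence(text, max_len, true);
        std::vector<int> mask(ids.size(), 1); // real tokens so far
        ids.resize(max_len, pad_id);          // pad up to max_len
        mask.resize(max_len, 0);              // mask out the padding
        input_ids.push_back(std::move(ids));
        attention_mask.push_back(std::move(mask));
    }
    return {std::move(input_ids), std::move(attention_mask)};
}
```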

### Utility Methods

```cpp
// Count tokens without storing them (memory efficient)
size_t count = tokenizer.count_tokens(text);

// Vocabulary information
size_t vocab_size = tokenizer.vocab_size();
bool has_vocab = tokenizer.has_vocab();

// Special token IDs
int unk_id = tokenizer.get_unk_id();
int pad_id = tokenizer.get_pad_id();
int cls_id = tokenizer.get_cls_id();
int sep_id = tokenizer.get_sep_id();
```

## Examples

### Basic Text Processing

```cpp
using namespace MecanikDev;

std::string text = "Natural language processing is amazing!";

// ["Natural", "language", "processing", "is", "amazing!"]
auto tokens = TextTokenizer::simple_split(text);
```

### Building and Using Vocabulary

```cpp
// Create tokenizer and build vocabulary from training data
TextTokenizer tokenizer;
std::vector<std::string> training_texts = {
    "The quick brown fox jumps",
    "Machine learning is fascinating",
    "Natural language processing rocks"
};

tokenizer
    .set_lowercase(true)
    .set_split_on_punctuation(true)
    .build_vocab_from_text(training_texts, 1, 1000);

// Save vocabulary for later use
tokenizer.save_vocab("my_vocab.txt");

// Encode text to token IDs
auto ids = tokenizer.encode("Machine learning rocks!");
// Example: [156, 234, 445] (IDs depend on the built vocabulary;
// encode_sequence() adds [CLS]/[SEP] if you need them)

// Decode back to text
std::string decoded = tokenizer.decode(ids);
```

### ML Model Integration

```cpp
// Load pre-trained vocabulary
TextTokenizer tokenizer;
tokenizer.load_vocab("bert_vocab.txt");

// Prepare sequence for BERT-style model
auto input_ids = tokenizer.encode_sequence(
    "Hello world! How are you?",
    128,  // max_length
    true  // add_special_tokens ([CLS] and [SEP])
);

// Result: [101, 7592, 2088, 999, 2129, 2024, 2017, 1029, 102, ...]
// [CLS] Hello world ! How are you ? [SEP] ...
```

### Preprocessing for ML

```cpp
using namespace MecanikDev;

TextTokenizer preprocessor;
preprocessor
    .set_lowercase(true)
    .set_split_on_punctuation(true);

// ["hello", "world"]
auto tokens = preprocessor.tokenize("Hello, World!");
```

### Keeping Punctuation for Analysis

```cpp
TextTokenizer analyzer;
analyzer
    .set_keep_punctuation(true)
    .set_split_on_punctuation(true);

// ["What", "?", "!", "Really", "?"]
auto tokens = analyzer.tokenize("What?! Really?");
```

### Custom Delimiters

```cpp
TextTokenizer csv_tokenizer;
csv_tokenizer.add_delimiters(",;|");

// ["name", "age", "city", "country"]
auto fields = csv_tokenizer.tokenize("name,age;city|country");
```

### Unicode Support

```cpp
using namespace MecanikDev;

std::string multilingual = "Hello 世界 🌍 مرحبا";
auto tokens = TextTokenizer::simple_split(multilingual);
// ["Hello", "世界", "🌍", "مرحبا"]

// Lowercase preserves non-ASCII characters
auto lower_tokens = TextTokenizer()
    .set_lowercase(true)
    .tokenize("Hello 世界");
// ["hello", "世界"] - Chinese characters preserved
```

### Loading DistilBERT Vocabulary

```bash
# Download the DistilBERT vocabulary
curl -o vocab.txt https://huggingface.co/distilbert/distilbert-base-uncased/raw/main/vocab.txt

# Or using wget
wget https://huggingface.co/distilbert/distilbert-base-uncased/raw/main/vocab.txt
```

```cpp
using namespace MecanikDev;

// Load DistilBERT vocabulary
TextTokenizer tokenizer;
if (tokenizer.load_vocab("vocab.txt")) {
    std::cout << "Loaded " << tokenizer.vocab_size() << " tokens" << std::endl;

    // Configure for DistilBERT-style tokenization
    tokenizer
        .set_lowercase(true) // DistilBERT uses lowercase
        .set_split_on_punctuation(true)
        .set_keep_punctuation(true);

    // Test encoding
    auto token_ids = tokenizer.encode("Hello, world!");
    // Result: [7592, 1010, 2088, 999] (example IDs)

    // Encode with special tokens for ML
    auto sequence = tokenizer.encode_sequence("Hello, world!", 512, true);
    // Result: [101, 7592, 1010, 2088, 999, 102] ([CLS] + tokens + [SEP])

    // Decode back
    std::string text = tokenizer.decode(token_ids);
    // Result: "hello , world !"
}
```

## Architecture

### Design Principles

1. **Zero Dependencies**: No ICU, Boost, or other heavy libraries
2. **UTF-8 Safe**: Detects UTF-8 boundaries without corrupting multibyte sequences (see the sketch below)
3. **ASCII Optimized**: Fast path for ASCII operations (case conversion, punctuation)
4. **Memory Efficient**: Minimal allocations during tokenization
5. **Configurable**: Fluent interface for different use cases
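
As a rough illustration of principles 2 and 3: UTF-8 lead and continuation bytes are distinguishable from their top bits alone, so character boundaries can be found in O(1) per character without lookup tables or ICU. A minimal sketch of the idea, not the library's actual internals:

```cpp
#include <cstddef>
#include <string_view>

// A byte is a UTF-8 continuation byte iff it has the form 10xxxxxx.
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Length of the UTF-8 sequence starting at this lead byte:
// 0xxxxxxx -> 1 (the ASCII fast path), 110xxxxx -> 2,
// 1110xxxx -> 3, 11110xxx -> 4.
inline std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1; // invalid lead byte: skip a single byte
}

// Advance i to the start of the next character without ever
// splitting a multibyte sequence.
inline void next_char(std::string_view s, std::size_t& i) {
    i += sequence_length(static_cast<unsigned char>(s[i]));
}
```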

### Performance Characteristics

- **Time Complexity**: O(n) where n is input length
- **Space Complexity**: O(t) where t is number of tokens
- **UTF-8 Handling**: O(1) character boundary detection
- **Memory**: Uses `string_view` for zero-copy input processing

## Performance

Benchmark results on a typical text corpus:

```
Performance test with 174000 characters

Results:
Tokenization: 2159 μs (22000 tokens)
Encoding: 1900 μs
Decoding: 430 μs
Total time: 4.49 ms
Throughput: 36.97 MB/s
```

*Benchmark on AMD Ryzen 9 5900X, compiled with -O3.*
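
To reproduce the shape of these numbers locally, a minimal harness along these lines works, using only the API documented above (the corpus here is a placeholder; the repo's demo contains the actual benchmark):

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include "Modern-Text-Tokenizer.hpp"
using namespace MecanikDev;

int main() {
    TextTokenizer tokenizer;
    tokenizer.load_vocab("vocab.txt");

    std::string corpus = "Hello, world!"; // substitute your own test corpus

    auto t0 = std::chrono::steady_clock::now();
    auto tokens = tokenizer.tokenize(corpus);
    auto t1 = std::chrono::steady_clock::now();
    auto ids = tokenizer.encode(corpus);
    auto t2 = std::chrono::steady_clock::now();
    auto decoded = tokenizer.decode(ids);
    auto t3 = std::chrono::steady_clock::now();

    // Report each stage in microseconds, as in the output above.
    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::cout << "Tokenization: " << us(t0, t1) << " us ("
              << tokens.size() << " tokens)\n"
              << "Encoding: " << us(t1, t2) << " us\n"
              << "Decoding: " << us(t2, t3) << " us ("
              << decoded.size() << " bytes)\n";
}
```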

## Building

### Single File Integration

Simply include the header:

```cpp
#include "Modern-Text-Tokenizer.hpp"
```

### CMake Integration

```cmake
# Add to your CMakeLists.txt
add_executable(your_app main.cpp Modern-Text-Tokenizer.hpp)
target_compile_features(your_app PRIVATE cxx_std_17)
```

### Compilation Example

```bash
g++ -std=c++17 -O3 -o tokenizer_demo main.cpp
clang++ -std=c++17 -O3 -o tokenizer_demo main.cpp
```

## Testing

The included demo shows various tokenization scenarios:

```bash
./tokenizer_demo
```

Expected output includes:
- Basic tokenization examples
- Unicode handling demonstration
- Performance benchmarks
- Configuration examples
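
If you just want something to compile against, a minimal stand-in covering the scenarios above might look like this (the shipped demo's actual code and output may differ):

```cpp
#include <iostream>
#include "Modern-Text-Tokenizer.hpp"
using namespace MecanikDev;

int main() {
    // Basic tokenization
    for (const auto& t : TextTokenizer::simple_split("Hello, world!"))
        std::cout << t << '\n';

    // Unicode handling
    for (const auto& t : TextTokenizer::simple_split("Hello 世界 🌍"))
        std::cout << t << '\n';

    // Configuration example: lowercase + punctuation splitting
    TextTokenizer tokenizer;
    tokenizer.set_lowercase(true).set_split_on_punctuation(true);
    for (const auto& t : tokenizer.tokenize("What?! Really?"))
        std::cout << t << '\n';
}
```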

## Roadmap

### Planned Features

- [ ] **Regex Support**: Pattern-based tokenization
- [ ] **Streaming API**: Process large files without loading into memory
- [ ] **Parallel Processing**: Multi-threaded batch tokenization
- [ ] **Custom Normalizers**: User-defined text preprocessing
- [ ] **Subword Tokenization**: BPE/WordPiece support
- [ ] **Benchmark Suite**: Comprehensive performance testing

### Future Considerations

- [ ] **C++20 Features**: Ranges, concepts, and modules
- [ ] **SIMD Optimization**: Vectorized string processing
- [ ] **Memory Mapping**: For huge file processing
- [ ] **Language Detection**: Automatic handling of different scripts

## Contributing

Contributions welcome! Areas of interest:

1. **Performance Optimization**: SIMD, better algorithms
2. **Unicode Enhancement**: Better normalization without ICU
3. **Testing**: More edge cases and benchmarks
4. **Documentation**: Examples and tutorials

## License

MIT License - see LICENSE file for details.

## Acknowledgments

- Inspired by modern tokenization libraries like HuggingFace Tokenizers
- UTF-8 handling techniques from various C++ Unicode resources
- Performance optimizations learned from high-performance text processing

---

**⭐ Star this repo if you find it useful!**

Built with ❤️ for the C++ and NLP community