https://github.com/scientist-labs/tokenkit

Fast, Rust-backed word-level tokenization for Ruby. Unlike subword tokenizers (BPE, WordPiece) designed for LLMs, TokenKit provides linguistic tokenization for search engines, text mining, and NLP pipelines—preserving domain-specific patterns like gene names, measurements, and technical terms while handling Unicode correctly.
https://github.com/scientist-labs/tokenkit
nlp ruby tokenizer
Last synced: 9 months ago
JSON representation
Host: GitHub
URL: https://github.com/scientist-labs/tokenkit
Owner: scientist-labs
License: mit
Created: 2025-09-28T13:23:56.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-09-29T20:13:21.000Z (10 months ago)
Last Synced: 2025-10-05T18:50:42.670Z (9 months ago)
Topics: nlp, ruby, tokenizer
Language: Ruby
Homepage:
Size: 998 KB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project

README

          

Fast, Rust-backed word-level tokenization for Ruby with pattern preservation.

TokenKit is a Ruby wrapper around Rust's [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) crate, providing lightweight, Unicode-aware tokenization designed for NLP pipelines, search applications, and text processing where you need consistent, high-quality word segmentation.

## Quickstart

```ruby

# Install the gem

gem install tokenkit

# Or add to your Gemfile

gem 'tokenkit'

```

```ruby

require 'tokenkit'

# Basic tokenization - handles Unicode, contractions, accents

TokenKit.tokenize("Hello, world! café can't")

# => ["hello", "world", "café", "can't"]

# Preserve domain-specific terms even when lowercasing

TokenKit.configure do |config|

  config.lowercase = true

  config.preserve_patterns = [

    /\d+ug/i,           # Measurements: 100ug

    /[A-Z][A-Z0-9]+/    # Gene names: BRCA1, TP53

  ]

end

TokenKit.tokenize("Patient received 100ug for BRCA1 study")

# => ["patient", "received", "100ug", "for", "BRCA1", "study"]

```

## Features

- **Thirteen tokenization strategies**: whitespace, unicode (recommended), custom regex patterns, sentence, grapheme, keyword, edge n-gram, n-gram, path hierarchy, URL/email-aware, character group, letter, and lowercase

- **Pattern preservation**: Keep domain-specific terms (gene names, measurements, antibodies) intact even with case normalization

- **Fast**: Rust-backed implementation (~100K docs/sec)

- **Thread-safe**: Safe for concurrent use

- **Simple API**: Configure once, use everywhere

- **Zero dependencies**: Pure Ruby API with Rust extension

## Tokenization Strategies

### Unicode (Recommended)

Uses Unicode word segmentation for proper handling of contractions, accents, and multi-language text.

**✅ Supports `preserve_patterns`**

```ruby

TokenKit.configure do |config|

  config.strategy = :unicode

  config.lowercase = true

end

TokenKit.tokenize("Don't worry about café!")

# => ["don't", "worry", "about", "café"]

```

### Whitespace

Simple whitespace splitting.

**✅ Supports `preserve_patterns`**

```ruby

TokenKit.configure do |config|

  config.strategy = :whitespace

  config.lowercase = true

end

TokenKit.tokenize("hello world")

# => ["hello", "world"]

```

### Pattern (Custom Regex)

Custom tokenization using regex patterns.

**✅ Supports `preserve_patterns`**

```ruby

TokenKit.configure do |config|

  config.strategy = :pattern

  config.regex = /[\w-]+/  # Keep words and hyphens

  config.lowercase = true

end

TokenKit.tokenize("anti-CD3 antibody")

# => ["anti-cd3", "antibody"]

```

### Sentence

Splits text into sentences using Unicode sentence boundaries.

**✅ Supports `preserve_patterns`** (preserves patterns within each sentence)

```ruby

TokenKit.configure do |config|

  config.strategy = :sentence

  config.lowercase = false

end

TokenKit.tokenize("Hello world! How are you? I am fine.")

# => ["Hello world! ", "How are you? ", "I am fine."]

```

Useful for document-level processing, sentence embeddings, or paragraph analysis.

### Grapheme

Splits text into grapheme clusters (user-perceived characters).

**⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)

```ruby

TokenKit.configure do |config|

  config.strategy = :grapheme

  config.grapheme_extended = true  # Use extended grapheme clusters (default)

  config.lowercase = false

end

TokenKit.tokenize("👨‍👩‍👧‍👦café")

# => ["👨‍👩‍👧‍👦", "c", "a", "f", "é"]

```

Perfect for handling emoji, combining characters, and complex scripts. Set `grapheme_extended = false` for legacy grapheme boundaries.

### Keyword

Treats entire input as a single token (no splitting).

**⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)

```ruby

TokenKit.configure do |config|

  config.strategy = :keyword

  config.lowercase = false

end

TokenKit.tokenize("PROD-2024-ABC-001")

# => ["PROD-2024-ABC-001"]

```

Ideal for exact matching of SKUs, IDs, product codes, or category names where splitting would lose meaning.

### Edge N-gram (Search-as-you-type)

Generates prefixes from the beginning of words for autocomplete functionality.

**⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)

```ruby

TokenKit.configure do |config|

  config.strategy = :edge_ngram

  config.min_gram = 2        # Minimum prefix length

  config.max_gram = 10       # Maximum prefix length

  config.lowercase = true

end

TokenKit.tokenize("laptop")

# => ["la", "lap", "lapt", "lapto", "laptop"]

```

Essential for autocomplete, type-ahead search, and prefix matching. At index time, generate edge n-grams of your product names or search terms.

### N-gram (Fuzzy Matching)

Generates all substring n-grams (sliding window) for fuzzy matching and misspelling tolerance.

**⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)

```ruby

TokenKit.configure do |config|

  config.strategy = :ngram

  config.min_gram = 2        # Minimum n-gram length

  config.max_gram = 3        # Maximum n-gram length

  config.lowercase = true

end

TokenKit.tokenize("quick")

# => ["qu", "ui", "ic", "ck", "qui", "uic", "ick"]

```

Perfect for fuzzy search, typo tolerance, and partial matching. Unlike edge n-grams which only generate prefixes, n-grams generate all possible substrings.

### Path Hierarchy (Hierarchical Navigation)

Creates tokens for each level of a path hierarchy.

**⚠️ Partially supports `preserve_patterns`** (has limitations with hierarchical structure)

```ruby

TokenKit.configure do |config|

  config.strategy = :path_hierarchy

  config.delimiter = "/"     # Use "\\" for Windows paths

  config.lowercase = false

end

TokenKit.tokenize("/usr/local/bin/ruby")

# => ["/usr", "/usr/local", "/usr/local/bin", "/usr/local/bin/ruby"]

# Works for category hierarchies too

TokenKit.tokenize("electronics/computers/laptops")

# => ["electronics", "electronics/computers", "electronics/computers/laptops"]

```

Perfect for filesystem paths, URL structures, category hierarchies, and breadcrumb navigation.

### URL/Email-Aware (Web Content)

Preserves URLs and email addresses as single tokens while tokenizing surrounding text.

**✅ Supports `preserve_patterns`** (preserves patterns alongside URLs/emails)

```ruby

TokenKit.configure do |config|

  config.strategy = :url_email

  config.lowercase = true

end

TokenKit.tokenize("Contact support@example.com or visit https://example.com")

# => ["contact", "support@example.com", "or", "visit", "https://example.com"]

```

Essential for user-generated content, customer support messages, product descriptions with links, and social media text.

### Character Group (Fast Custom Splitting)

Splits text based on a custom set of characters (faster than regex for simple delimiters).

**⚠️ Partially supports `preserve_patterns`** (works best with whitespace delimiters; non-whitespace delimiters may have issues)

```ruby

TokenKit.configure do |config|

  config.strategy = :char_group

  config.split_on_chars = ",;"  # Split on commas and semicolons

  config.lowercase = false

end

TokenKit.tokenize("apple,banana;cherry")

# => ["apple", "banana", "cherry"]

# CSV parsing

TokenKit.tokenize("John Doe,30,Software Engineer")

# => ["John Doe", "30", "Software Engineer"]

```

Ideal for structured data (CSV, TSV), log parsing, and custom delimiter-based formats. Default split characters are ` \t\n\r` (whitespace).

### Letter (Language-Agnostic)

Splits on any non-letter character (simpler than Unicode tokenizer, no special handling for contractions).

**✅ Supports `preserve_patterns`**

```ruby

TokenKit.configure do |config|

  config.strategy = :letter

  config.lowercase = true

end

TokenKit.tokenize("hello-world123test")

# => ["hello", "world", "test"]

# Handles multiple scripts

TokenKit.tokenize("Hello-世界-test")

# => ["hello", "世界", "test"]

```

Great for noisy text, mixed scripts, and cases where you want aggressive splitting on any non-letter character.

### Lowercase (Efficient Case Normalization)

Like the Letter tokenizer but always lowercases in a single pass (more efficient than letter + lowercase filter).

**✅ Supports `preserve_patterns`** (preserved patterns maintain original case despite always lowercasing)

```ruby

TokenKit.configure do |config|

  config.strategy = :lowercase

  # Note: config.lowercase setting is ignored - this tokenizer ALWAYS lowercases

end

TokenKit.tokenize("HELLO-WORLD")

# => ["hello", "world"]

# Case-insensitive search indexing

TokenKit.tokenize("User-Agent: Mozilla/5.0")

# => ["user", "agent", "mozilla"]

```

**⚠️ Important**: The `:lowercase` strategy **always** lowercases text, regardless of the `config.lowercase` setting. If you need control over lowercasing, use the `:letter` strategy instead with `config.lowercase = true/false`.

Perfect for case-insensitive search indexing, normalizing product codes, and cleaning social media text. Handles Unicode correctly, including characters that lowercase to multiple characters (e.g., Turkish İ).

## Pattern Preservation

Preserve domain-specific terms even when lowercasing.

**Fully Supported by:** Unicode, Pattern, Whitespace, Letter, Lowercase, Sentence, and URL/Email tokenizers.

**Partially Supported by:** Character Group (works best with whitespace delimiters) and Path Hierarchy (limitations with hierarchical structure) tokenizers.

**Not Supported by:** Grapheme, Keyword, Edge N-gram, and N-gram tokenizers.

```ruby

TokenKit.configure do |config|

  config.strategy = :unicode

  config.lowercase = true

  config.preserve_patterns = [

    /\d+(ug|mg|ml|units)/i,  # Measurements: 100ug, 50mg

    /anti-cd\d+/i,            # Antibodies: Anti-CD3, anti-CD28

    /[A-Z][A-Z0-9]+/          # Gene names: BRCA1, TP53, EGFR

  ]

end

text = "Patient received 100ug Anti-CD3 with BRCA1 mutation"

tokens = TokenKit.tokenize(text)

# => ["patient", "received", "100ug", "Anti-CD3", "with", "BRCA1", "mutation"]

```

Pattern matches maintain their original case despite `lowercase=true`.

### Regex Flags

TokenKit supports Ruby regex flags for both `preserve_patterns` and the `:pattern` strategy:

```ruby

# Case-insensitive matching (i flag)

TokenKit.configure do |config|

  config.preserve_patterns = [/gene-\d+/i]

end

TokenKit.tokenize("Found GENE-123 and gene-456")

# => ["found", "GENE-123", "and", "gene-456"]

# Multiline mode (m flag) - dot matches newlines

TokenKit.configure do |config|

  config.strategy = :pattern

  config.regex = /test./m

end

# Extended mode (x flag) - allows comments and whitespace

pattern = /

  \w+       # word characters

  @         # at sign

  \w+\.\w+  # domain.tld

/x

TokenKit.configure do |config|

  config.preserve_patterns = [pattern]

end

# Combine flags

TokenKit.configure do |config|

  config.preserve_patterns = [/code-\d+/im]  # case-insensitive + multiline

end

```

Supported flags:

- `i` - Case-insensitive matching

- `m` - Multiline mode (`.` matches newlines)

- `x` - Extended mode (ignore whitespace, allow comments)

Flags work with both Regexp objects and string patterns passed to `:pattern` strategy.

## Configuration

### Global Configuration

```ruby

TokenKit.configure do |config|

  config.strategy = :unicode              # :whitespace, :unicode, :pattern, :sentence, :grapheme, :keyword, :edge_ngram, :ngram, :path_hierarchy, :url_email, :char_group, :letter, :lowercase

  config.lowercase = true                 # Normalize to lowercase

  config.remove_punctuation = false       # Remove punctuation from tokens

  config.preserve_patterns = []           # Regex patterns to preserve

  # Strategy-specific options

  config.regex = /\w+/                    # Only for :pattern strategy

  config.grapheme_extended = true         # Only for :grapheme strategy (default: true)

  config.min_gram = 2                     # For :edge_ngram and :ngram strategies (default: 2)

  config.max_gram = 10                    # For :edge_ngram and :ngram strategies (default: 10)

  config.delimiter = "/"                  # Only for :path_hierarchy strategy (default: "/")

  config.split_on_chars = " \t\n\r"       # Only for :char_group strategy (default: whitespace)

end

```

### Per-Call Options

Override global config for specific calls:

```ruby

# Override general options

TokenKit.tokenize("BRCA1 Gene", lowercase: false)

# => ["BRCA1", "Gene"]

# Override strategy-specific options

TokenKit.tokenize("laptop", strategy: :edge_ngram, min_gram: 3, max_gram: 5)

# => ["lap", "lapt", "lapto"]

TokenKit.tokenize("C:\\Windows\\System", strategy: :path_hierarchy, delimiter: "\\")

# => ["C:", "C:\\Windows", "C:\\Windows\\System"]

# Combine multiple overrides

TokenKit.tokenize(

  "TEST",

  strategy: :edge_ngram,

  min_gram: 2,

  max_gram: 3,

  lowercase: false

)

# => ["TE", "TES"]

```

All strategy-specific options can be overridden per-call:

- `:pattern` - `regex: /pattern/`

- `:grapheme` - `extended: true/false`

- `:edge_ngram` - `min_gram: n, max_gram: n`

- `:ngram` - `min_gram: n, max_gram: n`

- `:path_hierarchy` - `delimiter: "/"`

- `:char_group` - `split_on_chars: ",;"`

### Get Current Config

```ruby

config = TokenKit.config_hash

# Returns a Configuration object with accessor methods

config.strategy           # => :unicode

config.lowercase          # => true

config.remove_punctuation # => false

config.preserve_patterns  # => [...]

# Strategy predicates

config.edge_ngram?        # => false

config.ngram?             # => false

config.pattern?           # => false

config.grapheme?          # => false

config.path_hierarchy?    # => false

config.char_group?        # => false

config.letter?            # => false

config.lowercase?         # => false

# Strategy-specific accessors

config.min_gram           # => 2 (for edge_ngram and ngram)

config.max_gram           # => 10 (for edge_ngram and ngram)

config.delimiter          # => "/" (for path_hierarchy)

config.split_on_chars     # => " \t\n\r" (for char_group)

config.extended           # => true (for grapheme)

config.regex              # => "..." (for pattern)

# Convert to hash if needed

config.to_h

# => {"strategy" => "unicode", "lowercase" => true, ...}

```

### Reset to Defaults

```ruby

TokenKit.reset

```

## Use Cases

### Biotech/Life Sciences

```ruby

TokenKit.configure do |config|

  config.strategy = :unicode

  config.lowercase = true

  config.preserve_patterns = [

    /\d+(ug|mg|ml|ul|units)/i,  # Measurements

    /anti-[a-z0-9-]+/i,          # Antibodies

    /[A-Z]{2,10}/,               # Gene names (CDK10, BRCA1, TP53)

    /cd\d+/i,                    # Cell markers (CD3, CD4, CD8)

    /ig[gmaed]/i                 # Immunoglobulins (IgG, IgM)

  ]

end

text = "Anti-CD3 IgG antibody 100ug for BRCA1 research"

tokens = TokenKit.tokenize(text)

# => ["Anti-CD3", "IgG", "antibody", "100ug", "for", "BRCA1", "research"]

```

### E-commerce/Catalogs

```ruby

TokenKit.configure do |config|

  config.strategy = :unicode

  config.lowercase = true

  config.preserve_patterns = [

    /\$\d+(\.\d{2})?/,          # Prices: $99.99

    /\d+(-\d+)+/,               # SKUs: 123-456-789

    /\d+(mm|cm|inch)/i          # Dimensions: 10mm, 5cm

  ]

end

text = "Widget $49.99 SKU: 123-456 size: 10cm"

tokens = TokenKit.tokenize(text)

# => ["widget", "$49.99", "sku", "123-456", "size", "10cm"]

```

### Search Applications

```ruby

# Exact matching with case normalization

TokenKit.configure do |config|

  config.strategy = :lowercase

  config.lowercase = true

end

# Index time: normalize documents

doc_tokens = TokenKit.tokenize("Product Code: ABC-123")

# => ["product", "code", "abc"]

# Query time: normalize user input

query_tokens = TokenKit.tokenize("product abc")

# => ["product", "abc"]

# Fuzzy matching with n-grams

TokenKit.configure do |config|

  config.strategy = :ngram

  config.min_gram = 2

  config.max_gram = 4

  config.lowercase = true

end

# Index time: generate n-grams

TokenKit.tokenize("search")

# => ["se", "ea", "ar", "rc", "ch", "sea", "ear", "arc", "rch", "sear", "earc", "arch"]

# Query time: typo "serch" still has significant overlap

TokenKit.tokenize("serch")

# => ["se", "er", "rc", "ch", "ser", "erc", "rch", "serc", "erch"]

# Overlap: ["se", "rc", "ch", "rch"] allows matching despite typo

# Autocomplete with edge n-grams

TokenKit.configure do |config|

  config.strategy = :edge_ngram

  config.min_gram = 2

  config.max_gram = 10

end

TokenKit.tokenize("laptop")

# => ["la", "lap", "lapt", "lapto", "laptop"]

# Matches "la", "lap", "lapt" as user types

```

## Performance

TokenKit has been extensively optimized for production use:

- **Unicode tokenization**: ~870K tokens/sec (baseline)

- **Pattern preservation**: ~410K tokens/sec with 4 patterns (was 3.6K/sec before v0.3.0 optimizations)

- **Memory efficient**: Pre-allocated buffers and in-place operations

- **Thread-safe**: Cached instances with mutex protection, safe for concurrent use

- **110x speedup**: For pattern-heavy workloads through intelligent caching

Key optimizations:

- Regex patterns compiled once and cached (not per-tokenization)

- String allocations minimized through index-based operations

- Tokenizer instances reused across calls

- In-place post-processing for lowercase and punctuation removal

See the [Performance Guide](docs/PERFORMANCE.md) for detailed benchmarks and optimization techniques.

## Integration

TokenKit is designed to work with other gems in the scientist-labs ecosystem:

- **PhraseKit**: Use TokenKit for consistent phrase extraction

- **SpellKit**: Tokenize before spell correction

- **red-candle**: Tokenize before NER/embeddings

## Documentation

- [API Documentation](https://rubydoc.info/gems/tokenkit) - Full API reference

- [Architecture Guide](docs/ARCHITECTURE.md) - Internal design and structure

- [Performance Guide](docs/PERFORMANCE.md) - Benchmarks and optimization details

### Generating Documentation Locally

```bash

# Install documentation dependencies

bundle install

# Generate YARD documentation

bundle exec yard doc

# Open documentation in browser

open doc/index.html

```

## Development

```bash

# Setup

bundle install

bundle exec rake compile

# Run tests

bundle exec rspec

# Run tests with coverage

COVERAGE=true bundle exec rspec

# Run linter

bundle exec standardrb

# Run benchmarks

ruby benchmarks/tokenizer_benchmark.rb

# Build gem

gem build tokenkit.gemspec

```

## Requirements

- Ruby >= 3.1.0

- Rust toolchain (for building from source)

## License

MIT License. See [LICENSE.txt](LICENSE.txt) for details.

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/tokenkit.

This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](CODE_OF_CONDUCT.md).

## Credits

Built with:

- [Magnus](https://github.com/matsadler/magnus) for Ruby-Rust bindings

- [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) for Unicode word boundaries

- [linkify](https://github.com/robinst/linkify) for robust URL and email detection

- [regex](https://github.com/rust-lang/regex) for pattern matching
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/scientist-labs/tokenkit

Awesome Lists containing this project

README