{"id":31629461,"url":"https://github.com/scientist-labs/tokenkit","last_synced_at":"2025-10-06T21:03:39.597Z","repository":{"id":317052760,"uuid":"1065808154","full_name":"scientist-labs/tokenkit","owner":"scientist-labs","description":"Fast, Rust-backed word-level tokenization for Ruby. Unlike subword tokenizers (BPE, WordPiece) designed for LLMs, TokenKit provides linguistic tokenization for search engines, text mining, and NLP   pipelines—preserving domain-specific patterns like gene names, measurements, and technical terms while handling Unicode correctly.","archived":false,"fork":false,"pushed_at":"2025-09-29T20:13:21.000Z","size":1022,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-05T18:50:42.670Z","etag":null,"topics":["nlp","ruby","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scientist-labs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-28T13:23:56.000Z","updated_at":"2025-09-29T20:13:25.000Z","dependencies_parsed_at":"2025-09-28T15:43:11.830Z","dependency_job_id":null,"html_url":"https://github.com/scientist-labs/tokenkit","commit_stats":null,"previous_names":["scientist-labs/tokenkit"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/scientist-labs/tokenkit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/scientist-labs%2Ftokenkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scientist-labs%2Ftokenkit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scientist-labs%2Ftokenkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scientist-labs%2Ftokenkit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scientist-labs","download_url":"https://codeload.github.com/scientist-labs/tokenkit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scientist-labs%2Ftokenkit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278679347,"owners_count":26027054,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-06T02:00:05.630Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","ruby","tokenizer"],"created_at":"2025-10-06T21:01:48.458Z","updated_at":"2025-10-06T21:03:39.591Z","avatar_url":"https://github.com/scientist-labs.png","language":"Ruby","funding_links":[],"categories":["Ruby"],"sub_categories":[],"readme":"\u003cimg src=\"/docs/assets/tokenkit-wide.png\" alt=\"tokenkit\" height=\"120px\"\u003e\n\nFast, Rust-backed word-level tokenization for Ruby with pattern preservation.\n\nTokenKit is a Ruby wrapper around Rust's 
[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) crate, providing lightweight, Unicode-aware tokenization designed for NLP pipelines, search applications, and text processing where you need consistent, high-quality word segmentation.\n\n## Quickstart\n\n```ruby\n# Install the gem\ngem install tokenkit\n\n# Or add to your Gemfile\ngem 'tokenkit'\n```\n\n```ruby\nrequire 'tokenkit'\n\n# Basic tokenization - handles Unicode, contractions, accents\nTokenKit.tokenize(\"Hello, world! café can't\")\n# =\u003e [\"hello\", \"world\", \"café\", \"can't\"]\n\n# Preserve domain-specific terms even when lowercasing\nTokenKit.configure do |config|\n  config.lowercase = true\n  config.preserve_patterns = [\n    /\\d+ug/i,           # Measurements: 100ug\n    /[A-Z][A-Z0-9]+/    # Gene names: BRCA1, TP53\n  ]\nend\n\nTokenKit.tokenize(\"Patient received 100ug for BRCA1 study\")\n# =\u003e [\"patient\", \"received\", \"100ug\", \"for\", \"BRCA1\", \"study\"]\n```\n\n## Features\n\n- **Thirteen tokenization strategies**: whitespace, unicode (recommended), custom regex patterns, sentence, grapheme, keyword, edge n-gram, n-gram, path hierarchy, URL/email-aware, character group, letter, and lowercase\n- **Pattern preservation**: Keep domain-specific terms (gene names, measurements, antibodies) intact even with case normalization\n- **Fast**: Rust-backed implementation (~100K docs/sec)\n- **Thread-safe**: Safe for concurrent use\n- **Simple API**: Configure once, use everywhere\n- **Zero dependencies**: Pure Ruby API with Rust extension\n\n## Tokenization Strategies\n\n### Unicode (Recommended)\n\nUses Unicode word segmentation for proper handling of contractions, accents, and multi-language text.\n\n**✅ Supports `preserve_patterns`**\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :unicode\n  config.lowercase = true\nend\n\nTokenKit.tokenize(\"Don't worry about café!\")\n# =\u003e [\"don't\", \"worry\", \"about\", \"café\"]\n```\n\n### 
Whitespace\n\nSimple whitespace splitting.\n\n**✅ Supports `preserve_patterns`**\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :whitespace\n  config.lowercase = true\nend\n\nTokenKit.tokenize(\"hello world\")\n# =\u003e [\"hello\", \"world\"]\n```\n\n### Pattern (Custom Regex)\n\nCustom tokenization using regex patterns.\n\n**✅ Supports `preserve_patterns`**\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :pattern\n  config.regex = /[\\w-]+/  # Keep words and hyphens\n  config.lowercase = true\nend\n\nTokenKit.tokenize(\"anti-CD3 antibody\")\n# =\u003e [\"anti-cd3\", \"antibody\"]\n```\n\n### Sentence\n\nSplits text into sentences using Unicode sentence boundaries.\n\n**✅ Supports `preserve_patterns`** (preserves patterns within each sentence)\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :sentence\n  config.lowercase = false\nend\n\nTokenKit.tokenize(\"Hello world! How are you? I am fine.\")\n# =\u003e [\"Hello world! \", \"How are you? \", \"I am fine.\"]\n```\n\nUseful for document-level processing, sentence embeddings, or paragraph analysis.\n\n### Grapheme\n\nSplits text into grapheme clusters (user-perceived characters).\n\n**⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :grapheme\n  config.grapheme_extended = true  # Use extended grapheme clusters (default)\n  config.lowercase = false\nend\n\nTokenKit.tokenize(\"👨‍👩‍👧‍👦café\")\n# =\u003e [\"👨‍👩‍👧‍👦\", \"c\", \"a\", \"f\", \"é\"]\n```\n\nPerfect for handling emoji, combining characters, and complex scripts. 
Set `grapheme_extended = false` for legacy grapheme boundaries.\n\n### Keyword\n\nTreats entire input as a single token (no splitting).\n\n**⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :keyword\n  config.lowercase = false\nend\n\nTokenKit.tokenize(\"PROD-2024-ABC-001\")\n# =\u003e [\"PROD-2024-ABC-001\"]\n```\n\nIdeal for exact matching of SKUs, IDs, product codes, or category names where splitting would lose meaning.\n\n### Edge N-gram (Search-as-you-type)\n\nGenerates prefixes from the beginning of words for autocomplete functionality.\n\n**⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :edge_ngram\n  config.min_gram = 2        # Minimum prefix length\n  config.max_gram = 10       # Maximum prefix length\n  config.lowercase = true\nend\n\nTokenKit.tokenize(\"laptop\")\n# =\u003e [\"la\", \"lap\", \"lapt\", \"lapto\", \"laptop\"]\n```\n\nEssential for autocomplete, type-ahead search, and prefix matching. At index time, generate edge n-grams of your product names or search terms.\n\n### N-gram (Fuzzy Matching)\n\nGenerates all substring n-grams (sliding window) for fuzzy matching and misspelling tolerance.\n\n**⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :ngram\n  config.min_gram = 2        # Minimum n-gram length\n  config.max_gram = 3        # Maximum n-gram length\n  config.lowercase = true\nend\n\nTokenKit.tokenize(\"quick\")\n# =\u003e [\"qu\", \"ui\", \"ic\", \"ck\", \"qui\", \"uic\", \"ick\"]\n```\n\nPerfect for fuzzy search, typo tolerance, and partial matching. 
Unlike edge n-grams, which only generate prefixes, n-grams generate every substring whose length falls between `min_gram` and `max_gram`.\n\n### Path Hierarchy (Hierarchical Navigation)\n\nCreates tokens for each level of a path hierarchy.\n\n**⚠️ Partially supports `preserve_patterns`** (has limitations with hierarchical structure)\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :path_hierarchy\n  config.delimiter = \"/\"     # Use \"\\\\\" for Windows paths\n  config.lowercase = false\nend\n\nTokenKit.tokenize(\"/usr/local/bin/ruby\")\n# =\u003e [\"/usr\", \"/usr/local\", \"/usr/local/bin\", \"/usr/local/bin/ruby\"]\n\n# Works for category hierarchies too\nTokenKit.tokenize(\"electronics/computers/laptops\")\n# =\u003e [\"electronics\", \"electronics/computers\", \"electronics/computers/laptops\"]\n```\n\nPerfect for filesystem paths, URL structures, category hierarchies, and breadcrumb navigation.\n\n### URL/Email-Aware (Web Content)\n\nPreserves URLs and email addresses as single tokens while tokenizing surrounding text.\n\n**✅ Supports `preserve_patterns`** (preserves patterns alongside URLs/emails)\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :url_email\n  config.lowercase = true\nend\n\nTokenKit.tokenize(\"Contact support@example.com or visit https://example.com\")\n# =\u003e [\"contact\", \"support@example.com\", \"or\", \"visit\", \"https://example.com\"]\n```\n\nEssential for user-generated content, customer support messages, product descriptions with links, and social media text.\n\n### Character Group (Fast Custom Splitting)\n\nSplits text based on a custom set of characters (faster than regex for simple delimiters).\n\n**⚠️ Partially supports `preserve_patterns`** (works best with whitespace delimiters; non-whitespace delimiters may have issues)\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :char_group\n  config.split_on_chars = \",;\"  # Split on commas and semicolons\n  config.lowercase = 
false\nend\n\nTokenKit.tokenize(\"apple,banana;cherry\")\n# =\u003e [\"apple\", \"banana\", \"cherry\"]\n\n# CSV parsing\nTokenKit.tokenize(\"John Doe,30,Software Engineer\")\n# =\u003e [\"John Doe\", \"30\", \"Software Engineer\"]\n```\n\nIdeal for structured data (CSV, TSV), log parsing, and custom delimiter-based formats. Default split characters are ` \\t\\n\\r` (whitespace).\n\n### Letter (Language-Agnostic)\n\nSplits on any non-letter character (simpler than Unicode tokenizer, no special handling for contractions).\n\n**✅ Supports `preserve_patterns`**\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :letter\n  config.lowercase = true\nend\n\nTokenKit.tokenize(\"hello-world123test\")\n# =\u003e [\"hello\", \"world\", \"test\"]\n\n# Handles multiple scripts\nTokenKit.tokenize(\"Hello-世界-test\")\n# =\u003e [\"hello\", \"世界\", \"test\"]\n```\n\nGreat for noisy text, mixed scripts, and cases where you want aggressive splitting on any non-letter character.\n\n### Lowercase (Efficient Case Normalization)\n\nLike the Letter tokenizer but always lowercases in a single pass (more efficient than letter + lowercase filter).\n\n**✅ Supports `preserve_patterns`** (preserved patterns maintain original case despite always lowercasing)\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :lowercase\n  # Note: config.lowercase setting is ignored - this tokenizer ALWAYS lowercases\nend\n\nTokenKit.tokenize(\"HELLO-WORLD\")\n# =\u003e [\"hello\", \"world\"]\n\n# Case-insensitive search indexing\nTokenKit.tokenize(\"User-Agent: Mozilla/5.0\")\n# =\u003e [\"user\", \"agent\", \"mozilla\"]\n```\n\n**⚠️ Important**: The `:lowercase` strategy **always** lowercases text, regardless of the `config.lowercase` setting. If you need control over lowercasing, use the `:letter` strategy instead with `config.lowercase = true/false`.\n\nPerfect for case-insensitive search indexing, normalizing product codes, and cleaning social media text. 
Handles Unicode correctly, including characters that lowercase to multiple characters (e.g., Turkish İ).\n\n## Pattern Preservation\n\nPreserve domain-specific terms even when lowercasing.\n\n**Fully Supported by:** Unicode, Pattern, Whitespace, Letter, Lowercase, Sentence, and URL/Email tokenizers.\n\n**Partially Supported by:** Character Group (works best with whitespace delimiters) and Path Hierarchy (limitations with hierarchical structure) tokenizers.\n\n**Not Supported by:** Grapheme, Keyword, Edge N-gram, and N-gram tokenizers.\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :unicode\n  config.lowercase = true\n  config.preserve_patterns = [\n    /\\d+(ug|mg|ml|units)/i,  # Measurements: 100ug, 50mg\n    /anti-cd\\d+/i,            # Antibodies: Anti-CD3, anti-CD28\n    /[A-Z][A-Z0-9]+/          # Gene names: BRCA1, TP53, EGFR\n  ]\nend\n\ntext = \"Patient received 100ug Anti-CD3 with BRCA1 mutation\"\ntokens = TokenKit.tokenize(text)\n# =\u003e [\"patient\", \"received\", \"100ug\", \"Anti-CD3\", \"with\", \"BRCA1\", \"mutation\"]\n```\n\nPattern matches maintain their original case despite `lowercase=true`.\n\n### Regex Flags\n\nTokenKit supports Ruby regex flags for both `preserve_patterns` and the `:pattern` strategy:\n\n```ruby\n# Case-insensitive matching (i flag)\nTokenKit.configure do |config|\n  config.preserve_patterns = [/gene-\\d+/i]\nend\n\nTokenKit.tokenize(\"Found GENE-123 and gene-456\")\n# =\u003e [\"found\", \"GENE-123\", \"and\", \"gene-456\"]\n\n# Multiline mode (m flag) - dot matches newlines\nTokenKit.configure do |config|\n  config.strategy = :pattern\n  config.regex = /test./m\nend\n\n# Extended mode (x flag) - allows comments and whitespace\npattern = /\n  \\w+       # word characters\n  @         # at sign\n  \\w+\\.\\w+  # domain.tld\n/x\n\nTokenKit.configure do |config|\n  config.preserve_patterns = [pattern]\nend\n\n# Combine flags\nTokenKit.configure do |config|\n  config.preserve_patterns = [/code-\\d+/im]  # 
case-insensitive + multiline\nend\n```\n\nSupported flags:\n- `i` - Case-insensitive matching\n- `m` - Multiline mode (`.` matches newlines)\n- `x` - Extended mode (ignore whitespace, allow comments)\n\nFlags work with both Regexp objects and string patterns passed to the `:pattern` strategy.\n\n## Configuration\n\n### Global Configuration\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :unicode              # :whitespace, :unicode, :pattern, :sentence, :grapheme, :keyword, :edge_ngram, :ngram, :path_hierarchy, :url_email, :char_group, :letter, :lowercase\n  config.lowercase = true                 # Normalize to lowercase\n  config.remove_punctuation = false       # Remove punctuation from tokens\n  config.preserve_patterns = []           # Regex patterns to preserve\n\n  # Strategy-specific options\n  config.regex = /\\w+/                    # Only for :pattern strategy\n  config.grapheme_extended = true         # Only for :grapheme strategy (default: true)\n  config.min_gram = 2                     # For :edge_ngram and :ngram strategies (default: 2)\n  config.max_gram = 10                    # For :edge_ngram and :ngram strategies (default: 10)\n  config.delimiter = \"/\"                  # Only for :path_hierarchy strategy (default: \"/\")\n  config.split_on_chars = \" \\t\\n\\r\"       # Only for :char_group strategy (default: whitespace)\nend\n```\n\n### Per-Call Options\n\nOverride global config for specific calls:\n\n```ruby\n# Override general options\nTokenKit.tokenize(\"BRCA1 Gene\", lowercase: false)\n# =\u003e [\"BRCA1\", \"Gene\"]\n\n# Override strategy-specific options\nTokenKit.tokenize(\"laptop\", strategy: :edge_ngram, min_gram: 3, max_gram: 5)\n# =\u003e [\"lap\", \"lapt\", \"lapto\"]\n\nTokenKit.tokenize(\"C:\\\\Windows\\\\System\", strategy: :path_hierarchy, delimiter: \"\\\\\")\n# =\u003e [\"C:\", \"C:\\\\Windows\", \"C:\\\\Windows\\\\System\"]\n\n# Combine multiple overrides\nTokenKit.tokenize(\n  \"TEST\",\n  strategy: 
:edge_ngram,\n  min_gram: 2,\n  max_gram: 3,\n  lowercase: false\n)\n# =\u003e [\"TE\", \"TES\"]\n```\n\nAll strategy-specific options can be overridden per-call:\n- `:pattern` - `regex: /pattern/`\n- `:grapheme` - `extended: true/false`\n- `:edge_ngram` - `min_gram: n, max_gram: n`\n- `:ngram` - `min_gram: n, max_gram: n`\n- `:path_hierarchy` - `delimiter: \"/\"`\n- `:char_group` - `split_on_chars: \",;\"`\n\n### Get Current Config\n\n```ruby\nconfig = TokenKit.config_hash\n# Returns a Configuration object with accessor methods\n\nconfig.strategy           # =\u003e :unicode\nconfig.lowercase          # =\u003e true\nconfig.remove_punctuation # =\u003e false\nconfig.preserve_patterns  # =\u003e [...]\n\n# Strategy predicates\nconfig.edge_ngram?        # =\u003e false\nconfig.ngram?             # =\u003e false\nconfig.pattern?           # =\u003e false\nconfig.grapheme?          # =\u003e false\nconfig.path_hierarchy?    # =\u003e false\nconfig.char_group?        # =\u003e false\nconfig.letter?            # =\u003e false\nconfig.lowercase?         
# =\u003e false\n\n# Strategy-specific accessors\nconfig.min_gram           # =\u003e 2 (for edge_ngram and ngram)\nconfig.max_gram           # =\u003e 10 (for edge_ngram and ngram)\nconfig.delimiter          # =\u003e \"/\" (for path_hierarchy)\nconfig.split_on_chars     # =\u003e \" \\t\\n\\r\" (for char_group)\nconfig.extended           # =\u003e true (for grapheme)\nconfig.regex              # =\u003e \"...\" (for pattern)\n\n# Convert to hash if needed\nconfig.to_h\n# =\u003e {\"strategy\" =\u003e \"unicode\", \"lowercase\" =\u003e true, ...}\n```\n\n### Reset to Defaults\n\n```ruby\nTokenKit.reset\n```\n\n## Use Cases\n\n### Biotech/Life Sciences\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :unicode\n  config.lowercase = true\n  config.preserve_patterns = [\n    /\\d+(ug|mg|ml|ul|units)/i,  # Measurements\n    /anti-[a-z0-9-]+/i,          # Antibodies\n    /[A-Z][A-Z0-9]+/,            # Gene names (CDK10, BRCA1, TP53)\n    /cd\\d+/i,                    # Cell markers (CD3, CD4, CD8)\n    /ig[gmaed]/i                 # Immunoglobulins (IgG, IgM)\n  ]\nend\n\ntext = \"Anti-CD3 IgG antibody 100ug for BRCA1 research\"\ntokens = TokenKit.tokenize(text)\n# =\u003e [\"Anti-CD3\", \"IgG\", \"antibody\", \"100ug\", \"for\", \"BRCA1\", \"research\"]\n```\n\n### E-commerce/Catalogs\n\n```ruby\nTokenKit.configure do |config|\n  config.strategy = :unicode\n  config.lowercase = true\n  config.preserve_patterns = [\n    /\\$\\d+(\\.\\d{2})?/,          # Prices: $99.99\n    /\\d+(-\\d+)+/,               # SKUs: 123-456-789\n    /\\d+(mm|cm|inch)/i          # Dimensions: 10mm, 5cm\n  ]\nend\n\ntext = \"Widget $49.99 SKU: 123-456 size: 10cm\"\ntokens = TokenKit.tokenize(text)\n# =\u003e [\"widget\", \"$49.99\", \"sku\", \"123-456\", \"size\", \"10cm\"]\n```\n\n### Search Applications\n\n```ruby\n# Exact matching with case normalization\nTokenKit.configure do |config|\n  config.strategy = :lowercase\n  config.lowercase = true  # Ignored here: the :lowercase strategy always lowercases\nend\n\n# Index time: normalize 
documents\ndoc_tokens = TokenKit.tokenize(\"Product Code: ABC-123\")\n# =\u003e [\"product\", \"code\", \"abc\"]\n\n# Query time: normalize user input\nquery_tokens = TokenKit.tokenize(\"product abc\")\n# =\u003e [\"product\", \"abc\"]\n\n# Fuzzy matching with n-grams\nTokenKit.configure do |config|\n  config.strategy = :ngram\n  config.min_gram = 2\n  config.max_gram = 4\n  config.lowercase = true\nend\n\n# Index time: generate n-grams\nTokenKit.tokenize(\"search\")\n# =\u003e [\"se\", \"ea\", \"ar\", \"rc\", \"ch\", \"sea\", \"ear\", \"arc\", \"rch\", \"sear\", \"earc\", \"arch\"]\n\n# Query time: typo \"serch\" still has significant overlap\nTokenKit.tokenize(\"serch\")\n# =\u003e [\"se\", \"er\", \"rc\", \"ch\", \"ser\", \"erc\", \"rch\", \"serc\", \"erch\"]\n# Overlap: [\"se\", \"rc\", \"ch\", \"rch\"] allows matching despite typo\n\n# Autocomplete with edge n-grams\nTokenKit.configure do |config|\n  config.strategy = :edge_ngram\n  config.min_gram = 2\n  config.max_gram = 10\nend\n\nTokenKit.tokenize(\"laptop\")\n# =\u003e [\"la\", \"lap\", \"lapt\", \"lapto\", \"laptop\"]\n# Matches \"la\", \"lap\", \"lapt\" as user types\n```\n\n## Performance\n\nTokenKit has been extensively optimized for production use:\n\n- **Unicode tokenization**: ~870K tokens/sec (baseline)\n- **Pattern preservation**: ~410K tokens/sec with 4 patterns (was 3.6K/sec before v0.3.0 optimizations)\n- **Memory efficient**: Pre-allocated buffers and in-place operations\n- **Thread-safe**: Cached instances with mutex protection, safe for concurrent use\n- **110x speedup**: For pattern-heavy workloads through intelligent caching\n\nKey optimizations:\n- Regex patterns compiled once and cached (not per-tokenization)\n- String allocations minimized through index-based operations\n- Tokenizer instances reused across calls\n- In-place post-processing for lowercase and punctuation removal\n\nSee the [Performance Guide](docs/PERFORMANCE.md) for detailed benchmarks and optimization techniques.\n\n## 
Integration\n\nTokenKit is designed to work with other gems in the scientist-labs ecosystem:\n\n- **PhraseKit**: Use TokenKit for consistent phrase extraction\n- **SpellKit**: Tokenize before spell correction\n- **red-candle**: Tokenize before NER/embeddings\n\n## Documentation\n\n- [API Documentation](https://rubydoc.info/gems/tokenkit) - Full API reference\n- [Architecture Guide](docs/ARCHITECTURE.md) - Internal design and structure\n- [Performance Guide](docs/PERFORMANCE.md) - Benchmarks and optimization details\n\n### Generating Documentation Locally\n\n```bash\n# Install documentation dependencies\nbundle install\n\n# Generate YARD documentation\nbundle exec yard doc\n\n# Open documentation in browser\nopen doc/index.html\n```\n\n## Development\n\n```bash\n# Setup\nbundle install\nbundle exec rake compile\n\n# Run tests\nbundle exec rspec\n\n# Run tests with coverage\nCOVERAGE=true bundle exec rspec\n\n# Run linter\nbundle exec standardrb\n\n# Run benchmarks\nruby benchmarks/tokenizer_benchmark.rb\n\n# Build gem\ngem build tokenkit.gemspec\n```\n\n## Requirements\n\n- Ruby \u003e= 3.1.0\n- Rust toolchain (for building from source)\n\n## License\n\nMIT License. 
See [LICENSE.txt](LICENSE.txt) for details.\n\n## Contributing\n\nBug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/tokenkit.\n\nThis project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](CODE_OF_CONDUCT.md).\n\n## Credits\n\nBuilt with:\n- [Magnus](https://github.com/matsadler/magnus) for Ruby-Rust bindings\n- [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) for Unicode word boundaries\n- [linkify](https://github.com/robinst/linkify) for robust URL and email detection\n- [regex](https://github.com/rust-lang/regex) for pattern matching\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscientist-labs%2Ftokenkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscientist-labs%2Ftokenkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscientist-labs%2Ftokenkit/lists"}