{"id":50474177,"url":"https://github.com/le0pard/tre_regex","last_synced_at":"2026-06-01T12:02:33.418Z","repository":{"id":354879275,"uuid":"1225750419","full_name":"le0pard/tre_regex","owner":"le0pard","description":null,"archived":false,"fork":false,"pushed_at":"2026-04-30T17:01:42.000Z","size":39,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-30T18:11:44.053Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/le0pard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-30T15:45:15.000Z","updated_at":"2026-04-30T17:02:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/le0pard/tre_regex","commit_stats":null,"previous_names":["le0pard/tre_regex"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/le0pard/tre_regex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/le0pard%2Ftre_regex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/le0pard%2Ftre_regex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/le0pard%2Ftre_regex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/le0pard%2Ftre_regex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/le0pard","download_url":
"https://codeload.github.com/le0pard/tre_regex/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/le0pard%2Ftre_regex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33773782,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-01T12:02:31.219Z","updated_at":"2026-06-01T12:02:33.412Z","avatar_url":"https://github.com/le0pard.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TreRegex [![Ruby Checks](https://github.com/le0pard/tre_regex/actions/workflows/main.yml/badge.svg)](https://github.com/le0pard/tre_regex/actions/workflows/main.yml)\n\n`TreRegex` provides a high-performance Ruby interface to the [TRE](https://github.com/laurikari/tre) C library using FFI. It brings robust approximate (fuzzy) regular expression matching to Ruby, featuring multi-byte Unicode string safety, and granular error limits\n\n## Why?\n\nStandard regular expressions are strictly exact. If you are searching text containing typos, OCR errors, or variations in spelling, standard `Regexp` will fail.\n\nWhile Ruby has built-in string distance metrics (like Levenshtein distance), they usually require comparing whole strings against other whole strings. 
`TreRegex` solves this by allowing you to search for a pattern *within* a larger body of text while permitting a configurable number of errors (insertions, deletions, and substitutions).\n\n## Features\n\n* **Approximate Matching**: Find matches even if the target string has missing, extra, or substituted characters.\n* **Granular Control**: Set strict limits on `max_errors`, or fine-tune by specific error types (`max_insertions`, `max_deletions`, `max_substitutions`).\n* **Multi-byte Unicode Safety**: Transparently maps underlying C byte-offsets back to native Ruby character indices (e.g., emojis won't break your offsets).\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n```ruby\ngem 'tre_regex'\n```\n\nAnd then execute:\n\n```bash\n$ bundle install\n```\n\nOr install it directly:\n\n```bash\n$ gem install tre_regex\n```\n\n## Usage\n\n### Basic Matching\n\nCreate a new `TreRegex::Regex` object and use `exec` or `test?` to search text\n\n```ruby\nrequire 'tre_regex'\n\nregex = TreRegex::Regex.new('apple', ignore_case: true)\n\n# Simple boolean check\nregex.test?('I ate an APPLE today')\n# =\u003e true\n\n# Get detailed match data\nresult = regex.exec('I ate an apple today')\n# =\u003e {\n#      :match =\u003e \"apple\",\n#      :submatches =\u003e [],\n#      :index =\u003e 9,\n#      :end_index =\u003e 14,\n#      :cost =\u003e 0,\n#      :errors =\u003e {:insertions=\u003e0, :deletions=\u003e0, :substitutions=\u003e0}\n#    }\n```\n\n### Fuzzy Matching\n\nYou can configure fuzziness by passing options directly to the `exec` method\n\n```ruby\nregex = TreRegex::Regex.new('apple')\n\n# Allow up to 1 error of any kind\nregex.exec('I ate an aple', max_errors: 1)\n# =\u003e {match: \"aple\", submatches: [], index: 9, end_index: 13, cost: 1, errors: {insertions: 0, deletions: 1, substitutions: 0}}\n\n# Allow substitutions, but explicitly forbid deletions\nregex.exec('I ate an aple', max_substitutions: 1, max_deletions: 0)\n# =\u003e 
nil\n```\n\n### Finding All Matches\n\nUse `match_all` to find every occurrence of a pattern in a string. It can take a block or return an `Enumerator`\n\n```ruby\nregex = TreRegex::Regex.new('cat')\n\n# Returns an array of match hashes\nregex.match_all('cat, cot, cut', max_errors: 1).to_a\n# =\u003e [\n#  {match: \"cat\", submatches: [], index: 0, end_index: 3, cost: 0, errors: {insertions: 0, deletions: 0, substitutions: 0}},\n#  {match: \"cot\", submatches: [], index: 5, end_index: 8, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},\n#  {match: \"cut\", submatches: [], index: 10, end_index: 13, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}}\n# ]\n```\n\n### Capture Groups (Submatches)\n\n`TreRegex` fully supports standard POSIX capture groups using parentheses `()`. Whenever a match is found, any captured data is returned as an array of strings under the `:submatches` key in the result hash.\n\nIf your pattern does not contain any capture groups, `:submatches` will simply return an empty array `[]`.\n\n```ruby\nregex = TreRegex::Regex.new('I love (ruby|python)')\nresult = regex.exec('I love ruby a lot')\n\n# The captured group is extracted exactly as it was matched\nresult[:submatches] # =\u003e [\"ruby\"]\n```\n\n#### Multiple and Optional Groups\n\nYou can define multiple capture groups, and they will be returned in the array in the exact order they appear in the pattern.\n\nIf you use an optional capture group `?` that does not end up matching anything in the target text, `TreRegex` will safely insert a `nil` in its place in the array to maintain the correct index order.\n\n```ruby\n# The first group (cat) is optional. 
The second group (dog) is required.\nregex = TreRegex::Regex.new('(cat)?(dog)')\n\nresult = regex.exec('dog')\n# =\u003e {match: \"dog\", submatches: [nil, \"dog\"], index: 0, end_index: 3, cost: 0, errors: {insertions: 0, deletions: 0, substitutions: 0}}\n```\n\n#### Fuzzy Capture Groups\n\nOne of the most powerful features of `TreRegex` is that capture groups respect your fuzzy matching rules! If a typo occurs *inside* a capture group, the `:submatches` array will return the actual typed text with the typo included.\n\n```ruby\nregex = TreRegex::Regex.new('I ate an (apple)')\n\n# We allow 1 error. The user typed 'aple' (1 deletion).\nresult = regex.exec('I ate an aple', max_errors: 1)\n\nresult[:submatches] # =\u003e [\"aple\"]\n```\n\n#### The 9-Group Limit\n\nFor memory safety and performance during FFI allocation, `TreRegex` allocates a strict maximum of 10 slots per match. Because the first slot is always reserved for the full regex match itself, the engine will only extract a maximum of **9 capture groups** per match.\n\nIf your pattern contains 10 or more capture groups `()`, the regex will still compile and match perfectly, but any captured groups beyond the 9th one will be safely ignored and omitted from the `:submatches` array.\n\n## Configuration Options\n\n`TreRegex` provides fine-grained control over how patterns are compiled and how fuzzy matching constraints are applied.\n\n### Initialization Options\n\nWhen creating a new `TreRegex::Regex` object, you can pass options to modify how the pattern is compiled:\n\n* **`ignore_case`** *(Boolean)*: If `true`, the regex will match characters regardless of their case (equivalent to the `/i` flag in standard Ruby regex). 
Default is `false`.\n\n```ruby\n# Fails because case doesn't match\nexact_regex = TreRegex::Regex.new('ruby')\nexact_regex.test?('RUBY') # =\u003e false\n\n# Succeeds using the ignore_case flag\ncase_regex = TreRegex::Regex.new('ruby', ignore_case: true)\ncase_regex.test?('RUBY') # =\u003e true\n```\n\n### Fuzzy Matching Options\n\nWhen calling `exec`, `test?`, or `match_all`, you can pass a hash of fuzzy matching options. If no options are provided, `TreRegex` forces an **exact match** (0 errors allowed).\n\n#### Error Limits\n\nThese options strictly limit the number of specific operations required to transform the pattern into the matched string.\n\n* **`max_errors`** *(Integer)*: The total maximum number of combined errors (insertions + deletions + substitutions) allowed for a match.\n* **`max_insertions`** *(Integer)*: The maximum number of extra characters allowed in the searched text. *(e.g., Pattern `cat` matching `cart` is 1 insertion)*.\n* **`max_deletions`** *(Integer)*: The maximum number of missing characters in the searched text. *(e.g., Pattern `cat` matching `ct` is 1 deletion)*.\n* **`max_substitutions`** *(Integer)*: The maximum number of swapped characters. 
*(e.g., Pattern `cat` matching `cot` is 1 substitution)*.\n\n\u003e **Note:** If you specify granular limits (like `max_deletions: 1`) but omit `max_errors`, the gem will automatically calculate the maximum allowed errors so you don't accidentally trigger an unlimited fuzzy search.\n\n```ruby\nregex = TreRegex::Regex.new('banana')\n\n# Allow up to 2 typos of any kind\nregex.exec('bananana', max_errors: 2) # =\u003e matches \"bananana\" (2 insertions)\nregex.exec('bnna', max_errors: 2)     # =\u003e matches \"bnna\" (2 deletions)\nregex.exec('bonona', max_errors: 2)   # =\u003e matches \"bonona\" (2 substitutions)\n\n# Another example\nregex = TreRegex::Regex.new('library')\n\n# Allow 1 deletion, but STRICTLY 0 substitutions and 0 insertions\nregex.exec('librry', max_deletions: 1, max_substitutions: 0, max_insertions: 0)\n# =\u003e matches \"librry\"\n\n# This fails because 'lubrary' requires a substitution, which we set to 0\nregex.exec('lubrary', max_deletions: 1, max_substitutions: 0, max_insertions: 0)\n# =\u003e nil\n```\n\n#### Cost and Weights\n\nInstead of hard limits, you can assign different \"costs\" to different types of errors. This is useful if you want to penalize certain typos more heavily than others.\n\n* **`max_cost`** *(Integer)*: The maximum total cost allowed for a match to be considered successful.\n* **`weight_insertion`** *(Integer)*: The cost penalty for each inserted character.\n* **`weight_deletion`** *(Integer)*: The cost penalty for each deleted character.\n* **`weight_substitution`** *(Integer)*: The cost penalty for each substituted character.\n\n```ruby\nregex = TreRegex::Regex.new('algorithm')\n\n# We allow a maximum cost of 2.\n# Missing/extra characters cost 1 point.\n# Wrong characters cost 3 points.\noptions = {\n  max_cost: 2,\n  weight_deletion: 1,\n  weight_insertion: 1,\n  weight_substitution: 3\n}\n\n# 'algoritm' has 1 deletion. Cost = 1. 
(Passes, 1 \u003c 2)\nregex.test?('algoritm', options) # =\u003e true\n\n# 'algorethm' has 1 substitution. Cost = 3. (Fails, 3 \u003e 2)\nregex.test?('algorethm', options) # =\u003e false\n```\n\n## Gotchas \u0026 Best Practices\n\n### The \"Empty Match\" Phenomenon\n\nBecause `TreRegex` relies on strict mathematical edit distances, you must be careful when setting `max_errors` to a value that is **greater than or equal to the length of your pattern**.\n\nIf you allow 3 errors on a 3-letter word, the engine considers *deleting all 3 characters* to be a valid mathematical match (cost = 3). This will result in an unexpected match against an empty string (`\"\"`).\n\n```ruby\nregex = TreRegex::Regex.new('cat')\n\n# We allow 3 errors on a 3-letter word.\n# The engine matches \"cow\" (2 substitutions)...\n# but it also matches \"\" at the end of the string (3 deletions)!\nregex.match_all('cot, cow', max_errors: 3).to_a\n# =\u003e [\n#  {match: \"cot\", submatches: [], index: 0, end_index: 3, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},\n#  {match: \"cow\", submatches: [], index: 5, end_index: 8, cost: 2, errors: {insertions: 0, deletions: 0, substitutions: 2}},\n#  {match: \"\", submatches: [], index: 8, end_index: 8, cost: 3, errors: {insertions: 0, deletions: 3, substitutions: 0}}\n# ]\n```\n\n**Best Practice**: if you need a high `max_errors` limit but want to prevent the engine from matching empty strings, explicitly cap the `max_deletions` option so that at least one character of your pattern must survive\n\n```ruby\n# Allow 3 total errors, but strictly forbid the engine from deleting more than 2 characters\nregex.match_all('cot, cow', max_errors: 3, max_deletions: 2).to_a\n# =\u003e [\n#  {match: \"cot\", submatches: [], index: 0, end_index: 3, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},\n#  {match: \"cow\", submatches: [], index: 5, end_index: 8, cost: 2, errors: {insertions: 0, deletions: 0, substitutions: 2}}\n# ] # 
The empty match is mathematically prevented\n```\n\n### POSIX vs. PCRE Syntax\n\nRuby’s built-in `Regexp` engine uses a PCRE-like syntax (Onigmo), which supports advanced features like lookaheads `(?=...)`, lookbehinds, and backreferences.\n\nThe underlying TRE C-library uses **POSIX Extended Regular Expressions (ERE)**. While it supports standard regex features (character classes `[a-z]`, quantifiers `*`, `+`, `?`, and grouping), it **does not** support Perl-specific extensions.\n\n```ruby\n# Valid TRE syntax\nTreRegex::Regex.new('(cat|dog)s?')\n\n# INVALID: Lookarounds are not supported by POSIX ERE\nTreRegex::Regex.new('cat(?=s)') # Failed to compile regex pattern: cat(?=s) (TreRegex::Error)\n```\n\n### The Performance Cost of Extreme Fuzziness\n\nFuzzy matching is inherently more computationally expensive than exact matching. The TRE algorithm scales based on the length of the string and the number of allowed errors.\n\nIf you are searching a massive block of text (like a whole book) and set `max_errors: 10`, the engine has to calculate an enormous number of branching possibilities.\n\n**Best Practice**: Keep your error limits tight and realistic. An error limit of 1 to 3 is usually perfect for catching typos. If you need to allow a massive number of errors, consider breaking the target text into smaller chunks (like sentences or words) before matching.\n\n### Unicode Character Indices vs. Byte Offsets\n\nIn C, strings are just arrays of bytes. An emoji like 🍎 takes up 4 bytes, which often breaks indexing when C-libraries pass data back to Ruby.\n\n`TreRegex` handles this for you under the hood. The `:index` and `:end_index` returned in the match hash are strictly mapped to **Ruby character indices**, not raw byte offsets.\n\n**Best Practice**: You can safely use the returned indices directly with standard Ruby string slicing, even if the text is filled with emojis or multi-byte characters. 
Do not use them with `String#byteslice`\n\n```ruby\nregex = TreRegex::Regex.new('apple')\ntarget = 'I ate 🍎 and an aple'\n\nresult = regex.exec(target, max_errors: 1)\n# =\u003e {match: \"aple\", submatches: [], index: 15, end_index: 19, cost: 1, errors: {insertions: 0, deletions: 1, substitutions: 0}}\n\n# This is 100% safe and will correctly return \"aple\"\ntarget[result[:index]...result[:end_index]]\n```\n\n### Overlapping Matches in `match_all`\n\nWhen using `match_all`, be aware that the engine consumes the string as it matches. By default, standard regex engines (including TRE) do not return overlapping matches.\n\nIf you search for `\"ana\"` in `\"banana\"`, it will only match the first `\"ana\"`. Once it consumes those characters, it moves on to the remaining `\"na\"`.\n\n```ruby\nregex = TreRegex::Regex.new('ana')\n\n# Returns 1 match, not 2!\nregex.match_all('banana').to_a\n# =\u003e [{match: \"ana\", submatches: [], index: 1, end_index: 4, cost: 0, errors: {insertions: 0, deletions: 0, substitutions: 0}}]\n```\n\nIf you need to find overlapping fuzzy matches, you will need to manually step through the string by advancing your starting index by 1 character after each search.\n\n## Development\n\nBecause `TreRegex` compiles the underlying TRE C-library from source, you must have standard C-compilation and `autotools` dependencies installed on your machine before running the setup script\n\n**Ubuntu / Debian Linux**\n\n```bash\nsudo apt-get update\nsudo apt-get install build-essential autoconf automake libtool gettext autopoint pkg-config\n```\n\n**macOS**\n\nInstall the autotools suite via [Homebrew](https://brew.sh/):\n\n```bash\nbrew install autoconf automake libtool gettext pkg-config\n```\n\nAfter checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. 
You can also run `bin/console` for an interactive prompt that will allow you to experiment.\n\nTo install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).\n\n## License\n\nThe gem is available as open source under the terms of the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fle0pard%2Ftre_regex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fle0pard%2Ftre_regex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fle0pard%2Ftre_regex/lists"}