{"id":23441895,"url":"https://github.com/genivia/fuzzymatcher","last_synced_at":"2025-04-13T10:43:40.370Z","repository":{"id":114467896,"uuid":"267367841","full_name":"Genivia/FuzzyMatcher","owner":"Genivia","description":"Fast fuzzy regex matcher: specify max edit distance to find approximate matches. FuzzyMatcher is now included in RE/flex.","archived":false,"fork":false,"pushed_at":"2025-02-04T14:26:56.000Z","size":69,"stargazers_count":36,"open_issues_count":0,"forks_count":6,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-27T02:08:09.057Z","etag":null,"topics":["edit-distance","fuzzy","fuzzy-matching","fuzzy-search","levenshtein-distance","regex"],"latest_commit_sha":null,"homepage":"https://github.com/Genivia/RE-flex","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Genivia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-27T16:15:12.000Z","updated_at":"2025-02-04T14:26:59.000Z","dependencies_parsed_at":"2024-06-07T18:58:42.404Z","dependency_job_id":"4c1bc2bc-b70f-46ac-b5f9-83da4cf4e7ab","html_url":"https://github.com/Genivia/FuzzyMatcher","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Genivia%2FFuzzyMatcher","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Genivia%2FFuzzyMatcher/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Genivia%2FFuzzyMatcher/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Genivia%2FFuzzyMatcher/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Genivia","download_url":"https://codeload.github.com/Genivia/FuzzyMatcher/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248701991,"owners_count":21148111,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["edit-distance","fuzzy","fuzzy-matching","fuzzy-search","levenshtein-distance","regex"],"created_at":"2024-12-23T17:19:13.640Z","updated_at":"2025-04-13T10:43:40.345Z","avatar_url":"https://github.com/Genivia.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"FuzzyMatcher\n============\n\nA C++ class extension of the [RE/flex](https://github.com/Genivia/RE-flex)\nMatcher class for efficient fuzzy matching and fuzzy search with regex patterns.\nRegex patterns are of the POSIX ERE type, but also support Unicode matching,\nlazy quantifiers, word boundaries and lookaheads.\n\n- specify max error as a parameter, i.e. 
the max edit distance or\n  [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)\n\n- regex patterns are compiled into DFA VM opcodes for speed\n\n- practically linear execution time in the length of the input, using\n  DFA-based matching with minimal backtracking limited by the specified max\n  error parameter\n\n- supports the full RE/flex regex pattern syntax, which is POSIX-based with\n  many additions: \u003chttps://www.genivia.com/doc/reflex/html/#reflex-patterns\u003e\n\n- no group captures (yet), except for top-level sub-pattern group captures,\n  e.g. `(foo)|(bar)|(baz)` but not `(foo(bar))`\n\n- newlines (`\\n`) and NUL (`\\0`) characters are never deleted or substituted\n  to ensure that fuzzy matches do not extend the pattern match beyond the\n  number of lines specified by the regex pattern\n\n- quote regex patterns with `\\Q` and `\\E` for fuzzy string matching and search\n\n- FuzzyMatcher is used in the [ugrep](https://github.com/Genivia/ugrep) project\n\nRequires\n--------\n\n[RE/flex](https://github.com/Genivia/RE-flex) version 4.0 or greater, because\nof regex pattern analysis and translation updates to RE/flex 4.0 that are\nalso used by [ugrep 5.0](https://github.com/Genivia/ugrep).\n\nExamples\n--------\n\npattern    | max | fuzzy `find()` matches            | but not\n---------- | --- | --------------------------------- | -------------------------\n`abc`      | 1   | `abc`, `ab`, `ac`, `axc`, `axbc`  | `a`, `axx`, `axbxc`, `bc`\n`año`      | 1   | `año`, `ano`, `ao`                | `anno`, `ño`\n`ab_cd`    | 2   | `ab_cd`, `ab-cd`, `ab Cd`, `abCd` | `ab\\ncd`, `Ab_cd`, `Abcd`\n`a[0-9]+z` | 1   | `a1z`, `a123z`, `az`, `axz`       | `axxz`, `A123z`, `123z`\n\nNote that the first character of the pattern must match when searching a corpus\nwith the `find()` method.  
By contrast, the `matches()` method, which matches a\ncorpus from start to end, does not impose this requirement:\n\npattern    | max | fuzzy `matches()` matches                            | but not\n---------- | --- | ---------------------------------------------------- | -------------------------\n`abc`      | 1   | `abc`, `ab`, `ac`, `Abc`, `xbc`, `bc`, `axc`, `axbc` | `a`, `axx`, `Ab`, `axbxc`\n`año`      | 1   | `año`, `Año`, `ano`, `ao`, `ño`                      | `anno`\n`ab_cd`    | 2   | `ab_cd`, `Ab_Cd`, `ab-cd`, `ab Cd`, `Ab_cd`, `abCd`  | `ab\\ncd`, `AbCd`\n`a[0-9]+z` | 1   | `a1z`, `A1z`, `a123z`, `az`, `Az`, `axz`, `123z`     | `axxz`\n\nOptimizations\n-------------\n\nFuzzy `find()` and `split()` make a second pass over a fuzzy-matched pattern\nwhen the match has a nonzero error.  This second pass checks if an exact match\nexists or if a better match exists that overlaps with the first pattern found.\nFor example, the pattern `abc` is found to fuzzy match all of the text `aabc`\nwith one error (an extra `a`).  The second pass of `find()` detects an exact\nmatch after skipping the first `a`.  Likewise, the pattern `abc` is found to\nfuzzy match `ababc` with a match for `aba` with one error (substitution of `c`\nby an `a`).  The second pass of `find()` detects an exact match after skipping\n`ab` in the text.  
This approach is faster than minimizing the edit distance\nwhen searching text, while returning exact matches when possible.\n\nUsage\n-----\n\n### Fuzzy searching\n\n    #include \"fuzzymatcher.h\"\n\n    // MAX:   optional maximum edit distance, default is 1, up to 255\n    // INPUT: a string, wide string, FILE*, or std::istream object\n    reflex::FuzzyMatcher matcher(\"PATTERN\", [MAX,] INPUT);\n\n    // find all pattern matches in the input\n    while (matcher.find())\n    {\n      std::cout \u003c\u003c matcher.text() \u003c\u003c '\\n';  // show each fuzzy match\n      std::cout \u003c\u003c matcher.edits() \u003c\u003c '\\n'; // edit distance (when \u003e 0 not guaranteed minimal)\n    }\n\nSee the [RE/flex user guide](https://www.genivia.com/doc/reflex/html/#regex-methods)\nfor the full list of `Matcher` class methods available to extract match info.\nThe `edits()` method is a `FuzzyMatcher` extension of the `Matcher` class.\n\n### Fuzzy matching\n\n    #include \"fuzzymatcher.h\"\n\n    // match the whole input (here in one go with a temporary fuzzy matcher object)\n    if (reflex::FuzzyMatcher(\"PATTERN\", [MAX,] INPUT).matches())\n    {\n      std::cout \u003c\u003c \"fuzzy pattern matched\\n\";\n    }\n\n### Fuzzy splitting (text between matches)\n\n    #include \"fuzzymatcher.h\"\n\n    reflex::FuzzyMatcher matcher(\"PATTERN\", [MAX,] INPUT);\n\n    // split the input into parts separated by pattern matches\n    while (matcher.split())\n    {\n      std::cout \u003c\u003c matcher.text() \u003c\u003c '\\n'; // show text between fuzzy matches\n    }\n\n### Character insertion, deletion and substitution\n\nThe `MAX` parameter may be combined with one or more of the following flags:\n\n- `reflex::FuzzyMatcher::INS` insertions allow extra character(s) in the input\n- `reflex::FuzzyMatcher::DEL` deletions allow missing character(s) in the input\n- `reflex::FuzzyMatcher::SUB` substitutions count as one edit\n- `reflex::FuzzyMatcher::BIN` ASCII/binary fuzzy 
matching (default is Unicode with Unicode pattern converter, see below)\n\nFor example, to allow approximate pattern matches to include up to three\ncharacter insertions, but no deletions or substitutions (allowing insertions\nonly is actually the most efficient fuzzy matching possible):\n\n    reflex::FuzzyMatcher matcher(regex, 3 | reflex::FuzzyMatcher::INS, INPUT);\n\nTo allow up to three insertions or deletions (note that a substitution counts\nas two edits: one insertion and one deletion):\n\n    reflex::FuzzyMatcher matcher(regex, 3 | reflex::FuzzyMatcher::INS | reflex::FuzzyMatcher::DEL, INPUT);\n\nWhen no flags are specified with `MAX`, fuzzy matching is performed with\ninsertions, deletions, and substitutions, each counting as one edit.\n\n### Full Unicode support\n\nTo support full Unicode pattern matching, such as `\\p` Unicode character\nclasses, convert the regex pattern before using it as follows:\n\n    std::string regex(reflex::Matcher::convert(\"PATTERN\", reflex::convert_flag::unicode));\n    reflex::FuzzyMatcher matcher(regex, [MAX,] INPUT);\n\n### Static regex patterns\n\nFixed patterns should be constructed (and optionally Unicode converted) just\nonce statically to avoid repeated construction, e.g. 
in the body of loops and\nin frequently executed functions:\n\n    static const reflex::Pattern pattern(reflex::Matcher::convert(\"PATTERN\", reflex::convert_flag::unicode));\n    reflex::FuzzyMatcher matcher(pattern, [MAX,] INPUT);\n\nCompiling\n---------\n\nAssuming `reflex` dir with RE/flex source code is locally built:\n\n    c++ -o myapp myapp.cpp -Ireflex/include reflex/lib/libreflex.a\n\nWhen the `libreflex` library is built and installed:\n\n    c++ -o myapp myapp.cpp -lreflex\n\nTesting\n-------\n\n    $ make ftest\n    $ ./ftest 'ab_cd' 'abCd' 2\n    matches(): match (2 edits)\n    find():    'abCd' at 0 (2 edits)\n    split():   '' at 0 (2 edits)\n    split():   '' at 4 (0 edits)\n\nLicense\n-------\n\nBSD-3\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgenivia%2Ffuzzymatcher","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgenivia%2Ffuzzymatcher","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgenivia%2Ffuzzymatcher/lists"}