{"id":19946106,"url":"https://github.com/quantco/multiregex","last_synced_at":"2025-05-03T16:32:50.557Z","repository":{"id":37959802,"uuid":"460134570","full_name":"Quantco/multiregex","owner":"Quantco","description":"Quickly match many regexes against a string","archived":false,"fork":false,"pushed_at":"2025-04-07T04:09:49.000Z","size":410,"stargazers_count":30,"open_issues_count":4,"forks_count":2,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-04-17T17:40:06.316Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Quantco.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-16T18:46:30.000Z","updated_at":"2025-04-01T20:27:35.000Z","dependencies_parsed_at":"2023-02-13T00:01:20.447Z","dependency_job_id":"c5c7094b-7abd-4ccc-89c1-ee7272773037","html_url":"https://github.com/Quantco/multiregex","commit_stats":{"total_commits":41,"total_committers":3,"mean_commits":"13.666666666666666","dds":"0.24390243902439024","last_synced_commit":"cb84fadbdb1635a376cb9ad3232698f06b009168"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantco%2Fmultiregex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantco%2Fmultiregex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Quantco%2Fmultiregex/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/Quantco%2Fmultiregex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Quantco","download_url":"https://codeload.github.com/Quantco/multiregex/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252216099,"owners_count":21713099,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T00:28:18.914Z","updated_at":"2025-05-03T16:32:50.147Z","avatar_url":"https://github.com/Quantco.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# multiregex\n\n[![CI](https://img.shields.io/github/actions/workflow/status/quantco/multiregex/ci.yml?style=flat-square\u0026branch=main)](https://github.com/quantco/multiregex/actions/workflows/ci.yml)\n[![conda-forge](https://img.shields.io/conda/vn/conda-forge/multiregex?logoColor=white\u0026logo=conda-forge\u0026style=flat-square)](https://prefix.dev/channels/conda-forge/packages/multiregex)\n[![pypi-version](https://img.shields.io/pypi/v/multiregex.svg?logo=pypi\u0026logoColor=white\u0026style=flat-square)](https://pypi.org/project/multiregex)\n[![python-version](https://img.shields.io/pypi/pyversions/multiregex?logoColor=white\u0026logo=python\u0026style=flat-square)](https://pypi.org/project/multiregex)\n\nQuickly match many regexes against a string. 
Provides 2-10x speedups over naïve regex matching.\n\n## Introduction\n\nSee [this introductory blog post](https://tech.quantco.com/2022/07/31/multiregex.html).\n\n## Installation\n\nThis project is managed by [pixi](https://pixi.sh).\nYou can install the package in development mode using:\n\n```bash\ngit clone https://github.com/quantco/multiregex\ncd multiregex\n\npixi run pre-commit-install\npixi run postinstall\npixi run test\n```\n\n## Usage\n\n```py\nimport multiregex\n\n# Create matcher from multiple regexes.\nmy_patterns = [r\"\\w+@\\w+\\.com\", r\"\\w+\\.com\"]\nmatcher = multiregex.RegexMatcher(my_patterns)\n\n# Run `re.search` for all regexes.\n# Returns a list of matches as (re.Pattern, re.Match) tuples.\nmatcher.search(\"john.doe@example.com\")\n# =\u003e [(re.compile('\\\\w+@\\\\w+\\\\.com'), \u003cre.Match ... 'doe@example.com'\u003e),\n#     (re.compile('\\\\w+\\\\.com'), \u003cre.Match ... 'example.com'\u003e)]\n\n# Same as above, but with `re.match`.\nmatcher.match(...)\n# Same as above, but with `re.fullmatch`.\nmatcher.fullmatch(...)\n```\n\n### Custom prematchers\n\nTo be able to quickly match many regexes against a string, `multiregex` uses\n\"prematchers\" under the hood. 
Prematchers are lists of non-regex strings of which\nat least one can be assumed to be present in the haystack if the corresponding regex matches.\nAs an example, a valid prematcher of `r\"\\w+\\.com\"` could be `[\".com\"]` and a valid\nprematcher of `r\"(B|b)aNäNa\"` could be `[\"b\"]` or `[\"anäna\"]`.\nNote that prematchers must be all-lowercase (in order for `multiregex` to be able to support `re.IGNORECASE`).\n\nYou will likely have to provide your own prematchers for all but the simplest\nregex patterns:\n\n```py\nmultiregex.RegexMatcher([r\"\\d+\"])\n# =\u003e ValueError: Could not generate prematcher : '\\\\d+'\n```\n\nTo provide custom prematchers, pass `(pattern, prematchers)` tuples:\n\n```py\nmultiregex.RegexMatcher([(r\"\\d+\", map(str, range(10)))])\n```\n\nTo use a mixture of automatic and custom prematchers, pass `prematchers=None`:\n\n```py\nmatcher = multiregex.RegexMatcher([(r\"\\d+\", map(str, range(10))), (r\"\\w+\\.com\", None)])\nmatcher.prematchers\n# =\u003e {(re.compile('\\\\d+'), {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}),\n#     (re.compile('\\\\w+\\\\.com'), {'com'})}\n```\n\n### Disabling prematchers\n\nTo disable prematching for certain patterns entirely (i.e., always run the regex\nwithout first running any prematchers), pass an empty list of prematchers:\n\n```py\nmultiregex.RegexMatcher([(r\"super complicated regex\", [])])\n```\n\n### Profiling prematchers\n\nTo check if your prematchers are effective, you can use the built-in prematcher \"profiler\":\n\n```py\nyyyy_mm_dd = r\"(19|20)\\d\\d-\\d\\d-\\d\\d\"  # Default prematchers: {'-'}\nmatcher = multiregex.RegexMatcher([yyyy_mm_dd], count_prematcher_false_positives=True)\nfor string in my_benchmark_dataset:\n    matcher.search(string)\nprint(matcher.format_prematcher_false_positives())\n# =\u003e For example:\n# FP count | FP rate | Pattern / Prematchers\n# ---------+---------+----------------------\n#      137 |    0.72 | (19|20)\\d\\d-\\d\\d-\\d\\d / {'-'}\n```\n\nIn 
this example, there were 137 input strings that were matched positive by the prematcher but negative by the regex.\nIn other words, the prematcher failed to prevent slow regex evaluation in 72% of the cases.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquantco%2Fmultiregex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquantco%2Fmultiregex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquantco%2Fmultiregex/lists"}