{"id":20749101,"url":"https://github.com/g-research/ahocorasick_rs","last_synced_at":"2025-05-15T17:08:05.351Z","repository":{"id":37089790,"uuid":"358014052","full_name":"G-Research/ahocorasick_rs","owner":"G-Research","description":"Check for multiple patterns in a single string at the same time: a fast Aho-Corasick algorithm for Python","archived":false,"fork":false,"pushed_at":"2025-05-14T12:56:01.000Z","size":307,"stargazers_count":183,"open_issues_count":4,"forks_count":13,"subscribers_count":17,"default_branch":"main","last_synced_at":"2025-05-15T17:07:55.873Z","etag":null,"topics":["aho-corasick","pattern-matching","python","rust"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/G-Research.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-14T19:06:11.000Z","updated_at":"2025-05-06T11:53:39.000Z","dependencies_parsed_at":"2023-02-16T10:01:48.578Z","dependency_job_id":"efa4e0c1-3fe2-4184-93c9-8d419fedcb11","html_url":"https://github.com/G-Research/ahocorasick_rs","commit_stats":{"total_commits":147,"total_committers":3,"mean_commits":49.0,"dds":0.3945578231292517,"last_synced_commit":"839a84f828b0caa24f2b19f7ee202d21cf501ff6"},"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fahocorasick_rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fahocorasick_rs/tags","rele
ases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fahocorasick_rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fahocorasick_rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/G-Research","download_url":"https://codeload.github.com/G-Research/ahocorasick_rs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254384988,"owners_count":22062422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aho-corasick","pattern-matching","python","rust"],"created_at":"2024-11-17T08:21:03.262Z","updated_at":"2025-05-15T17:08:00.342Z","avatar_url":"https://github.com/G-Research.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ahocorasick_rs: Quickly search for multiple substrings at once\n\n`ahocorasick_rs` allows you to search for multiple substrings (\"patterns\") in a given string (\"haystack\") using variations of the [Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm).\n\nIn particular, it's implemented as a wrapper of the Rust [`aho-corasick`](https://docs.rs/aho-corasick/) library, and provides a faster alternative to the [`pyahocorasick`](https://pyahocorasick.readthedocs.io/) library.\n\nFound any problems or have any questions? 
[File an issue on the GitHub project](https://github.com/G-Research/ahocorasick_rs).\n\n* [Quickstart](#quickstart)\n* [Choosing the matching algorithm](#matching)\n* [Additional configuration: speed and memory usage tradeoffs](#configuration2)\n* [Implementation details](#implementation)\n* [Benchmarks](#benchmarks)\n\n## Quickstart \u003ca name=\"quickstart\"\u003e\u003c/a\u003e\n\nThe `ahocorasick_rs` library allows you to search for multiple strings (\"patterns\") within a haystack, or alternatively to search for multiple `bytes` patterns.\nTo get started, install the library:\n\n```shell-session\n$ pip install ahocorasick-rs\n```\n\n### Searching strings\n\nWe can construct an `AhoCorasick` object:\n\n```python\n\u003e\u003e\u003e import ahocorasick_rs\n\u003e\u003e\u003e patterns = [\"hello\", \"world\", \"fish\"]\n\u003e\u003e\u003e haystack = \"this is my first hello world. hello!\"\n\u003e\u003e\u003e ac = ahocorasick_rs.AhoCorasick(patterns)\n```\n\nYou can construct an `AhoCorasick` object from any iterable (including generators), not just lists:\n\n```python\n\u003e\u003e\u003e ac = ahocorasick_rs.AhoCorasick((p.lower() for p in patterns))\n```\n\n`AhoCorasick.find_matches_as_indexes()` returns a list of tuples, each tuple being:\n\n1. The index of the found pattern inside the list of patterns.\n2. The start index of the pattern inside the haystack.\n3. 
The end index of the pattern inside the haystack.\n\n```python\n\u003e\u003e\u003e ac.find_matches_as_indexes(haystack)\n[(0, 17, 22), (1, 23, 28), (0, 30, 35)]\n\u003e\u003e\u003e patterns[0], patterns[1], patterns[0]\n('hello', 'world', 'hello')\n\u003e\u003e\u003e haystack[17:22], haystack[23:28], haystack[30:35]\n('hello', 'world', 'hello')\n```\n\n`find_matches_as_strings()` returns a list of found patterns:\n\n```python\n\u003e\u003e\u003e ac.find_matches_as_strings(haystack)\n['hello', 'world', 'hello']\n```\n\n### Searching `bytes` and other similar objects\n\nYou can also search `bytes`, `bytearray`, `memoryview`, and other objects supporting the Python buffer API.\n\n```python\n\u003e\u003e\u003e patterns = [b\"hello\", b\"world\"]\n\u003e\u003e\u003e ac = ahocorasick_rs.BytesAhoCorasick(patterns)\n\u003e\u003e\u003e haystack = b\"hello world\"\n\u003e\u003e\u003e ac.find_matches_as_indexes(haystack)\n[(0, 0, 5), (1, 6, 11)]\n\u003e\u003e\u003e patterns[0], patterns[1]\n(b'hello', b'world')\n\u003e\u003e\u003e haystack[0:5], haystack[6:11]\n(b'hello', b'world')\n```\n\nThe `find_matches_as_strings()` API is not supported by `BytesAhoCorasick`.\n\n## Choosing the matching algorithm \u003ca name=\"matching\"\u003e\u003c/a\u003e\n\n### Match kind\n\nThere are three ways you can configure matching in cases where multiple patterns overlap, supported by both `AhoCorasick` and `BytesAhoCorasick` objects.\nFor a more in-depth explanation, see the [underlying Rust library's documentation of matching](https://docs.rs/aho-corasick/latest/aho_corasick/enum.MatchKind.html).\n\nAssume we have this starting point:\n\n```python\n\u003e\u003e\u003e from ahocorasick_rs import AhoCorasick, MatchKind\n```\n\n#### `Standard` (the default)\n\nThis returns the pattern that matches first, semantically speaking.\nThis is the default match kind.\n\n```python\n\u003e\u003e\u003e ac = AhoCorasick([\"disco\", \"disc\", \"discontent\"])\n\u003e\u003e\u003e 
ac.find_matches_as_strings(\"discontent\")\n['disc']\n```\n\nIn this case `disc` will match before `disco` or `discontent`.\n\nSimilarly, `b` will match before `abcd` because it ends earlier in the haystack than `abcd` does:\n\n```python\n\u003e\u003e\u003e ac = AhoCorasick([\"b\", \"abcd\"])\n\u003e\u003e\u003e ac.find_matches_as_strings(\"abcdef\")\n['b']\n```\n\n#### `LeftmostFirst`\n\nThis returns the leftmost-in-the-haystack matching pattern that appears first in _the list of given patterns_.\nThat means the order of patterns makes a difference:\n\n```python\n\u003e\u003e\u003e ac = AhoCorasick([\"disco\", \"disc\"], matchkind=MatchKind.LeftmostFirst)\n\u003e\u003e\u003e ac.find_matches_as_strings(\"discontent\")\n['disco']\n\u003e\u003e\u003e ac = AhoCorasick([\"disc\", \"disco\"], matchkind=MatchKind.LeftmostFirst)\n\u003e\u003e\u003e ac.find_matches_as_strings(\"discontent\")\n['disc']\n```\n\nHere we see `abcd` matched first, because it starts before `b`:\n\n```python\n\u003e\u003e\u003e ac = AhoCorasick([\"b\", \"abcd\"], matchkind=MatchKind.LeftmostFirst)\n\u003e\u003e\u003e ac.find_matches_as_strings(\"abcdef\")\n['abcd']\n```\n\n#### `LeftmostLongest`\n\nThis returns the leftmost-in-the-haystack matching pattern that is longest:\n\n```python\n\u003e\u003e\u003e ac = AhoCorasick([\"disco\", \"disc\", \"discontent\"], matchkind=MatchKind.LeftmostLongest)\n\u003e\u003e\u003e ac.find_matches_as_strings(\"discontent\")\n['discontent']\n```\n\n### Overlapping matches\n\nYou can get all overlapping matches, instead of just one of them, but only if you stick to the default matchkind, `MatchKind.Standard`.\nAgain, this is supported by both `AhoCorasick` and `BytesAhoCorasick`.\n\n```python\n\u003e\u003e\u003e from ahocorasick_rs import AhoCorasick\n\u003e\u003e\u003e patterns = [\"winter\", \"onte\", \"disco\", \"discontent\"]\n\u003e\u003e\u003e ac = AhoCorasick(patterns)\n\u003e\u003e\u003e 
ac.find_matches_as_strings(\"discontent\", overlapping=True)\n['disco', 'onte', 'discontent']\n```\n\n## Additional configuration: speed and memory usage tradeoffs \u003ca name=\"configuration2\"\u003e\u003c/a\u003e\n\n### Algorithm implementations: trading construction speed, memory, and performance (`AhoCorasick` and `BytesAhoCorasick`)\n\nYou can choose the type of underlying automaton to use, with different performance tradeoffs.\nThe short version: if you want maximum matching speed, and you don't have too many patterns, try the `Implementation.DFA` implementation and see if it helps.\n\nThe underlying Rust library supports [four choices](https://docs.rs/aho-corasick/latest/aho_corasick/struct.AhoCorasickBuilder.html#method.kind), which are exposed as follows:\n\n* `None` uses a heuristic to choose the \"best\" Aho-Corasick implementation for the given patterns, balancing construction time, memory usage, and matching speed.\n  This is the default.\n* `Implementation.NoncontiguousNFA`: A noncontiguous NFA is the fastest to be built, has moderate memory usage and is typically the slowest to execute a search.\n* `Implementation.ContiguousNFA`: A contiguous NFA is a little slower to build than a noncontiguous NFA, has excellent memory usage and is typically a little slower than a DFA for a search.\n* `Implementation.DFA`: A DFA is very slow to build, uses exorbitant amounts of memory, but will typically execute searches the fastest.\n\n```python\n\u003e\u003e\u003e from ahocorasick_rs import AhoCorasick, Implementation\n\u003e\u003e\u003e ac = AhoCorasick([\"disco\", \"disc\"], implementation=Implementation.DFA)\n```\n\n### Trading memory for speed (`AhoCorasick` only)\n\nIf you use ``find_matches_as_strings()``, there are two ways strings can be constructed: from the haystack, or by caching the patterns on the object.\nThe former takes more work, the latter uses more memory if the patterns would otherwise have been garbage-collected.\nYou can control the behavior 
by using the `store_patterns` keyword argument to `AhoCorasick()`.\n\n* ``AhoCorasick(..., store_patterns=None)``: The default.\n  Use a heuristic (currently, whether the total of pattern string lengths is less than 4096 characters) to decide whether to store patterns or not.\n* ``AhoCorasick(..., store_patterns=True)``: Keep references to the patterns, potentially speeding up ``find_matches_as_strings()`` at the cost of using more memory.\n  If this uses large amounts of memory this might actually slow things down due to pressure on the CPU memory cache, and/or the performance benefit might be overwhelmed by the algorithm's search time.\n* ``AhoCorasick(..., store_patterns=False)``: Don't keep references to the patterns, saving some memory but potentially slowing down ``find_matches_as_strings()``, especially when there are only a small number of patterns and you are searching a small haystack.\n\n## Implementation details \u003ca name=\"implementation\"\u003e\u003c/a\u003e\n\n* Matching on strings releases the GIL, to enable concurrency.\n  Matching on bytes does not currently release the GIL for memory-safety reasons, unless the haystack type is `bytes`.\n* Not all features from the underlying library are exposed; if you would like additional features, please [file an issue](https://github.com/g-research/ahocorasick_rs/issues/new) or submit a PR.\n\n## Benchmarks \u003ca name=\"benchmarks\"\u003e\u003c/a\u003e\n\nAs with any benchmark, real-world results will differ based on your particular situation.\nIf performance is important to your application, measure the alternatives yourself!\n\nThat being said, I've seen `ahocorasick_rs` run 1.5× to 7× as fast as `pyahocorasick`, depending on the options used.\nYou can run the included benchmarks, if you want, to see some comparative results locally.\nClone the repository, then:\n\n```\npip install pytest-benchmark ahocorasick_rs pyahocorasick\npytest 
benchmarks/\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fg-research%2Fahocorasick_rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fg-research%2Fahocorasick_rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fg-research%2Fahocorasick_rs/lists"}