{"id":19437671,"url":"https://github.com/mganss/ahocorasick","last_synced_at":"2025-04-09T16:10:41.737Z","repository":{"id":29402220,"uuid":"32937595","full_name":"mganss/AhoCorasick","owner":"mganss","description":"Aho-Corasick multi-string search for .NET and SQL Server.","archived":false,"fork":false,"pushed_at":"2025-02-22T16:31:06.000Z","size":260,"stargazers_count":60,"open_issues_count":0,"forks_count":11,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-09T16:10:40.079Z","etag":null,"topics":["aho-corasick","multi-string","sql-clr","sql-server","string-search"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mganss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-03-26T16:03:39.000Z","updated_at":"2025-02-28T06:57:51.000Z","dependencies_parsed_at":"2024-04-25T05:28:16.246Z","dependency_job_id":"3016f34d-649e-4793-88a3-ab70bfc6bbf9","html_url":"https://github.com/mganss/AhoCorasick","commit_stats":{"total_commits":104,"total_committers":4,"mean_commits":26.0,"dds":0.6730769230769231,"last_synced_commit":"866113658535e7f5844b52bbec4d72fc362044e5"},"previous_names":[],"tags_count":77,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mganss%2FAhoCorasick","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mganss%2FAhoCorasick/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mganss%2FAhoCorasick/releases","manifests_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mganss%2FAhoCorasick/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mganss","download_url":"https://codeload.github.com/mganss/AhoCorasick/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248065285,"owners_count":21041872,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aho-corasick","multi-string","sql-clr","sql-server","string-search"],"created_at":"2024-11-10T15:15:29.470Z","updated_at":"2025-04-09T16:10:41.716Z","avatar_url":"https://github.com/mganss.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AhoCorasick\n\n[![Version](https://img.shields.io/nuget/v/AhoCorasick.svg)](https://www.nuget.org/packages/AhoCorasick)\n[![Build status](https://ci.appveyor.com/api/projects/status/b8lxercfn9spio95/branch/master?svg=true)](https://ci.appveyor.com/project/mganss/ahocorasick/branch/master)\n[![Coverage Status](https://coveralls.io/repos/mganss/AhoCorasick/badge.svg?branch=master\u0026service=github)](https://coveralls.io/github/mganss/AhoCorasick?branch=master)\n[![netstandard2.0](https://img.shields.io/badge/netstandard-2.0-brightgreen.svg)](https://img.shields.io/badge/netstandard-2.0-brightgreen.svg)\n[![net40](https://img.shields.io/badge/net-40-brightgreen.svg)](https://img.shields.io/badge/net-40-brightgreen.svg)\n\nThis is an implementation of the [Aho-Corasick](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) string matching algorithm for .NET 
(netstandard2.0 and net40) and SQL Server (SQL CLR). Mostly ported from [xudejian/aho-corasick](https://github.com/xudejian/aho-corasick) in CoffeeScript.\n\n## Usage\n\n```C#\nvar ac = new AhoCorasick(\"a\", \"ab\", \"bab\", \"bc\", \"bca\", \"c\", \"caa\");\nvar results = ac.Search(\"abccab\").ToList();\n\nAssert.AreEqual(0, results[0].Index); // index into the searched text\nAssert.AreEqual(\"a\", results[0].Word); // matched word\n// ...\n```\n\nor\n\n```C#\nvar results = \"abccab\".Contains(\"a\", \"ab\", \"bab\", \"bc\", \"bca\", \"c\", \"caa\").ToList();\n```\n\n### Custom char comparison\n\nYou can optionally supply an `IEqualityComparer\u003cchar\u003e` to perform custom char comparisons when searching for substrings. Several implementations with comparers that mirror `StringComparer` are included.\n\n```C#\nvar results = \"AbCcab\".Contains(CharComparer.OrdinalIgnoreCase, \"a\", \"ab\", \"c\").ToList();\n```\n\n## SQL CLR Functions\n\nThere are also several SQL CLR user defined functions that can be used to perform fast substring matching\nin Microsoft SQL Server. To use this:\n\n1. Make sure you have [enabled CLR integration](https://msdn.microsoft.com/en-us/library/ms131048.aspx)\n2. Execute [AhoCorasick.SqlClr_Create.sql](AhoCorasick.SqlClr/dist/AhoCorasick.SqlClr_Create.sql)\n\nFor one-off queries, you can use the functions that rebuild the trie on each query, e.g.\n\n```SQL\nselect top(100) * from Posts P\nwhere dbo.ContainsWords((select Word from Words for xml raw, root('root')), P.Body, 'o') = 1\n```\n\nThe words to match are always supplied as XML where the values are taken from the first attribute of all elements directly beneath the root node. Be careful to select the word column as the only or first column otherwise you'll end up matching the wrong words. 
The XML in the example above looks like this:\n\n```XML\n\u003croot\u003e\n  \u003crow Word=\"Aachen\" /\u003e\n  \u003crow Word=\"Aaliyah\" /\u003e\n  \u003crow Word=\"aardvark\" /\u003e\n  ...\n\u003c/root\u003e\n```\n\n[Here's more](https://www.simple-talk.com/sql/learn-sql-server/using-the-for-xml-clause-to-return-query-results-as-xml/) about FOR XML.\n\nThe last parameter in the function indicates the culture to use since there is no way to use SQL Server collations in SQL CLR code. Values can be:\n\n|Value|Character comparison|\n|-----|--------------------|\n|c|Current Culture|\n|n|Invariant Culture|\n|o or Empty|Ordinal|\n|Culture name, e.g. \"de-de\"|Specific [.NET Culture](https://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo.name.aspx)|\n\nThe culture identifier can be suffixed by `:i` indicating case-insensitive matching.\n\n### Static objects\n\nThe function in the example above has the problem that the trie is rebuilt for each query even though the input always stays the same. To overcome this problem, there are a number of functions to manage the creation and destruction of static objects whose handles can be saved in SQL variables. Example:\n\n```SQL\ndeclare @ac nvarchar(32);\nset @ac = dbo.CreateAhoCorasick((select Word from Words for xml raw, root('root')), 'en-us:i');\nselect * from Posts P\nwhere dbo.ContainsWordsByObject(P.Body, @ac) = 1;\n```\n\nThis is a lot faster than the first example because the trie is created only once and then reused for each row in the query. The handle (@ac) is a hash value generated from the words to match and the culture. The corresponding object is saved in a static dictionary. You can list the currently active objects using `dbo.ListAhoCorasick()`, remove all objects using `dbo.ClearAhoCorasick()` or remove only one object using `dbo.DeleteAhoCorasick(@ac)`.\n\n### Getting all matches\n\nThe examples above only checked if the words occurred in the queried texts. 
If you want to get the matched words and the indexes where they occur in the queried texts, you can use the supplied table-valued functions. For example:\n\n```SQL\ndeclare @ac nvarchar(32);\nset @ac = dbo.CreateAhoCorasick((select Word from Words for xml raw, root('root')), 'o');\nselect top(100) * from Posts P\ncross apply dbo.ContainsWordsTableByObject(P.Body, @ac) W\n```\n\nThis will return a table such as this:\n\n|ID   |Body   |Index   |Word   |\n|---|---|---|---|\n|1 |What factors related...|5|factor|\n|1 |What factors related...|6|actor|\n|1 |What factors related...|5|factors|\n|...|\n\n### Word boundaries\n\nThere are also functions that return only matches occurring at word boundaries: `dbo.ContainsWordsBoundedByObject()` and `dbo.ContainsWordsBoundedTableByObject()`. Word boundaries here are the same as [`\\b` in regexes](http://www.regular-expressions.info/wordboundaries.html), i.e. matches will occur as if words were specified as `\\bword\\b`.\n\n### Forcing parallelism\n\nAlthough these kinds of queries lend themselves very well to parallel execution, SQL Server tends to overestimate the cost of parallel queries and builds non-parallel plans most of the time where user defined functions are involved. 
You can force a parallel plan by using a trace flag (more about this [here](http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx)):\n\n```SQL\ndeclare @ac nvarchar(32);\nset @ac = dbo.CreateAhoCorasick((select Word from Words for xml raw, root('root')), 'en-us:i');\nselect * from Posts P\nwhere dbo.ContainsWordsBoundedByObject(P.Body, @ac) = 1\nOPTION (RECOMPILE, QUERYTRACEON 8649)\n```\n\nParallel operators are identified by a yellow badge with two arrows in the query plan.\n\n### Performance\n\nHere's a benchmark searching for ~5000 words (average length 7) in ~250,000 texts (average length ~900):\n\n|SQL|AhoCorasick|\n|---|-----------|\n|560s|7s|\n\nThe SQL query used was this:\n\n```SQL\nselect * from Posts P\nwhere exists (select * from Words W where CHARINDEX(W.Word, P.Text) \u003e 0)\n```\n\n#### But I can simply use full-text search\n\nNo. The [CONTAINS](https://msdn.microsoft.com/en-us/library/ms187787.aspx) predicate can only search for a single literal or variable at a time. You can't use it in a join or subquery to search for a column value of a table in the query, i.e. this won't work:\n\n```SQL\nselect * from Posts P\nwhere exists (select * from Words W where CONTAINS(P.Text, W.Word))\n```\n\nIf you know of a way to make this work using FTS (perhaps using a cursor?) let me know.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmganss%2Fahocorasick","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmganss%2Fahocorasick","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmganss%2Fahocorasick/lists"}