{"id":16837850,"url":"https://github.com/hankcs/ahocorasickdoublearraytrie","last_synced_at":"2025-05-14T15:07:42.027Z","repository":{"id":30059751,"uuid":"33609151","full_name":"hankcs/AhoCorasickDoubleArrayTrie","owner":"hankcs","description":"An extremely fast implementation of Aho Corasick algorithm based on Double Array Trie.","archived":false,"fork":false,"pushed_at":"2021-11-24T17:09:04.000Z","size":3210,"stargazers_count":966,"open_issues_count":31,"forks_count":294,"subscribers_count":59,"default_branch":"master","last_synced_at":"2025-04-04T23:02:07.621Z","etag":null,"topics":["aho-corasick","algorithm","doublearraytrie","fast","java"],"latest_commit_sha":null,"homepage":"http://www.hankcs.com/program/algorithm/aho-corasick-double-array-trie.html","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hankcs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-04-08T13:40:30.000Z","updated_at":"2025-03-24T15:24:58.000Z","dependencies_parsed_at":"2022-07-18T04:00:34.156Z","dependency_job_id":null,"html_url":"https://github.com/hankcs/AhoCorasickDoubleArrayTrie","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hankcs%2FAhoCorasickDoubleArrayTrie","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hankcs%2FAhoCorasickDoubleArrayTrie/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hankcs%2FAhoCorasickDoubleArrayTrie/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hankcs%2FAhoCorasickDoubleArrayTrie/
manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hankcs","download_url":"https://codeload.github.com/hankcs/AhoCorasickDoubleArrayTrie/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248505871,"owners_count":21115354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aho-corasick","algorithm","doublearraytrie","fast","java"],"created_at":"2024-10-13T12:19:04.930Z","updated_at":"2025-04-12T01:50:34.322Z","avatar_url":"https://github.com/hankcs.png","language":"Java","readme":"AhoCorasickDoubleArrayTrie\n============\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.hankcs/aho-corasick-double-array-trie/badge.svg)](https://maven-badges.herokuapp.com/maven-central/com.hankcs/aho-corasick-double-array-trie/)\n[![GitHub release](https://img.shields.io/github/release/hankcs/AhoCorasickDoubleArrayTrie.svg)](https://github.com/hankcs/AhoCorasickDoubleArrayTrie/releases)\n[![License](https://img.shields.io/badge/license-Apache%202-4EB1BA.svg)](https://www.apache.org/licenses/LICENSE-2.0.html)\n\nAn extremely fast implementation of Aho Corasick algorithm based on Double Array Trie structure. 
Its speed is 5 to 9 times that of naive implementations; perhaps it is the fastest implementation so far ;-)\n\nIntroduction\n------------\nYou may have heard that the Aho-Corasick algorithm is fast for parsing text with a huge dictionary, for example:\n* looking for certain words in texts in order to link them to URLs or emphasize them\n* adding semantics to plain text\n* checking against a dictionary to see if syntactic errors were made\n\nBut most implementations use a `TreeMap\u003cCharacter, State\u003e` to store the *goto* structure, so each transition costs `O(lg(t))` time, where `t` is the largest number of common prefixes of a word. The final complexity is `O(n * lg(t))`, and since `t \u003e 2` certainly holds, `n * lg(t) \u003e n`. Other implementations use a `HashMap`, which wastes too much memory and is still slow.\n\nI improved it by replacing the `XXXMap` with a Double Array Trie, whose transition lookup takes just `O(1)` time, so the total complexity is `O(n)`, striking a good balance between time and memory. Yes, its speed does not depend on the length, language, or common prefixes of the words in the dictionary.\n\nThis implementation has been widely used in my [HanLP: Han Language Processing](https://github.com/hankcs/HanLP) package. I hope it can serve as a common data structure library in projects handling text or NLP tasks.\n\nDependency\n----------\nInclude this dependency in your POM. 
Be sure to check for the latest version in Maven Central.\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.hankcs\u003c/groupId\u003e\n  \u003cartifactId\u003eaho-corasick-double-array-trie\u003c/artifactId\u003e\n  \u003cversion\u003e1.2.3\u003c/version\u003e\n\u003c/dependency\u003e\n```\nOr include this dependency in your `build.gradle.kts`:\n```kotlin\nimplementation(\"com.hankcs:aho-corasick-double-array-trie:1.2.3\")\n```\n\nUsage\n-----\nSetting up the `AhoCorasickDoubleArrayTrie` is a piece of cake:\n\n```java\n        // Collect test data set\n        TreeMap\u003cString, String\u003e map = new TreeMap\u003cString, String\u003e();\n        String[] keyArray = new String[]\n                {\n                        \"hers\",\n                        \"his\",\n                        \"she\",\n                        \"he\"\n                };\n        for (String key : keyArray)\n        {\n            map.put(key, key);\n        }\n        // Build an AhoCorasickDoubleArrayTrie\n        AhoCorasickDoubleArrayTrie\u003cString\u003e acdat = new AhoCorasickDoubleArrayTrie\u003cString\u003e();\n        acdat.build(map);\n        // Test it\n        final String text = \"uhers\";\n        List\u003cAhoCorasickDoubleArrayTrie.Hit\u003cString\u003e\u003e wordList = acdat.parseText(text);\n```\n\nOf course, there remain many useful methods to discover. Feel free to try:\n* Use a `Map\u003cString, SomeObject\u003e` to assign a `SomeObject` as the value of a keyword.\n* Store the `AhoCorasickDoubleArrayTrie` to disk by calling the `save` method.\n* Restore the `AhoCorasickDoubleArrayTrie` from disk by calling the `load` method.\n* Use it in concurrent code. 
`AhoCorasickDoubleArrayTrie` is thread-safe once `build` has been called.\n\nIn other situations you probably do not need to materialize a huge `wordList`; in that case, pass a callback instead:\n\n```java\n        acdat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit\u003cString\u003e()\n        {\n            @Override\n            public void hit(int begin, int end, String value)\n            {\n                System.out.printf(\"[%d:%d]=%s\\n\", begin, end, value);\n            }\n        });\n```\n\nor use a lambda function:\n\n```java\n        acdat.parseText(text, (begin, end, value) -\u003e {\n            System.out.printf(\"[%d:%d]=%s\\n\", begin, end, value);\n        });\n```\n\nComparison\n-----\nI compared my AhoCorasickDoubleArrayTrie with robert-bor's aho-corasick. In the results, ACDAT stands for AhoCorasickDoubleArrayTrie and Naive stands for aho-corasick; times are in milliseconds:\n\n```\nParsing English document which contains 3409283 characters, with a dictionary of 127142 words.\n               \tNaive          \tACDAT\ntime           \t607            \t102\nchar/s         \t5616611.20     \t33424343.14\nrate           \t1.00           \t5.95\n===========================================================================\nParsing Chinese document which contains 1290573 characters, with a dictionary of 146047 words.\n               \tNaive          \tACDAT\ntime           \t319            \t35\nchar/s         \t2609156.74     \t23780600.00\nrate           \t1.00           \t9.11\n===========================================================================\n```\n\nIn the English test, AhoCorasickDoubleArrayTrie is nearly 6 times faster (5.95x). In the Chinese test, it is about 9 times faster (9.11x).\nThis test was run on a 2.0 GHz i7 with `-Xms512m -Xmx512m -Xmn256m`. 
Feel free to re-run this test in `TestAhoCorasickDoubleArrayTrie`; the test data is included.\n\nThanks\n-----\nThis project is inspired by [aho-corasick](https://github.com/robert-bor/aho-corasick) and [darts-clone-java](https://github.com/hiroshi-manabe/darts-clone-java).\nMany thanks!\n\nLicense\n-------\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n\thttp://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhankcs%2Fahocorasickdoublearraytrie","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhankcs%2Fahocorasickdoublearraytrie","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhankcs%2Fahocorasickdoublearraytrie/lists"}