{"id":13672008,"url":"https://github.com/daac-tools/daachorse","last_synced_at":"2025-05-15T09:03:10.888Z","repository":{"id":36988765,"uuid":"402264091","full_name":"daac-tools/daachorse","owner":"daac-tools","description":"🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.","archived":false,"fork":false,"pushed_at":"2024-12-29T10:50:00.000Z","size":3893,"stargazers_count":213,"open_issues_count":4,"forks_count":15,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-14T15:00:04.149Z","etag":null,"topics":["aho-corasick","double-array","finite-state-machine","no-std","rust","search","substring-matching","text-processing"],"latest_commit_sha":null,"homepage":"https://docs.rs/daachorse","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/daac-tools.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-09-02T02:24:58.000Z","updated_at":"2025-03-26T02:25:58.000Z","dependencies_parsed_at":"2024-06-06T00:52:25.452Z","dependency_job_id":"fe1d23e9-18b0-4d49-b8a4-ea48beedf6e1","html_url":"https://github.com/daac-tools/daachorse","commit_stats":{"total_commits":189,"total_committers":3,"mean_commits":63.0,"dds":0.328042328042328,"last_synced_commit":"811fd7021a387efb49f103e36b003d4d102e266d"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Fdaachorse","tags_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/daac-tools%2Fdaachorse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Fdaachorse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daac-tools%2Fdaachorse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/daac-tools","download_url":"https://codeload.github.com/daac-tools/daachorse/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254310513,"owners_count":22049468,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aho-corasick","double-array","finite-state-machine","no-std","rust","search","substring-matching","text-processing"],"created_at":"2024-08-02T09:01:24.055Z","updated_at":"2025-05-15T09:03:10.845Z","avatar_url":"https://github.com/daac-tools.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"# 🐎 daachorse: Double-Array Aho-Corasick\n\nA fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.\n\n[![Crates.io](https://img.shields.io/crates/v/daachorse)](https://crates.io/crates/daachorse)\n[![Documentation](https://docs.rs/daachorse/badge.svg)](https://docs.rs/daachorse)\n[![Rust](https://img.shields.io/badge/rust-1.61%2B-blue.svg?maxAge=3600)](https://github.com/daac-tools/daachorse)\n[![Build 
Status](https://github.com/daac-tools/daachorse/actions/workflows/rust.yml/badge.svg)](https://github.com/daac-tools/daachorse/actions)\n[![Slack](https://img.shields.io/badge/join-chat-brightgreen?logo=slack)](https://join.slack.com/t/daac-tools/shared_invite/zt-1pwwqbcz4-KxL95Nam9VinpPlzUpEGyA)\n\nThe main technical ideas behind this library appear in the following paper:\n\n\u003e Shunsuke Kanda, Koichi Akabe, and Yusuke Oda.\n\u003e [Engineering faster double-array Aho-Corasick automata](https://doi.org/10.1002/spe.3190).\n\u003e *Software: Practice and Experience (SPE)*,\n\u003e 53(6): 1332–1361, 2023\n\u003e ([arXiv](https://arxiv.org/abs/2207.13870))\n\nA Python wrapper is also available [here](https://github.com/daac-tools/python-daachorse).\n\n## Overview\n\nDaachorse is a crate for fast multiple pattern matching using the\n[Aho-Corasick algorithm](https://dl.acm.org/doi/10.1145/360825.360855), which runs in time linear in\nthe length of the input text. This crate uses the\n[compact double-array data structure](https://doi.org/10.1016/j.ipm.2006.04.004) to implement\nthe pattern matching automaton efficiently in both time and memory. The data structure not only supports\nconstant-time state-to-state traversal but also represents each state in only 12\nbytes.\n\nFor example, compared to the NFA of the [aho-corasick](https://github.com/BurntSushi/aho-corasick)\ncrate, which is the most popular Aho-Corasick implementation in Rust, Daachorse can perform pattern\nmatching **3.0–5.2 times faster** while consuming **56–60% less** memory when using a word\ndictionary of 675K patterns. 
Other experimental results are available on the\n[Wiki](https://github.com/daac-tools/daachorse/wiki/Performance-Comparison).\n\n![](./figures/comparison.svg)\n\n## Requirements\n\nRust 1.61 or higher is required to build this crate.\n\n## Example usage\n\nDaachorse provides several search options, ranging from standard matching with the Aho-Corasick\nalgorithm to trickier matching. They run fast thanks to the double-array data structure\nand can be easily plugged into your application, as shown below.\n\n### Finding overlapped occurrences\n\nTo search for all occurrences of registered patterns in the input text, allowing positional overlap,\nuse `find_overlapping_iter()`. When you use `new()` for construction, the library assigns a\nunique identifier to each pattern in the input order. Each match result contains the byte positions of\nthe occurrence and the identifier of the matched pattern.\n\n```rust\nuse daachorse::DoubleArrayAhoCorasick;\n\nlet patterns = vec![\"bcd\", \"ab\", \"a\"];\nlet pma = DoubleArrayAhoCorasick::new(patterns).unwrap();\n\nlet mut it = pma.find_overlapping_iter(\"abcd\");\n\nlet m = it.next().unwrap();\nassert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));\n\nlet m = it.next().unwrap();\nassert_eq!((0, 2, 1), (m.start(), m.end(), m.value()));\n\nlet m = it.next().unwrap();\nassert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));\n\nassert_eq!(None, it.next());\n```\n\n### Finding non-overlapped occurrences with the standard matching\n\nIf you do not want to allow positional overlap, use `find_iter()` instead.\nIt performs the search on the Aho-Corasick automaton\nand reports the first pattern found in each iteration.\n\n```rust\nuse daachorse::DoubleArrayAhoCorasick;\n\nlet patterns = vec![\"bcd\", \"ab\", \"a\"];\nlet pma = DoubleArrayAhoCorasick::new(patterns).unwrap();\n\nlet mut it = pma.find_iter(\"abcd\");\n\nlet m = it.next().unwrap();\nassert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));\n\nlet m = it.next().unwrap();\nassert_eq!((1, 4, 0), 
(m.start(), m.end(), m.value()));\n\nassert_eq!(None, it.next());\n```\n\n### Finding non-overlapped occurrences with the longest matching\n\nIf you want to search for the longest pattern without positional overlap in each iteration, use\n`leftmost_find_iter()` and specify `MatchKind::LeftmostLongest` during construction.\n\n```rust\nuse daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};\n\nlet patterns = vec![\"ab\", \"a\", \"abcd\"];\nlet pma = DoubleArrayAhoCorasickBuilder::new()\n    .match_kind(MatchKind::LeftmostLongest)\n    .build(\u0026patterns)\n    .unwrap();\n\nlet mut it = pma.leftmost_find_iter(\"abcd\");\n\nlet m = it.next().unwrap();\nassert_eq!((0, 4, 2), (m.start(), m.end(), m.value()));\n\nassert_eq!(None, it.next());\n```\n\n### Finding non-overlapped occurrences with the leftmost-first matching\n\nIf you want to find the earliest registered pattern among those starting from the search position,\nuse `leftmost_find_iter()` and specify `MatchKind::LeftmostFirst` during construction.\n\nThis is the so-called *leftmost first match*, a tricky search option supported in the\n[aho-corasick](https://github.com/BurntSushi/aho-corasick) crate. 
For example, in the following\ncode, `ab` is reported because it is the earliest registered one.\n\n```rust\nuse daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};\n\nlet patterns = vec![\"ab\", \"a\", \"abcd\"];\nlet pma = DoubleArrayAhoCorasickBuilder::new()\n    .match_kind(MatchKind::LeftmostFirst)\n    .build(\u0026patterns)\n    .unwrap();\n\nlet mut it = pma.leftmost_find_iter(\"abcd\");\n\nlet m = it.next().unwrap();\nassert_eq!((0, 2, 0), (m.start(), m.end(), m.value()));\n\nassert_eq!(None, it.next());\n```\n\n### Associating arbitrary values with patterns\n\nTo build the automaton from pairs of a pattern and user-defined value, instead of assigning identifiers\nautomatically, use `with_values()`.\n\n```rust\nuse daachorse::DoubleArrayAhoCorasick;\n\nlet patvals = vec![(\"bcd\", 0), (\"ab\", 10), (\"a\", 20)];\nlet pma = DoubleArrayAhoCorasick::with_values(patvals).unwrap();\n\nlet mut it = pma.find_overlapping_iter(\"abcd\");\n\nlet m = it.next().unwrap();\nassert_eq!((0, 1, 20), (m.start(), m.end(), m.value()));\n\nlet m = it.next().unwrap();\nassert_eq!((0, 2, 10), (m.start(), m.end(), m.value()));\n\nlet m = it.next().unwrap();\nassert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));\n\nassert_eq!(None, it.next());\n```\n\n### Building faster automata on multibyte characters\n\nTo build a faster automaton on multibyte characters, use `CharwiseDoubleArrayAhoCorasick` instead.\n\nThe standard version `DoubleArrayAhoCorasick` handles strings as UTF-8 sequences and defines\ntransition labels using byte values. 
On the other hand, `CharwiseDoubleArrayAhoCorasick` uses\nUnicode code point values, which reduces the number of transitions and makes matching faster.\n\n```rust\nuse daachorse::CharwiseDoubleArrayAhoCorasick;\n\nlet patterns = vec![\"全世界\", \"世界\", \"に\"];\nlet pma = CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();\n\nlet mut it = pma.find_iter(\"全世界中に\");\n\nlet m = it.next().unwrap();\nassert_eq!((0, 9, 0), (m.start(), m.end(), m.value()));\n\nlet m = it.next().unwrap();\nassert_eq!((12, 15, 2), (m.start(), m.end(), m.value()));\n\nassert_eq!(None, it.next());\n```\n\n## `no_std`\n\nDaachorse has no dependency on `std` (but requires a global allocator with the `alloc` crate).\n\n## CLI\n\nThis repository contains a command-line interface named `daacfind` for searching patterns in text\nfiles.\n\n```\n% cat ./pat.txt\nfn\nconst fn\npub fn\nunsafe fn\n% find . -name \"*.rs\" | xargs cargo run --release -p daacfind -- --color=auto -nf ./pat.txt\n...\n...\n./src/errors.rs:67:    fn fmt(\u0026self, f: \u0026mut fmt::Formatter) -\u003e fmt::Result {\n./src/errors.rs:81:    fn fmt(\u0026self, f: \u0026mut fmt::Formatter) -\u003e fmt::Result {\n./src/lib.rs:115:    fn default() -\u003e Self {\n./src/lib.rs:126:    pub fn base(\u0026self) -\u003e Option\u003cu32\u003e {\n./src/lib.rs:131:    pub const fn check(\u0026self) -\u003e u8 {\n./src/lib.rs:136:    pub const fn fail(\u0026self) -\u003e u32 {\n...\n...\n```\n\n## FAQ\n\n* **Does this library support data types other than `str` and `[u8]`?\n  (e.g., structures implementing `Eq`.)**\n\n  Not supported. This library uses Aho-Corasick automata built with a\n  data structure called a *double-array trie*. The algorithm on this data\n  structure works with XOR operations on the input haystack. Therefore,\n  the haystack must be a sequence of integers. 
This library is specifically\n  optimized for `str` and `[u8]` among integer sequences.\n\n* **Does this library provide bindings to programming languages other\n  than Rust?**\n\n  We provide [a Python binding](https://github.com/daac-tools/python-daachorse).\n  Bindings for other programming languages are not currently planned.\n  If you are interested in writing bindings, you are welcome to do so.\n  *daachorse* is free software.\n\n## Slack\n\nWe have a Slack workspace for developers and users to ask questions and discuss a variety of topics.\n\n * https://daac-tools.slack.com/\n * Please get an invitation from [here](https://join.slack.com/t/daac-tools/shared_invite/zt-1pwwqbcz4-KxL95Nam9VinpPlzUpEGyA).\n\n## License\n\nLicensed under either of\n\n * Apache License, Version 2.0\n   ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)\n * MIT license\n   ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)\n\nat your option.\n\nIf you use this library in academic settings,\nplease cite the following paper.\n\n```\n@article{10.1002/spe.3190,\n    author = {Kanda, Shunsuke and Akabe, Koichi and Oda, Yusuke},\n    title = {Engineering faster double-array {Aho--Corasick} automata},\n    journal = {Software: Practice and Experience},\n    volume={53},\n    number={6},\n    pages={1332--1361},\n    year={2023},\n    keywords = {Aho–Corasick automata, code optimization, double-array, multiple pattern matching},\n    doi = {https://doi.org/10.1002/spe.3190},\n    url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.3190},\n    eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/spe.3190}\n}\n```\n\n## Contribution\n\nSee [the 
guidelines](./CONTRIBUTING.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaac-tools%2Fdaachorse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdaac-tools%2Fdaachorse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaac-tools%2Fdaachorse/lists"}