Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/daac-tools/daachorse
π A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
https://github.com/daac-tools/daachorse
aho-corasick double-array finite-state-machine no-std rust search substring-matching text-processing
Last synced: about 6 hours ago
JSON representation
π A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
- Host: GitHub
- URL: https://github.com/daac-tools/daachorse
- Owner: daac-tools
- License: apache-2.0
- Created: 2021-09-02T02:24:58.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-12-15T12:17:38.000Z (2 months ago)
- Last Synced: 2024-12-17T09:44:05.847Z (about 2 months ago)
- Topics: aho-corasick, double-array, finite-state-machine, no-std, rust, search, substring-matching, text-processing
- Language: Rust
- Homepage: https://docs.rs/daachorse
- Size: 3.71 MB
- Stars: 206
- Watchers: 4
- Forks: 14
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE-APACHE
Awesome Lists containing this project
README
# π daachorse: Double-Array Aho-Corasick
A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.
[![Crates.io](https://img.shields.io/crates/v/daachorse)](https://crates.io/crates/daachorse)
[![Documentation](https://docs.rs/daachorse/badge.svg)](https://docs.rs/daachorse)
[![Rust](https://img.shields.io/badge/rust-1.61%2B-blue.svg?maxAge=3600)](https://github.com/daac-tools/daachorse)
[![Build Status](https://github.com/daac-tools/daachorse/actions/workflows/rust.yml/badge.svg)](https://github.com/daac-tools/daachorse/actions)
[![Slack](https://img.shields.io/badge/join-chat-brightgreen?logo=slack)](https://join.slack.com/t/daac-tools/shared_invite/zt-1pwwqbcz4-KxL95Nam9VinpPlzUpEGyA)The main technical ideas behind this library appear in the following paper:
> Shunsuke Kanda, Koichi Akabe, and Yusuke Oda.
> [Engineering faster double-array Aho-Corasick automata](https://doi.org/10.1002/spe.3190).
> *Software: Practice and Experience (SPE)*,
> 53(6): 1332β1361, 2023
> ([arXiv](https://arxiv.org/abs/2207.13870))A Python wrapper is also available [here](https://github.com/daac-tools/python-daachorse).
## Overview
Daachorse is a crate for fast multiple pattern matching using the
[Aho-Corasick algorithm](https://dl.acm.org/doi/10.1145/360825.360855), running in linear time over
the length of the input text. This crate uses the
[compact double-array data structure](https://doi.org/10.1016/j.ipm.2006.04.004) for implementing
the pattern match automaton for time and memory efficiency. The data structure not only supports
constant-time state-to-state traversal but also represents each state in the space of only 12
bytes.For example, compared to the NFA of the [aho-corasick](https://github.com/BurntSushi/aho-corasick)
crate, which is the most popular Aho-Corasick implementation in Rust, Daachorse can perform pattern
matching **3.0β5.2 times faster** while consuming **56β60% smaller** memory when using a word
dictionary of 675K patterns. Other experimental results are available on
[Wiki](https://github.com/daac-tools/daachorse/wiki/Performance-Comparison).![](./figures/comparison.svg)
## Requirements
Rust 1.61 or higher is required to build this crate.
## Example usage
Daachorse contains some search options, ranging from standard matching with the Aho-Corasick
algorithm to trickier matching. They will run very fast based on the double-array data structure
and can be easily plugged into your application, as shown below.### Finding overlapped occurrences
To search for all occurrences of registered patterns that allow for positional overlap in the input
text, use `find_overlapping_iter()`. When you use `new()` for construction, the library assigns a
unique identifier to each pattern in the input order. The match result has the byte positions of
the occurrence and its identifier.```rust
use daachorse::DoubleArrayAhoCorasick;let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();let mut it = pma.find_overlapping_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));let m = it.next().unwrap();
assert_eq!((0, 2, 1), (m.start(), m.end(), m.value()));let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));assert_eq!(None, it.next());
```### Finding non-overlapped occurrences with the standard matching
If you do not want to allow positional overlap, use `find_iter()` instead.
It performs the search on the Aho-Corasick automaton
and reports patterns first found in each iteration.```rust
use daachorse::DoubleArrayAhoCorasick;let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();let mut it = pma.find_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));assert_eq!(None, it.next());
```### Finding non-overlapped occurrences with the longest matching
If you want to search for the longest pattern without positional overlap in each iteration, use
`leftmost_find_iter()` with specifying `MatchKind::LeftmostLongest` in the construction.```rust
use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};let patterns = vec!["ab", "a", "abcd"];
let pma = DoubleArrayAhoCorasickBuilder::new()
.match_kind(MatchKind::LeftmostLongest)
.build(&patterns)
.unwrap();let mut it = pma.leftmost_find_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 4, 2), (m.start(), m.end(), m.value()));assert_eq!(None, it.next());
```### Finding non-overlapped occurrences with the leftmost-first matching
If you want to find the earliest registered pattern among ones starting from the search position,
use `leftmost_find_iter()` with specifying `MatchKind::LeftmostFirst`.This is the so-called *leftmost first match*, a tricky search option supported in the
[aho-corasick](https://github.com/BurntSushi/aho-corasick) crate. For example, in the following
code, `ab` is reported because it is the earliest registered one.```rust
use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};let patterns = vec!["ab", "a", "abcd"];
let pma = DoubleArrayAhoCorasickBuilder::new()
.match_kind(MatchKind::LeftmostFirst)
.build(&patterns)
.unwrap();let mut it = pma.leftmost_find_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 2, 0), (m.start(), m.end(), m.value()));assert_eq!(None, it.next());
```### Associating arbitrary values with patterns
To build the automaton from pairs of a pattern and user-defined value, instead of assigning identifiers
automatically, use `with_values()`.```rust
use daachorse::DoubleArrayAhoCorasick;let patvals = vec![("bcd", 0), ("ab", 10), ("a", 20)];
let pma = DoubleArrayAhoCorasick::with_values(patvals).unwrap();let mut it = pma.find_overlapping_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 1, 20), (m.start(), m.end(), m.value()));let m = it.next().unwrap();
assert_eq!((0, 2, 10), (m.start(), m.end(), m.value()));let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));assert_eq!(None, it.next());
```### Building faster automata on multibyte characters
To build a faster automaton on multibyte characters, use `CharwiseDoubleArrayAhoCorasick` instead.
The standard version `DoubleArrayAhoCorasick` handles strings as UTF-8 sequences and defines
transition labels using byte values. On the other hand, `CharwiseDoubleArrayAhoCorasick` uses
Unicode code point values, reducing the number of transitions and faster matching.```rust
use daachorse::CharwiseDoubleArrayAhoCorasick;let patterns = vec!["ε ¨δΈη", "δΈη", "γ«"];
let pma = CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();let mut it = pma.find_iter("ε ¨δΈηδΈγ«");
let m = it.next().unwrap();
assert_eq!((0, 9, 0), (m.start(), m.end(), m.value()));let m = it.next().unwrap();
assert_eq!((12, 15, 2), (m.start(), m.end(), m.value()));assert_eq!(None, it.next());
```## `no_std`
Daachorse has no dependency on `std` (but requires a global allocator with the `alloc` crate).
## CLI
This repository contains a command-line interface named `daacfind` for searching patterns in text
files.```
% cat ./pat.txt
fn
const fn
pub fn
unsafe fn
% find . -name "*.rs" | xargs cargo run --release -p daacfind -- --color=auto -nf ./pat.txt
...
...
./src/errors.rs:67: fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
./src/errors.rs:81: fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
./src/lib.rs:115: fn default() -> Self {
./src/lib.rs:126: pub fn base(&self) -> Option {
./src/lib.rs:131: pub const fn check(&self) -> u8 {
./src/lib.rs:136: pub const fn fail(&self) -> u32 {
...
...
```## FAQ
* **Does this library support data types other than `str` and `[u8]`?
(e.g., structures implementing `Eq`.)**Not supported. This library uses Aho-Corasick automata built with a
data structure called *double-array trie*. The algorithm on this data
structure works with XOR operations on the input haystack. Therefore,
the haystack must be a sequence of integers. This library is specially
optimized for `str` and `[u8]` among integer sequences.* **Does this library provide bindings to programming languages other
than Rust?**We are providing [a Python binding](https://github.com/daac-tools/python-daachorse).
Other programming languages are not currently planned to be supported.
If you are interested in writing bindings, you are welcome to do so.
*daachorse* is free software.## Slack
We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.
* https://daac-tools.slack.com/
* Please get an invitation from [here](https://join.slack.com/t/daac-tools/shared_invite/zt-1pwwqbcz4-KxL95Nam9VinpPlzUpEGyA).## License
Licensed under either of
* Apache License, Version 2.0
([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license
([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)at your option.
If you use this library in academic settings,
please cite the following paper.```
@article{10.1002/spe.3190,
author = {Kanda, Shunsuke and Akabe, Koichi and Oda, Yusuke},
title = {Engineering faster double-array {Aho--Corasick} automata},
journal = {Software: Practice and Experience},
volume={53},
number={6},
pages={1332--1361},
year={2023},
keywords = {AhoβCorasick automata, code optimization, double-array, multiple pattern matching},
doi = {https://doi.org/10.1002/spe.3190},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.3190},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/spe.3190}
}
```## Contribution
See [the guidelines](./CONTRIBUTING.md).