Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/robert-bor/aho-corasick

Java implementation of the Aho-Corasick algorithm for efficient string matching
https://github.com/robert-bor/aho-corasick

Last synced: 9 days ago
JSON representation

Java implementation of the Aho-Corasick algorithm for efficient string matching

Host: GitHub
URL: https://github.com/robert-bor/aho-corasick
Owner: robert-bor
License: apache-2.0
Created: 2014-01-29T20:17:12.000Z (almost 11 years ago)
Default Branch: master
Last Pushed: 2024-04-25T08:09:57.000Z (7 months ago)
Last Synced: 2024-08-02T12:22:53.851Z (3 months ago)
Language: Java
Size: 653 KB
Stars: 886
Watchers: 56
Forks: 348
Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE.md

Awesome Lists containing this project

README

        Aho-Corasick

============

[![Build Status](https://travis-ci.org/robert-bor/aho-corasick.svg?branch=master)](https://travis-ci.org/robert-bor/aho-corasick)

[![Codecov](https://codecov.io/gh/robert-bor/aho-corasick/branch/master/graph/badge.svg)](https://codecov.io/gh/robert-bor/aho-corasick)

[![Maven Central](https://maven-badges.herokuapp.com/maven-central/org.ahocorasick/ahocorasick/badge.svg)](https://maven-badges.herokuapp.com/maven-central/org.ahocorasick/ahocorasick)

[![Javadoc](https://javadoc.io/badge2/org.ahocorasick/ahocorasick/javadoc.svg)](https://javadoc.io/doc/org.ahocorasick/ahocorasick)

[![Apache 2](http://img.shields.io/badge/license-Apache%202-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0)

Dependency

----------

Include this dependency in your POM. Be sure to check for the latest version in Maven Central.

```xml

  org.ahocorasick

  ahocorasick

  0.6.3

```

Introduction

------------

Most free-text searching is based on Lucene-like approaches, where the

search text is parsed into its various components. For every keyword a

lookup is done to see where it occurs. When looking for a couple of keywords

this approach is great, but when searching for 100,000 words, the approach

is quite slow (for example, checking against a dictionary).

The Aho-Corasick algorithm shines when looking for multiple words.

Rather than chop up the search text, it uses all the keywords to build

a [Trie](http://en.wikipedia.org/wiki/Trie) construct. The crucial

Aho-Corasick components include:

* goto

* fail

* output

Every character encountered is presented to a state object within the

*goto* structure. If there is a matching state, that will be elevated to

the new current state.

However, if there is no matching state, the algorithm will signal a

*fail* and fall back to states with less depth (i.e., a match less long)

and proceed from there, until it found a matching state, or it has reached

the root state.

Whenever a state is reached that matches an entire keyword, it is

emitted to an *output* set which can be read after the entire scan

has completed.

The algorithm is O(n). No matter how many keywords are given, or how large

the search text is, the performance will decline linearly.

The Aho-Corasick algorithm can help:

* find words in texts to link or emphasize them;

* add semantics to plain text; or

* check against a dictionary to see if syntactic errors were made.

See the [white paper](http://cr.yp.to/bib/1975/aho.pdf) by Aho and

Corasick for algorithmic details.

Usage

-----

Set up the Trie using a builder as follows:

```java

Trie trie = Trie.builder()

    .addKeyword("hers")

    .addKeyword("his")

    .addKeyword("she")

    .addKeyword("he")

    .build();

Collection emits = trie.parseText("ushers");

```

The collection will contain `Emit` objects that match:

* "she" starting at position 1, ending at position 3

* "he" starting at position 2, ending at position 3

* "hers" starting at position 2, ending at position 5

In situations where overlapping instances are not desired, retain

the longest and left-most matches by calling `ignoreOverlaps()`:

```java

Trie trie = Trie.builder()

    .ignoreOverlaps()

    .addKeyword("hot")

    .addKeyword("hot chocolate")

    .build();

Collection emits = trie.parseText("hot chocolate");

```

The `ignoreOverlaps()` method tells the Trie to remove all overlapping

matches. For this it relies on the following conflict resolution rules:

1. longer matches prevail over shorter matches; and

1. left-most prevails over right-most.

Only one result is returned:

* "hot chocolate" starting at position 0, ending at position 12

To check for whole words exclusively, call `onlyWholeWords()` as follows:

```java

Trie trie = Trie.builder()

    .onlyWholeWords()

    .addKeyword("sugar")

    .build();

Collection emits = trie.parseText("sugarcane sugar canesugar");

```

Only one match is found; whereas, without calling `onlyWholeWords()` three

matches are found. The sugarcane/canesugar words are discarded because

they are partial matches.

Some text is `WrItTeN` in mixed case, which makes it hard to identify.

Instruct the Trie to convert the searchtext to lowercase to ease the

matching process. The lower-casing applies to keywords as well.

```java

Trie trie = Trie.builder()

    .ignoreCase()

    .addKeyword("casing")

    .build();

Collection emits = trie.parseText("CaSiNg");

```

Normally, this match would not be found. By calling `ignoreCase()`,

the entire search text is made lowercase before matching begins.

Therefore it will find exactly one match.

It is also possible to just ask whether the text matches any of

the keywords, or just to return the first match it finds.

```java

Trie trie = Trie.builder().ignoreOverlaps()

        .addKeyword("ab")

        .addKeyword("cba")

        .addKeyword("ababc")

        .build();

Emit firstMatch = trie.firstMatch("ababcbab");

```

The value for `firstMatch` will be "ababc" from position 0. The

`containsMatch()` method checks whether `firstMatch` found a match and

returns `true` if that is the case.

For a barebones Aho-Corasick algorithm with a custom emit handler use:

 

```java

Trie trie = Trie.builder()

        .addKeyword("hers")

        .addKeyword("his")

        .addKeyword("she")

        .addKeyword("he")

        .build();

final List emits = new ArrayList<>();

EmitHandler emitHandler = new EmitHandler() {

    @Override

    public void emit(Emit emit) {

        emits.add(emit);

    }

};

```

In many cases you may want to do perform tasks with both the non-matching

and the matching text. Such implementations may be better served by using

`Trie.tokenize()`. The `tokenize()` method allows looping over the

corpus to deal with matches as soon as they are encountered. Here's an

example that outputs key words as italicized HTML elements:

```java

String speech = "The Answer to the Great Question... Of Life, " +

        "the Universe and Everything... Is... Forty-two,' said " +

        "Deep Thought, with infinite majesty and calm.";

Trie trie = Trie.builder().ignoreOverlaps().onlyWholeWords().ignoreCase()

    .addKeyword("great question")

    .addKeyword("forty-two")

    .addKeyword("deep thought")

    .build();

Collection tokens = trie.tokenize(speech);

StringBuilder html = new StringBuilder();

html.append("
");

for (Token token : tokens) {

    if (token.isMatch()) {

        html.append("");

    }

    html.append(token.getFragment());

    if (token.isMatch()) {

        html.append("");

    }

}

html.append("");

System.out.println(html);

```

You can also emit custom outputs. This might for example be useful to

implement a trivial named entity recognizer. In this case use a

`PayloadTrie` instead of a `Trie` as follows:

```java

class Word {

    private final String gender;

    public Word(String gender) {

        this.gender = gender;

    }

}

PayloadTrie trie = PayloadTrie.builder()

    .addKeyword("hers", new Word("f"))

    .addKeyword("his", new Word("m"))

    .addKeyword("she", new Word("f"))

    .addKeyword("he", new Word("m"))

    .addKeyword("nonbinary", new Word("nb"))

    .addKeyword("transgender", new Word("tg"))

    .build();

Collection> emits = trie.parseText("ushers");

```

Releases

--------

See [releases](https://github.com/robert-bor/aho-corasick/releases) for details.