https://github.com/raphamorim/fuzzy

fuzzy matching with Levenshtein, Damerau-Levenshtein, Bitap and n-gram
https://github.com/raphamorim/fuzzy

Last synced: 7 months ago
JSON representation

fuzzy matching with Levenshtein, Damerau-Levenshtein, Bitap and n-gram

Host: GitHub
URL: https://github.com/raphamorim/fuzzy
Owner: raphamorim
License: gpl-3.0
Created: 2025-07-26T18:16:14.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-31T13:50:51.000Z (12 months ago)
Last Synced: 2026-01-12T05:50:52.115Z (7 months ago)
Language: Go
Size: 3.23 MB
Stars: 21
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Fuzzy

A comprehensive fuzzy search library implementing state-of-the-art algorithms and data structures for approximate string matching, text search, and similarity detection.

## Features

### Distance Metrics

- **Levenshtein Distance**: Classic edit distance algorithm

- **Damerau-Levenshtein Distance**: Supports transpositions

- **Myers' Algorithm**: Efficient diff algorithm for edit distance

### Data Structures

- **BK-Tree**: Metric tree for efficient similarity search

- **Suffix Array**: For substring search and pattern matching

- **FM-Index**: Compressed full-text index based on Burrows-Wheeler Transform

### Indexing Methods

- **N-gram Indexing**: Character n-gram based search with Jaccard similarity

- **Trigram Index**: Optimized 3-gram indexing for approximate matching

- **Q-gram Distance**: Distance metric based on q-gram profiles

### Locality-Sensitive Hashing

- **MinHash LSH**: For finding similar documents

- **SimHash**: Near-duplicate detection using hamming distance

### Advanced Algorithms

- **Wu-Manber**: Bit-parallel approximate string matching

- **Ukkonen A***: A* search for approximate pattern matching

- **Bitonic Sort**: Parallel-friendly sorting for fuzzy matches

![Benchmark](benchmark/plot/scaling_performance.png)

## Installation

```bash

go get github.com/raphamorim/fuzzy

```

## Usage Examples

### BK-Tree for Spell Checking

```go

tree := fuzzy.NewBKTree()

words := []string{"book", "books", "cake", "boo", "boon", "cook"}

for _, word := range words {

    tree.Add(word)

}

// Find words within edit distance 2 of "bok"

suggestions := tree.Search("bok", 2)

// Returns: ["book", "boo", "cook"]

```

### N-gram Search

```go

ng := fuzzy.NewNGram(3)

ng.Add("hello world", 0)

ng.Add("hello there", 1)

ng.Add("goodbye world", 2)

// Search with similarity threshold

results := ng.Search("helo wrld", 0.3)

// Returns indices of matching documents

```

### LSH for Document Similarity

```go

lsh := fuzzy.NewLSH(10, 5, 3) // 10 hash tables, 5 hash functions, 3-shingles

lsh.Add("The quick brown fox jumps over the lazy dog")

lsh.Add("The fast brown fox jumps over the lazy cat")

similar := lsh.Query("The quick brown fox leaps over the lazy dog", 0.7)

// Returns indices of similar documents

```

### Wu-Manber Approximate Search

```go

wm := fuzzy.NewWuManber("pattern")

text := "This text contains a patern with an error"

matches := wm.Search(text, 1) // Allow 1 error

for _, match := range matches {

    fmt.Printf("Found at position %d-%d with %d errors\n", 

        match.Start, match.End, match.Distance)

}

```

## Benchmarks

Run benchmarks with:

```bash

go test -bench=. ./...

```

## Algorithm Comparison

### Key Differences

| Algorithm | Type | Best Use Case | Strengths | Limitations |

|-----------|------|---------------|-----------|-------------|

| **BK-Tree** | Tree-based | Spell checking, typo correction | Exact edit distance, efficient for small distances | Slower for large edit distances |

| **LSH** | Hash-based | Near-duplicate detection, document similarity | Very fast queries, probabilistic | Approximate results, needs tuning |

| **Suffix Array** | Index-based | Substring search, pattern matching | Exact matches, handles large texts | High memory usage, slow updates |

| **Wu-Manber** | Bit-parallel | Pattern search with errors | Fast streaming search, configurable errors | Limited pattern length |

| **N-gram** | Statistical | Partial matching, similarity scoring | Flexible similarity metrics, language-agnostic | Less precise than edit distance |

### Algorithm Complexity

| Algorithm | Build Time | Search Time | Space |

|-----------|------------|-------------|-------|

| BK-Tree | O(n log n) | O(log n) | O(n) |

| Suffix Array | O(n log n) | O(log n + m) | O(n) |

| FM-Index | O(n log n) | O(m) | O(n) |

| N-gram Index | O(n) | O(m + k) | O(n) |

| LSH | O(n) | O(1) | O(n) |

| Wu-Manber | O(m) | O(n) | O(m) |

Where:

- n = text/corpus size

- m = pattern size

- k = number of results

## Comparison with Other Libraries

- **sahilm/fuzzy**: Uses a simple character-by-character matching algorithm optimized for editor-style filename matching

- **lithammer/fuzzysearch**: Implements basic fuzzy matching with Levenshtein distance ranking

This library implements advanced algorithms not found in those libraries:

- BK-trees and other metric tree structures

- Suffix arrays and FM-indexes for large text search

- N-gram based indexing

- Locality-sensitive hashing approaches

- Advanced edit distance algorithms like Myers' algorithm and Wu-Manber

## License

GPL-3.0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/raphamorim/fuzzy

Awesome Lists containing this project

README