Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/schollz/closestmatch
Golang library for fuzzy matching within a set of strings :page_with_curl:
https://github.com/schollz/closestmatch
fuzzy-matching golang-library levenshtein string-matching
Last synced: 3 months ago
JSON representation
Golang library for fuzzy matching within a set of strings :page_with_curl:
- Host: GitHub
- URL: https://github.com/schollz/closestmatch
- Owner: schollz
- License: mit
- Created: 2017-03-30T14:43:15.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-09-13T03:39:56.000Z (about 2 years ago)
- Last Synced: 2024-06-19T12:43:32.621Z (5 months ago)
- Topics: fuzzy-matching, golang-library, levenshtein, string-matching
- Language: Go
- Homepage:
- Size: 641 KB
- Stars: 417
- Watchers: 13
- Forks: 53
- Open Issues: 10
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome - closestmatch - Golang library for fuzzy matching within a set of strings :page_with_curl: (Go)
README
# closestmatch :page_with_curl:
*closestmatch* is a simple and fast Go library for fuzzy matching an input string to a list of target strings. *closestmatch* is useful for handling input from a user where the input (which could be mispelled or out of order) needs to match a key in a database. *closestmatch* uses a [bag-of-words approach](https://en.wikipedia.org/wiki/Bag-of-words_model) to precompute character n-grams to represent each possible target string. The closest matches have highest overlap between the sets of n-grams. The precomputation scales well and is much faster and more accurate than Levenshtein for long strings.
Getting Started
===============## Install
```
go get -u -v github.com/schollz/closestmatch
```## Use
#### Create a *closestmatch* object from a list words
```golang
// Take a slice of keys, say band names that are similar
// http://www.tonedeaf.com.au/412720/38-bands-annoyingly-similar-names.htm
wordsToTest := []string{"King Gizzard", "The Lizard Wizard", "Lizzard Wizzard"}// Choose a set of bag sizes, more is more accurate but slower
bagSizes := []int{2}// Create a closestmatch object
cm := closestmatch.New(wordsToTest, bagSizes)
```#### Find the closest match, or find the *N* closest matches
```golang
fmt.Println(cm.Closest("kind gizard"))
// returns 'King Gizzard'fmt.Println(cm.ClosestN("kind gizard",3))
// returns [King Gizzard Lizzard Wizzard The Lizard Wizard]
```#### Calculate the accuracy
```golang
// Calculate accuracy
fmt.Println(cm.AccuracyMutatingWords())
// ~ 66 % (still way better than Levenshtein which hits 0% with this particular set)// Improve accuracy by adding more bags
bagSizes = []int{2, 3, 4}
cm = closestmatch.New(wordsToTest, bagSizes)
fmt.Println(cm.AccuracyMutatingWords())
// accuracy improves to ~ 76 %
```#### Save/Load
```golang
// Save your current calculated bags
cm.Save("closestmatches.gob")// Open it again
cm2, _ := closestmatch.Load("closestmatches.gob")
fmt.Println(cm2.Closest("lizard wizard"))
// prints "The Lizard Wizard"
```### Advantages
*closestmatch* is more accurate than Levenshtein for long strings (like in the test corpus).
*closestmatch* is ~20x faster than [a fast implementation of Levenshtein](https://groups.google.com/forum/#!topic/golang-nuts/YyH1f_qCZVc). Try it yourself with the benchmarks:
```bash
cd $GOPATH/src/github.com/schollz/closestmatch && go test -run=None -bench=. > closestmatch.bench
cd $GOPATH/src/github.com/schollz/closestmatch/levenshtein && go test -run=None -bench=. > levenshtein.bench
benchcmp levenshtein.bench ../closestmatch.bench
```which gives the following benchmark (on Intel i7-3770 CPU @ 3.40GHz w/ 8 processors):
```bash
benchmark old ns/op new ns/op delta
BenchmarkNew-8 1.47 1933870 +131555682.31%
BenchmarkClosestOne-8 104603530 4855916 -95.36%
```The `New()` function in *closestmatch* is so slower than *levenshtein* because there is precomputation needed.
### Disadvantages
*closestmatch* does worse for matching lists of single words, like a dictionary. For comparison:
```
$ cd $GOPATH/src/github.com/schollz/closestmatch && go test
Accuracy with mutating words in book list: 90.0%
Accuracy with mutating letters in book list: 100.0%
Accuracy with mutating letters in dictionary: 38.9%
```while levenshtein performs slightly better for a single-word dictionary (but worse for longer names, like book titles):
```
$ cd $GOPATH/src/github.com/schollz/closestmatch/levenshtein && go test
Accuracy with mutating words in book list: 40.0%
Accuracy with mutating letters in book list: 100.0%
Accuracy with mutating letters in dictionary: 64.8%
```## License
MIT