https://github.com/alextanhongpin/stringdist
String metrics function in golang (levenshtein, damerau-levenshtein, jaro, jaro-winkler and additionally bk-tree) for autocorrect
autocorrect bk-tree damerau-levenshtein edit-distance go golang jaro jaro-winkler
Last synced: 11 months ago
- Host: GitHub
- URL: https://github.com/alextanhongpin/stringdist
- Owner: alextanhongpin
- License: apache-2.0
- Created: 2018-10-27T08:43:57.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2020-04-03T03:16:57.000Z (about 6 years ago)
- Last Synced: 2024-06-21T00:15:35.819Z (about 2 years ago)
- Topics: autocorrect, bk-tree, damerau-levenshtein, edit-distance, go, golang, jaro, jaro-winkler
- Language: Go
- Homepage:
- Size: 37.1 KB
- Stars: 16
- Watchers: 4
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# stringdist
[GoDoc](http://godoc.org/github.com/alextanhongpin/stringdist)
The `stringdist` package contains several string metrics for comparing two strings: _Levenshtein distance_, _Damerau-Levenshtein_ (both _Optimal String Alignment_ (OSA) and _true_ Damerau-Levenshtein), _Jaro_ and _Jaro-Winkler_. It also includes a _BK-Tree_ that can be used for autocorrect.
## Algorithms
- __Levenshtein__: A string metric for measuring the difference between two sequences. It is the _minimum_ number of single-character edits (`insertion`, `substitution` and `deletion`) required to change one word into the other.
- __Damerau-Levenshtein__: Similar to Levenshtein, but also allows transposition of two adjacent characters. It can be computed with two different algorithms: _Optimal String Alignment_ (OSA) and _true Damerau-Levenshtein_. OSA assumes that no substring is edited more than once.
- __Jaro__: A similarity score between two words based on the number of matching characters and the number of transpositions among those matches. It ranges from 0 (no similarity) to 1 (exact match), so it is not an edit count.
- __Jaro-Winkler__: Similar to Jaro, but applies a prefix scale that gives more favourable ratings to strings that match from the beginning, up to a set prefix length.
- __BK-Tree__: A tree data structure specialized to index data in a metric space. Can be used for approximate string matching in a dictionary.
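To make the difference between Levenshtein and OSA concrete, here is a minimal sketch. The names `editDistance`, `Levenshtein` and `OSA` are hypothetical and are not taken from this package's API. The code compares runes and keeps the full dynamic-programming matrix for readability rather than speed.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editDistance is an illustrative helper (not this package's API).
// With transpose=false it computes plain Levenshtein distance; with
// transpose=true it computes the Optimal String Alignment (OSA) variant,
// which additionally allows swapping two adjacent characters.
func editDistance(a, b string, transpose bool) int {
	s, t := []rune(a), []rune(b)
	// d[i][j] is the distance between the first i runes of s
	// and the first j runes of t.
	d := make([][]int, len(s)+1)
	for i := range d {
		d[i] = make([]int, len(t)+1)
		d[i][0] = i // delete i runes
	}
	for j := range d[0] {
		d[0][j] = j // insert j runes
	}
	for i := 1; i <= len(s); i++ {
		for j := 1; j <= len(t); j++ {
			cost := 1
			if s[i-1] == t[j-1] {
				cost = 0
			}
			d[i][j] = min3(
				d[i-1][j]+1,      // deletion
				d[i][j-1]+1,      // insertion
				d[i-1][j-1]+cost, // substitution or match
			)
			// Transposition of two adjacent runes (OSA only).
			if transpose && i > 1 && j > 1 && s[i-1] == t[j-2] && s[i-2] == t[j-1] {
				if v := d[i-2][j-2] + 1; v < d[i][j] {
					d[i][j] = v
				}
			}
		}
	}
	return d[len(s)][len(t)]
}

// Levenshtein and OSA are hypothetical wrapper names for this sketch.
func Levenshtein(a, b string) int { return editDistance(a, b, false) }
func OSA(a, b string) int         { return editDistance(a, b, true) }

func main() {
	fmt.Println(Levenshtein("kitten", "sitting")) // 3
	fmt.Println(Levenshtein("ab", "ba"))          // 2
	fmt.Println(OSA("ab", "ba"))                  // 1
	fmt.Println(OSA("ca", "abc"))                 // 3
}
```

For `ca` and `abc` the OSA result is 3, whereas true Damerau-Levenshtein gives 2 (`ca` -> `ac` -> `abc`), because OSA never edits the same substring twice.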
Other algorithms to explore:
- Sift3/4 algorithm
- Soundex
- Metaphone
- Hamming Distance
- Symspell
- Linspell
## Thoughts
- Autocorrect can be implemented by combining a BK-Tree with any of the edit distance metrics (such as Levenshtein).
- The distance metric can be supplied to the BK-Tree through an interface.
- Dictionary words can first be supplied to the tree, and subsequent words can be added later through other means (syncing, streaming, pub-sub)
- The tree can be snapshotted periodically to avoid a rebuild (e.g. using `gob`). Tests should be conducted to see whether rebuilding the tree is faster than reloading the whole tree.
- Building the tree by prefix (A-Z) might result in better performance (?). How can hotspots be avoided (more words under A than under Z)?
- Can part of the tree be transmitted through the network?
- How to blacklist words that are not supposed to be searchable (e.g. profanity)?
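The idea of supplying the distance metric to a BK-Tree through an interface could look roughly like the sketch below. `Metric`, `BKTree`, `Add`, `Search` and the `Levenshtein` type are hypothetical names chosen for illustration and may not match this package's actual API. The sketch assumes the metric returns integers and obeys the triangle inequality.

```go
package main

import (
	"fmt"
	"sort"
)

// Metric is a hypothetical interface: any integer distance obeying the triangle inequality.
type Metric interface {
	Distance(a, b string) int
}

type node struct {
	word     string
	children map[int]*node // keyed by distance to word
}

// BKTree indexes words by their distance to one another.
type BKTree struct {
	metric Metric
	root   *node
}

// Add follows the edge labelled with the distance to each visited node.
func (t *BKTree) Add(word string) {
	fresh := &node{word: word, children: map[int]*node{}}
	if t.root == nil {
		t.root = fresh
		return
	}
	for cur := t.root; ; {
		d := t.metric.Distance(word, cur.word)
		if d == 0 {
			return // already indexed
		}
		if cur.children[d] == nil {
			cur.children[d] = fresh
			return
		}
		cur = cur.children[d]
	}
}

// Search returns, sorted, every indexed word within maxDist of query.
func (t *BKTree) Search(query string, maxDist int) []string {
	var out []string
	var walk func(n *node)
	walk = func(n *node) {
		d := t.metric.Distance(query, n.word)
		if d <= maxDist {
			out = append(out, n.word)
		}
		// Triangle inequality: matches can only sit under edges in [d-maxDist, d+maxDist].
		for k, child := range n.children {
			if k >= d-maxDist && k <= d+maxDist {
				walk(child)
			}
		}
	}
	if t.root != nil {
		walk(t.root)
	}
	sort.Strings(out)
	return out
}

type Levenshtein struct{} // example metric: compact two-row Levenshtein
func (Levenshtein) Distance(a, b string) int {
	s, t := []rune(a), []rune(b)
	prev, cur := make([]int, len(t)+1), make([]int, len(t)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(s); i++ {
		cur[0] = i
		for j := 1; j <= len(t); j++ {
			cur[j] = prev[j-1] // match
			if s[i-1] != t[j-1] {
				cur[j]++ // substitution
			}
			if v := prev[j] + 1; v < cur[j] {
				cur[j] = v // deletion
			}
			if v := cur[j-1] + 1; v < cur[j] {
				cur[j] = v // insertion
			}
		}
		prev, cur = cur, prev
	}
	return prev[len(t)]
}

func main() {
	tree := &BKTree{metric: Levenshtein{}}
	for _, w := range []string{"book", "books", "cake", "boo", "cape", "cart", "boon", "cook"} {
		tree.Add(w)
	}
	fmt.Println(tree.Search("bok", 1)) // [boo book]
}
```

Because the tree only depends on `Metric`, another integer-valued metric such as OSA could be swapped in without touching `Add` or `Search`. A similarity score like Jaro would first need to be converted into a proper integer distance.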
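The snapshot idea above can be tried out with the standard `encoding/gob` package. This is only a sketch under assumptions: `Node`, `Snapshot` and `Restore` are hypothetical names, and the node fields are exported because `gob` only encodes exported struct fields. It round-trips a small hand-built tree through an in-memory buffer; a real snapshot would write to a file or a network connection instead.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Node is a hypothetical BK-tree node. Its fields are exported because
// gob only encodes exported struct fields.
type Node struct {
	Word     string
	Children map[int]*Node // keyed by distance to Word
}

// Snapshot serialises the tree rooted at root into a byte slice.
func Snapshot(root *Node) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(root); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// Restore rebuilds a tree from bytes produced by Snapshot.
func Restore(data []byte) (*Node, error) {
	var root Node
	if err := gob.NewDecoder(bytes.NewReader(data)).Decode(&root); err != nil {
		return nil, err
	}
	return &root, nil
}

func main() {
	// A tiny tree built by hand; edge labels are Levenshtein distances.
	root := &Node{Word: "book", Children: map[int]*Node{
		1: {Word: "books", Children: map[int]*Node{2: {Word: "boo"}}},
		4: {Word: "cake"},
	}}

	data, err := Snapshot(root)
	if err != nil {
		panic(err)
	}
	restored, err := Restore(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.Word, restored.Children[1].Children[2].Word) // book boo
}
```

Timing `Restore` against re-inserting every dictionary word would be one way to answer whether reloading a snapshot is actually faster than rebuilding the tree.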
## References
- https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient#Javascript
- https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos#C
- https://ii.nlm.nih.gov/MTI/Details/trigram.shtml
- https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
- https://en.wikipedia.org/wiki/Bitap_algorithm
- https://lingpipe-blog.com/2006/12/13/code-spelunking-jaro-winkler-string-comparison/
- Adjustment for longer string http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=7DCFAEBBA89D749D9D901DFA621FCA31?doi=10.1.1.64.7405&rep=rep1&type=pdf
- Table 6 shows the test cases https://www.census.gov/srd/papers/pdf/rrs2006-02.pdf
- http://alias-i.com/lingpipe/demos/tutorial/stringCompare/read-me.html