{"id":22165992,"url":"https://github.com/alextanhongpin/stringdist","last_synced_at":"2025-07-26T11:32:33.970Z","repository":{"id":57491772,"uuid":"154943749","full_name":"alextanhongpin/stringdist","owner":"alextanhongpin","description":"String metrics function in golang (levenshtein, damerau-levenshtein, jaro, jaro-winkler and additionally bk-tree) for autocorrect","archived":false,"fork":false,"pushed_at":"2020-04-03T03:16:57.000Z","size":38,"stargazers_count":16,"open_issues_count":2,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-06-21T00:15:35.819Z","etag":null,"topics":["autocorrect","bk-tree","damerau-levenshtein","edit-distance","go","golang","jaro","jaro-winkler"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alextanhongpin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-10-27T08:43:57.000Z","updated_at":"2023-08-19T09:34:29.000Z","dependencies_parsed_at":"2022-09-26T19:11:05.051Z","dependency_job_id":null,"html_url":"https://github.com/alextanhongpin/stringdist","commit_stats":null,"previous_names":["alextanhongpin/go-stringdist"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alextanhongpin%2Fstringdist","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alextanhongpin%2Fstringdist/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alextanhongpin%2Fstringdist/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alextanhongpin%2Fstringdis
t/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alextanhongpin","download_url":"https://codeload.github.com/alextanhongpin/stringdist/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227673946,"owners_count":17802303,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["autocorrect","bk-tree","damerau-levenshtein","edit-distance","go","golang","jaro","jaro-winkler"],"created_at":"2024-12-02T05:17:43.867Z","updated_at":"2024-12-02T05:17:45.292Z","avatar_url":"https://github.com/alextanhongpin.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# stringdist\n\n[![](https://godoc.org/github.com/alextanhongpin/stringdist?status.svg)](http://godoc.org/github.com/alextanhongpin/stringdist)\n\nThe `stringdist` package contains several string metrics for calculating the edit distance between two strings. These include _Levenshtein distance_, _Damerau-Levenshtein_ (both _Optimal String Alignment_, OSA, and _true_ Damerau-Levenshtein), Jaro, Jaro-Winkler, and additionally a _BK-Tree_ that can be used for autocorrect.\n\n## Algorithms\n\n- __Levenshtein__: A string metric for measuring the difference between two sequences. It is the _minimum_ number of single-character edits (`insertion`, `substitution` and `deletion`) required to change one word into another.\n- __Damerau-Levenshtein__: Similar to Levenshtein, but also allows transposition of two adjacent characters. 
Can be computed with two different algorithms: _Optimal String Alignment_ (OSA) and _true Damerau-Levenshtein_. The assumption for OSA is that no substring is edited more than once.\n- __Jaro__: A similarity measure based on the number of matching characters between two words and the number of transpositions among those matches.\n- __Jaro-Winkler__: Similar to Jaro, but uses a prefix scale which gives more favourable ratings to strings that match from the beginning for a set prefix length.\n- __BK-Tree__: A tree data structure specialized to index data in a metric space. Can be used for approximate string matching in a dictionary.\n\nOther algorithms to explore:\n- Sift3/4 algorithm\n- Soundex\n- Metaphone\n- Hamming Distance\n- Symspell\n- Linspell\n\n## Thoughts\n\n- Autocorrect can be implemented using any of the distance metrics (such as Levenshtein) with a BK-Tree.\n- The distance metric can be supplied to the BK-Tree through an interface.\n- Dictionary words can first be supplied to the tree, and subsequent words can be added later through other means (syncing, streaming, pub-sub).\n- The tree can be snapshotted periodically to avoid rebuilding (e.g. using `gob`). Tests should be conducted to see whether rebuilding the tree is faster than reloading the whole tree.\n- Building the tree by prefix (A-Z) might result in better performance (?). How to avoid hotspots (more words under A than under Z)?\n- Can part of the tree be transmitted through the network?\n- How to blacklist words that are not supposed to be searchable? 
(e.g. profanity)\n\n## References\n- https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient#Javascript\n- https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos#C\n- https://ii.nlm.nih.gov/MTI/Details/trigram.shtml\n- https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance\n- https://en.wikipedia.org/wiki/Bitap_algorithm\n- https://lingpipe-blog.com/2006/12/13/code-spelunking-jaro-winkler-string-comparison/\n- Adjustment for longer strings: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=7DCFAEBBA89D749D9D901DFA621FCA31?doi=10.1.1.64.7405\u0026rep=rep1\u0026type=pdf\n- Table 6 shows the test cases: https://www.census.gov/srd/papers/pdf/rrs2006-02.pdf\n- http://alias-i.com/lingpipe/demos/tutorial/stringCompare/read-me.html\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falextanhongpin%2Fstringdist","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falextanhongpin%2Fstringdist","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falextanhongpin%2Fstringdist/lists"}