{"id":13880142,"url":"https://github.com/schneems/going_the_distance","last_synced_at":"2025-04-30T10:05:42.574Z","repository":{"id":20891253,"uuid":"24178661","full_name":"schneems/going_the_distance","owner":"schneems","description":"Distance Measurements are Awesome!","archived":false,"fork":false,"pushed_at":"2016-09-07T20:40:51.000Z","size":9,"stargazers_count":61,"open_issues_count":1,"forks_count":6,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-12-13T06:22:05.156Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"jgamblin/Mirai-Source-Code","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/schneems.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-09-18T07:38:22.000Z","updated_at":"2023-06-13T00:58:36.000Z","dependencies_parsed_at":"2022-07-07T22:51:10.667Z","dependency_job_id":null,"html_url":"https://github.com/schneems/going_the_distance","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schneems%2Fgoing_the_distance","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schneems%2Fgoing_the_distance/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schneems%2Fgoing_the_distance/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schneems%2Fgoing_the_distance/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/schneems","download_url":"https://codeload.github.com/schneems/going_the_distance/tar.gz/refs/heads/master","host":{"name":"GitH
ub","url":"https://github.com","kind":"github","repositories_count":230866254,"owners_count":18292211,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-06T08:02:48.786Z","updated_at":"2024-12-22T18:04:01.476Z","avatar_url":"https://github.com/schneems.png","language":"Ruby","funding_links":[],"categories":["Ruby","Projects and Code Examples"],"sub_categories":["Text-to-Speech-to-Text"],"readme":"## Going the Distance\n\nThis repository contains scripts that do various distance calculations.\n\n## Quicklinks\n\nThis article and code were built on the shoulders of giants. For more about Levenshtein distance, check out:\n\n- [Wikipedia: Levenshtein Distance](http://en.wikipedia.org/wiki/Levenshtein_distance)\n- [Rosetta code](http://rosettacode.org/wiki/Levenshtein_distance)\n- [Peter Norvig: How Google Spell check works](http://norvig.com/spell-correct.html)\n- [Hamming Distance](http://en.wikipedia.org/wiki/Hamming_distance)\n- [Suggest Rails Generator Names PR](https://github.com/rails/rails/pull/15497)\n- [did_you_mean gem](http://www.yukinishijima.net/2014/10/21/did-you-mean-experience-in-ruby.html)\n\nFurther reading on edit distance:\n\n- [Faster Edit distance (using tries)](http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/)\n- [Jaro-Winkler Distance](https://github.com/tonytonyjan/jaro_winkler)\n\nIf you want to learn more about algorithms, check out:\n\n- [Khan Academy Algorithms](https://www.khanacademy.org/computing/computer-science/algorithms)\n\n## Word Distance\n\nCalculate \"distance\" between two words where distance is the 
\"cost\" it would take to change word B into word A.\n\n## Dirty\n\nIf characters do not match, change them.\n\nRun:\n\n```\n$ ruby lib/dirty_distance.rb far foo\n# =\u003e 2\n```\n\n\nThis is very quick and requires O(n) comparisons (it only iterates over the first string). However, it does not take into account every possible way to modify a word. In addition to changing letters, we can also delete and insert letters.\n\nFor example, running this algorithm against `saturday` and `sunday` you would expect a small number, since they both start with `s` and share the substring `day`, but:\n\n```\n$ ruby lib/dirty_distance.rb saturday sunday\n# =\u003e 7\n```\n\nOuch. For a more accurate algorithm you can use Levenshtein.\n\n\n### Algorithm\n\n\n```\ndef distance(str1, str2)\n  cost = 0\n  str1.each_char.with_index do |char, index|\n    cost += 1 if str2[index] != char\n  end\n  cost\nend\n```\n\n## Levenshtein\n\nThe three operations we check for are insertion (adding an extra character), deletion (removing a character), and substitution (changing one character for another). The difficulty of calculating deletion and insertion is that the lengths of our strings change. Even so, this is the basic logic.\n\nWe can use this logic, iterate over each set of characters, and look for the following scenarios.\n\n- Match:\n\nTwo characters match each other, so the distance is zero.\n\n```\ndistance(\"s\", \"s\")\n# =\u003e 0\n```\n\nMove on, nothing to see here.\n\n- Deletion:\n\nIf removing the current character in the provided string makes it match the next character, this means that we should delete the character.\n\n```\ndistance(\"schneems\", \"zschneems\")\n# =\u003e 1\n```\n\nAnother way to look at this is we compare the substring `\"zschneems\"[1..-1]` and see if it matches the first string. 
If it does, bingo.\n\n- Insertion:\n\nIf removing a character from the target string makes it match the next character (or the whole substring), this means we should add a character.\n\n```\ndistance(\"schneems\", \"chneems\")\n```\n\nHere `\"schneems\"[1..-1]` matches `\"chneems\"` so we should insert a character.\n\n- Substitution:\n\nIf a character does not qualify for deletion or insertion, and is not a match, then by definition we must substitute a character. Another way to look at this is:\n\n```\ndistance(\"zchneems\", \"schneems\")\n```\n\nIf the first characters do not match (`\"z\" != \"s\"`) but the substrings do:\n\n```\n\"zchneems\"[1..-1] == \"schneems\"[1..-1]\n# =\u003e true\n```\n\nThen you've got a substitution on your hands.\n\nTo change \"sunday\" into \"saturday\" we can combine insertion and substitution.\n\nFirst we INSERT \"at\" after the \"s\":\n\n```\n\"sunday\" =\u003e \"satunday\"\n```\n\nNow we SUBSTITUTE \"r\" for the \"n\":\n\n```\n\"satunday\" =\u003e \"saturday\"\n```\n\nBoom, only 3 changes (two insertions and one substitution) get us the desired result. The distance between the two words is now 3. Previously the \"dirty\" method calculated it was 7, which was pretty far off.\n\n## Levenshtein - Recursive\n\nWith these rules in mind we can get a more accurate result. But how do we calculate this? We can compare substrings for every possible combination of edits. To do this we can use a recursive algorithm.\n\nThe recursive algorithm is simple but slow. For comparing `sunday` to `saturday` it takes 1647 comparisons. 
The value is that it is accurate: the \"dirty\" implementation only took 7 iterations, but it also produced an incorrect result.\n\n\n```\n$ ruby lib/levenshtein_recusive.rb saturday sunday\n# =\u003e 3\n```\n\n### Algorithm\n\n\n```\ndef distance(str1, str2)\n  return str2.length if str1.empty?\n  return str1.length if str2.empty?\n\n  return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match\n  l1 = distance(str1, str2[1..-1])          # deletion\n  l2 = distance(str1[1..-1], str2)          # insertion\n  l3 = distance(str1[1..-1], str2[1..-1])   # substitution\n  return 1 + [l1,l2,l3].min                 # increment cost\nend\n```\n\nIf either of the strings is empty, then the distance between the two is the length of the other.\n\n```\n  return str2.length if str1.empty?\n  return str1.length if str2.empty?\n```\n\nIf the first characters match, we only need to know the distance between the remaining substrings:\n\n```\n  return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match\n```\n\nIf we get past this point, we know the first characters did not match. Now we can look at what the distance would be if we deleted one character:\n\n```\n  l1 = distance(str1, str2[1..-1])          # deletion\n```\n\nThe distance if we add a character:\n\n```\n  l2 = distance(str1[1..-1], str2)          # insertion\n```\n\nAnd the distance if we substitute a character:\n\n```\n  l3 = distance(str1[1..-1], str2[1..-1])   # substitution\n```\n\nFinally, we figure out which of these methods was the cheapest, add one to it (to account for this iteration), and return that value.\n\nIt's a bit confusing to totally wrap your head around, but going back to the deletion/insertion/substitution examples above helps.\n\n## Levenshtein - Matrix\n\nSo the dirty version is fast but not accurate, and the recursive version is accurate but not fast. If we look closely at the recursive algorithm, it looks like we're comparing different versions of substrings. 
We're also using these substring calculations in our own calculations. Unfortunately we're re-calculating the same comparisons over and over again. For example, this comparison could easily be cached:\n\n```\nComparing 'day' 'ay'\n```\n\nSince we're dealing with two string lengths, we can cache distance calculations in a matrix.\n\nThe result will be O(n*m) calculations where n and m are the lengths of the two strings.\n\n```\n$ levenshtein_matrix saturday sunday\n# =\u003e 3\n```\n\nIf we wanted to change the word `sunday` to a blank string `\"\"` the matrix would look like this:\n\n\n```\n+---+---+\n|   |   |\n+---+---+\n|   | 0 |\n+---+---+\n| S | 1 |\n+---+---+\n| U | 2 |\n+---+---+\n| N | 3 |\n+---+---+\n| D | 4 |\n+---+---+\n| A | 5 |\n+---+---+\n| Y | 6 |\n+---+---+\n```\n\nIt would take 6 deletions to turn `sunday` into `\"\"`. Similarly with saturday.\n\n\n```\n+---+---+---+---+---+---+---+---+---+---+\n|   |   | S | A | T | U | R | D | A | Y |\n+---+---+---+---+---+---+---+---+---+---+\n|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |\n+---+---+---+---+---+---+---+---+---+---+\n```\n\nIt would take 8 deletions to turn `saturday` into `\"\"`. Now we can add one character at a time to see how the rest of the algorithm will work.\n\n\n### Skip - Step\n\nTo turn `s` into `saturday` we don't need to do anything for the first character, since `s` matches `s`, so we can skip the step:\n\n\n```\n+---+---+---+\n|   |   | S |\n+---+---+---+\n|   | 0 | 1 |\n+---+---+---+\n| S | 1 | 0 |\n+---+---+---+\n```\n\nThis skip will cost the same as changing the previous character `\"\"` (blank) to the previous target character (also blank):\n\n```\nrow_index = 1\ncolumn_index = 1\nmatrix[row_index - 1][column_index - 1]\n# =\u003e 0\n```\n\n### Insertion - Step\n\nNow we add the next target letter. 
To change `s` to `sa` we need to perform an insertion.\n\n```\n+---+---+---+---+\n|   |   | S | A |\n+---+---+---+---+\n|   | 0 | 1 | 2 |\n+---+---+---+---+\n| S | 1 | 0 |   |\n+---+---+---+---+\n```\n\n\nAnother way to look at an insertion is that it will cost the same as if we were targeting `s` instead of `sa`, plus one. We can calculate the cost for an insertion by looking at the same row, previous column, then adding one.\n\n```\nrow_index = 1\ncolumn_index = 2\nmatrix[row_index][column_index - 1] + 1\n# =\u003e 1\n```\n\n\nWe continue with insertions for the rest of the row:\n\n\n```\n+---+---+---+---+---+---+---+---+---+---+\n|   |   | S | A | T | U | R | D | A | Y |\n+---+---+---+---+---+---+---+---+---+---+\n|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |\n+---+---+---+---+---+---+---+---+---+---+\n| S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |\n+---+---+---+---+---+---+---+---+---+---+\n```\n\nSo the total cost of changing `s` into `saturday` would be `7`. Next character: let's change `su` into `sa`.\n\n\n\n\n### Deletion - Step\n\n\nIntuitively we can see that turning `su` into `s` would take 1 change, a deletion. How would we calculate the cost for a deletion?\n\n```\n+---+---+---+\n|   |   | S |\n+---+---+---+\n|   | 0 | 1 |\n+---+---+---+\n| S | 1 | 0 |\n+---+---+---+\n| U | 2 |   |\n+---+---+---+\n```\n\n\nIf we delete `u` then the cost of changing `su` into `s` is the same as changing `s` into `s`, plus 1 (to account for the deletion action). We already have this information stored in our matrix. 
We need to get the value of the same column but the previous row and add one to it.\n\n```\nrow_index = 2\ncolumn_index = 1\nmatrix[row_index - 1][column_index] + 1\n# =\u003e 1\n```\n\nThe cost to change `su` to `s` would be 1 if we delete `u`.\n\n\n### Substitution - Step\n\nTo change `su` into `sa` we can substitute the `u` for an `a`.\n\n```\n+---+---+---+---+\n|   |   | S | A |\n+---+---+---+---+\n|   | 0 | 1 | 2 |\n+---+---+---+---+\n| S | 1 | 0 | 1 |\n+---+---+---+---+\n| U | 2 | 1 |   |\n+---+---+---+---+\n```\n\nIf we are substituting a character, the cost would be the same as the cost for the previous strings (not including the current characters) plus 1. That previous cost is stored in the previous row and previous column.\n\n```\nrow_index = 2\ncolumn_index = 2\nmatrix[row_index - 1][column_index - 1] + 1\n# =\u003e 1\n```\n\n\n### Algorithm\n\nWe can now calculate the cost for a deletion, substitution, and insertion. If we calculate all three, the best choice will be the lowest value. We can iterate over the entire matrix until we have calculated every value:\n\n\n```\n+---+---+---+---+---+---+---+---+---+---+\n|   |   | S | A | T | U | R | D | A | Y |\n+---+---+---+---+---+---+---+---+---+---+\n|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |\n+---+---+---+---+---+---+---+---+---+---+\n| S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |\n+---+---+---+---+---+---+---+---+---+---+\n| U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |\n+---+---+---+---+---+---+---+---+---+---+\n| N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |\n+---+---+---+---+---+---+---+---+---+---+\n| D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |\n+---+---+---+---+---+---+---+---+---+---+\n| A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |\n+---+---+---+---+---+---+---+---+---+---+\n| Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |\n+---+---+---+---+---+---+---+---+---+---+\n```\n\nTo get the cost of changing one string into another, we look at the cell indexed by the two string lengths:\n\n```\nstring1 = \"sunday\"\nstring2 = \"saturday\"\nmatrix[string1.length][string2.length]\n# =\u003e 3\n```\n\nThe neat thing is we 
don't have to re-calculate any substrings. For example\n\n```\nstring1 = \"sun\"\nstring2 = \"sat\"\nmatrix[string1.length][string2.length]\n# =\u003e 2\n```\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschneems%2Fgoing_the_distance","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fschneems%2Fgoing_the_distance","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschneems%2Fgoing_the_distance/lists"}