{"id":13484536,"url":"https://github.com/tonytonyjan/jaro_winkler","last_synced_at":"2025-05-15T14:02:08.837Z","repository":{"id":20463426,"uuid":"23740784","full_name":"tonytonyjan/jaro_winkler","owner":"tonytonyjan","description":"Ruby \u0026 C implementation of Jaro-Winkler distance algorithm which supports UTF-8 string.","archived":false,"fork":false,"pushed_at":"2025-05-11T13:58:16.000Z","size":207,"stargazers_count":200,"open_issues_count":10,"forks_count":33,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-05-13T14:16:38.282Z","etag":null,"topics":["algorithm","jaro-winkler","jaro-winkler-distance","ruby"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tonytonyjan.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-09-06T17:40:22.000Z","updated_at":"2025-05-11T13:58:19.000Z","dependencies_parsed_at":"2024-05-01T13:19:58.404Z","dependency_job_id":"b825911b-da48-40cf-9d29-10b0eb5c7144","html_url":"https://github.com/tonytonyjan/jaro_winkler","commit_stats":{"total_commits":240,"total_committers":12,"mean_commits":20.0,"dds":0.0708333333333333,"last_synced_commit":"ec51b6e2969b2434fc157f3987db60566825e72b"},"previous_names":[],"tags_count":33,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonytonyjan%2Fjaro_winkler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonytonyjan%2Fjaro_winkler/tags","releases_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/tonytonyjan%2Fjaro_winkler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonytonyjan%2Fjaro_winkler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tonytonyjan","download_url":"https://codeload.github.com/tonytonyjan/jaro_winkler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254035600,"owners_count":22003593,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","jaro-winkler","jaro-winkler-distance","ruby"],"created_at":"2024-07-31T17:01:25.794Z","updated_at":"2025-05-15T14:02:08.808Z","avatar_url":"https://github.com/tonytonyjan.png","language":"Ruby","funding_links":[],"categories":["Scientific","Ruby"],"sub_categories":[],"readme":"![test](https://github.com/tonytonyjan/jaro_winkler/actions/workflows/test.yml/badge.svg)\n\n[jaro_winkler](https://rubygems.org/gems/jaro_winkler) is an implementation of the [Jaro-Winkler similarity](http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) algorithm. It is written as a C extension and falls back to a pure Ruby version on platforms other than MRI/KRI, such as JRuby or Rubinius. 
**Both the C and Ruby implementations support any string encoding, such as UTF-8, EUC-JP, Big5, etc.**\n\n# Installation\n\n```\ngem install jaro_winkler\n```\n\n# Usage\n\n```ruby\nrequire 'jaro_winkler'\n\n# Jaro Winkler Similarity\n\nJaroWinkler.similarity \"MARTHA\", \"MARHTA\"\n# =\u003e 0.9611\nJaroWinkler.similarity \"MARTHA\", \"marhta\", ignore_case: true\n# =\u003e 0.9611\nJaroWinkler.similarity \"MARTHA\", \"MARHTA\", weight: 0.2\n# =\u003e 0.9778\n\n# Jaro Similarity\n\nJaroWinkler.jaro_similarity \"MARTHA\", \"MARHTA\"\n# =\u003e 0.9444444444444445\n```\n\nThere is no `JaroWinkler.jaro_winkler_similarity`; the name is tediously long.\n\n## Options\n\nName        | Type    | Default | Note\n----------- | ------  | ------- | ------------------------------------------------------------------------------------------------------------\nignore_case | boolean | false   | All lowercase characters are converted to uppercase before the comparison.\nweight      | number  | 0.1     | A constant scaling factor for how much the score is adjusted upwards for a common prefix.\nthreshold   | number  | 0.7     | The prefix bonus is only added when the compared strings have a Jaro similarity above the threshold.\nadj_table   | boolean | false   | Gives partial credit for characters that may differ because of known phonetic or character recognition errors. 
A typical example is to match the letter \"O\" with the number \"0\".\n\n# Adjusting Table\n\n## Default Table\n\n```\n['A', 'E'], ['A', 'I'], ['A', 'O'], ['A', 'U'], ['B', 'V'], ['E', 'I'], ['E', 'O'], ['E', 'U'], ['I', 'O'], ['I', 'U'],\n['O', 'U'], ['I', 'Y'], ['E', 'Y'], ['C', 'G'], ['E', 'F'], ['W', 'U'], ['W', 'V'], ['X', 'K'], ['S', 'Z'], ['X', 'S'],\n['Q', 'C'], ['U', 'V'], ['M', 'N'], ['L', 'I'], ['Q', 'O'], ['P', 'R'], ['I', 'J'], ['2', 'Z'], ['5', 'S'], ['8', 'B'],\n['1', 'I'], ['1', 'L'], ['0', 'O'], ['0', 'Q'], ['C', 'K'], ['G', 'J'], ['E', ' '], ['Y', ' '], ['S', ' ']\n```\n\n## How does it work?\n\nOriginal Formula:\n\n![origin](https://chart.googleapis.com/chart?cht=tx\u0026chs\u0026chl=%5Cbegin%7Bcases%7D0%26%7B%5Ctext%7Bif%20%7Dm%3D0%7D%5C%5C%5Cfrac%7B1%7D%7B3%7D(%5Cfrac%7Bm%7D%7B%5Cleft%7Cs1%5Cright%7C%7D%2B%5Cfrac%7Bm%7D%7B%5Cleft%7Cs2%5Cright%7C%7D%2B%5Cfrac%7Bm-t%7D%7Bm%7D)%26%5Ctext%7Bothers%7D%5Cend%7Bcases%7D)\n\nwhere\n\n- `m` is the number of matching characters.\n- `t` is half the number of transpositions.\n\nWith Adjusting Table:\n\n![adj](https://chart.googleapis.com/chart?cht=tx\u0026chs\u0026chl=%5Cbegin%7Bcases%7D0%26%5Ctext%7Bif%20%7Dm%3D0%5C%5C%5Cfrac%7B1%7D%7B3%7D(%5Cfrac%7B%5Cfrac%7Bs%7D%7B10%7D%2Bm%7D%7B%5Cleft%7Cs1%5Cright%7C%7D%2B%5Cfrac%7B%5Cfrac%7Bs%7D%7B10%7D%2Bm%7D%7B%5Cleft%7Cs2%5Cright%7C%7D%2B%5Cfrac%7Bm-t%7D%7Bm%7D)%26%5Ctext%7Bothers%7D%5Cend%7Bcases%7D)\n\nwhere\n\n- `s` is the number of nonmatching but similar characters.\n\n# Why This?\n\nThere is a similar gem named [fuzzy-string-match](https://github.com/kiyoka/fuzzy-string-match), which also provides both C and Ruby versions.\n\nI reinvented this wheel because naming in `fuzzy-string-match`, such as `getDistance`, breaks convention, and it contains some odd code like `a1 = s1.split( // )` (`s1.chars` would be better). Furthermore, it is buggy (see the tables below).\n\n# Comparison with other gems\n\n|                 | jaro_winkler | fuzzystringmatch | hotwater | 
amatch  |\n|-----------------|--------------|------------------|----------|---------|\n| Encoding Support| **Yes**      | Pure Ruby only   | No       | No      |\n| Windows Support | **Yes**      | ?                | No       | **Yes** |\n| Adjusting Table | **Yes**      | No               | No       | No      |\n| Native          | **Yes**      | **Yes**          | **Yes**  | **Yes** |\n| Pure Ruby       | **Yes**      | **Yes**          | No       | No      |\n| Speed           | **1st**      | 3rd              | 2nd      | 4th     |\n\nI made a table below to compare accuracy between each gem:\n\nstr_1      | str_2      | origin | jaro_winkler | fuzzystringmatch | hotwater | amatch\n---        | ---        | ---    | ---          | ---              | ---      | ---\n\"henka\"    | \"henkan\"   | 0.9667 | 0.9667       | **0.9722**       | 0.9667   | **0.9444**\n\"al\"       | \"al\"       | 1.0    | 1.0          | 1.0              | 1.0      | 1.0\n\"martha\"   | \"marhta\"   | 0.9611 | 0.9611       | 0.9611           | 0.9611   | **0.9444**\n\"jones\"    | \"johnson\"  | 0.8324 | 0.8324       | 0.8324           | 0.8324   | **0.7905**\n\"abcvwxyz\" | \"cabvwxyz\" | 0.9583 | 0.9583       | 0.9583           | 0.9583   | 0.9583\n\"dwayne\"   | \"duane\"    | 0.84   | 0.84         | 0.84             | 0.84     | **0.8222**\n\"dixon\"    | \"dicksonx\" | 0.8133 | 0.8133       | 0.8133           | 0.8133   | **0.7667**\n\"fvie\"     | \"ten\"      | 0.0    | 0.0          | 0.0              | 0.0      | 0.0\n\n- The \"origin\" result is from the [original C implementation by the author of the algorithm](http://web.archive.org/web/20100227020019/http://www.census.gov/geo/msb/stand/strcmp.c).\n- Test data are borrowed from [fuzzy-string-match's rspec file](https://github.com/kiyoka/fuzzy-string-match/blob/master/test/basic_pure_spec.rb).\n\n# Benchmark\n\n```\n$ bundle exec rake benchmark\nruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin16]\n\n# C 
Extension\nRehearsal --------------------------------------------------------------\njaro_winkler (8c16e09)       0.240000   0.000000   0.240000 (  0.241347)\nfuzzy-string-match (1.0.1)   0.400000   0.010000   0.410000 (  0.403673)\nhotwater (0.1.2)             0.250000   0.000000   0.250000 (  0.254503)\namatch (0.4.0)               0.870000   0.000000   0.870000 (  0.875930)\n----------------------------------------------------- total: 1.770000sec\n\n                                 user     system      total        real\njaro_winkler (8c16e09)       0.230000   0.000000   0.230000 (  0.236921)\nfuzzy-string-match (1.0.1)   0.380000   0.000000   0.380000 (  0.381942)\nhotwater (0.1.2)             0.250000   0.000000   0.250000 (  0.254977)\namatch (0.4.0)               0.860000   0.000000   0.860000 (  0.861207)\n\n# Pure Ruby\nRehearsal --------------------------------------------------------------\njaro_winkler (8c16e09)       0.440000   0.000000   0.440000 (  0.438470)\nfuzzy-string-match (1.0.1)   0.860000   0.000000   0.860000 (  0.862850)\n----------------------------------------------------- total: 1.300000sec\n\n                                 user     system      total        real\njaro_winkler (8c16e09)       0.440000   0.000000   0.440000 (  0.439237)\nfuzzy-string-match (1.0.1)   0.910000   0.010000   0.920000 (  0.920259)\n```\n\n# Todo\n\n- Custom adjusting word table.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftonytonyjan%2Fjaro_winkler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftonytonyjan%2Fjaro_winkler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftonytonyjan%2Fjaro_winkler/lists"}