{"id":21564641,"url":"https://github.com/vanessaklee/akin","last_synced_at":"2025-09-03T15:43:01.894Z","repository":{"id":50634364,"uuid":"373851064","full_name":"vanessaklee/akin","owner":"vanessaklee","description":"A collection of metrics and phonetic algorithms for fuzzy string matching in Elixir.","archived":false,"fork":false,"pushed_at":"2023-09-03T02:39:11.000Z","size":8882,"stargazers_count":37,"open_issues_count":0,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-08-27T14:56:01.693Z","etag":null,"topics":["algorithm","comparison-tool","disambiguation","double-metaphone","elixir","hamming-distance","jaro-winkler","levenshtein-distance","metaphone","sorensen-dice-distance","string-comparison","string-matching"],"latest_commit_sha":null,"homepage":"","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vanessaklee.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-06-04T13:30:23.000Z","updated_at":"2025-07-22T15:11:55.000Z","dependencies_parsed_at":"2025-04-10T13:07:46.741Z","dependency_job_id":"f8a99b4d-7277-4e30-ab3d-12597f6fb59c","html_url":"https://github.com/vanessaklee/akin","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/vanessaklee/akin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vanessaklee%2Fakin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vanessaklee%2Fakin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/vanessaklee%2Fakin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vanessaklee%2Fakin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vanessaklee","download_url":"https://codeload.github.com/vanessaklee/akin/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vanessaklee%2Fakin/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273467627,"owners_count":25111130,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-03T02:00:09.631Z","response_time":76,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","comparison-tool","disambiguation","double-metaphone","elixir","hamming-distance","jaro-winkler","levenshtein-distance","metaphone","sorensen-dice-distance","string-comparison","string-matching"],"created_at":"2024-11-24T10:16:37.646Z","updated_at":"2025-09-03T15:43:01.868Z","avatar_url":"https://github.com/vanessaklee.png","language":"Elixir","readme":"Akin\n=======\n\nAkin is a collection of string comparison algorithms for Elixir. This solution was born of a [Record Linking](https://en.wikipedia.org/wiki/Record_linkage) project. It combines and modifies [The Fuzz](https://github.com/smashedtoatoms/the_fuzz) and [Fuzzy Compare](https://github.com/patrickdet/fuzzy_compare). 
Algorithms can be called independently or in total to return a map of metrics. This library was built to facilitate the disambiguation of names but can be used to compare any two binaries. \n\n## New! Notebooks\n\n### Disambiguation\n\n[![Run Disambiguation in Livebook](https://livebook.dev/badge/v1/blue.svg)](https://livebook.dev/run?url=https%3A%2F%2Fgithub.com%2Fvanessaklee%2Fakin%2Fblob%2Fmain%2Fnotebooks%2Fdisambiguation.livemd)\n\n### Name Disambiguation\n\n[![Run Name Disambiguation in Livebook](https://livebook.dev/badge/v1/blue.svg)](https://livebook.dev/run?url=https%3A%2F%2Fgithub.com%2Fvanessaklee%2Fakin%2Fblob%2Fmain%2Fnotebooks%2Fname_disambiguation.livemd)\n\n\u003cdetails\u003e\n  \u003csummary\u003eTable of Contents\u003c/summary\u003e\n  \n  1. [Installation](#installation)\n  1. [Algorithms](#algorithms)\n  1. [Metrics](#metrics)\n     * [Compare Strings](#compare-strings)\n     * [Options](#options)\n         * [Algorithms](#algorithms)\n         * [Stems](#stems)\n         * [n-gram Size](#n-gram-size)\n         * [Match Level](#match-level)\n     * [Preprocessing](#preprocessing)\n         * [Accents](#accents)\n     * [Phonemes](#phonemes)\n     * [Name Disambiguation](#name-disambiguation)\n  1. Algorithm [Definitions](#definitions)\n  1. [Resources](#resources)\n  1. [In Development](#in-development)\n\u003c/details\u003e\n\n## Installation\n\nAdd a dependency in your mix.exs:\n\n```elixir\ndeps: [{:akin, \"~\u003e 0.2.0\"}]\n```\n\n## Algorithms\n\nList all of the available algorithms with `Akin.Util.list_algorithms/0`. Hamming Distance is excluded as it only compares strings of equal length. Hamming may be called directly. 
See: [Single Algorithms](#single-algorithms)\n\n```elixir\niex\u003e Akin.Util.list_algorithms()\n[\"bag_distance\", \"substring_set\", \"sorensen_dice\", \"jaccard\", \"jaro_winkler\", \n\"levenshtein\", \"metaphone\", \"double_metaphone\", \"substring_double_metaphone\", \"ngram\", \n\"overlap\", \"substring_sort\", \"tversky\"]\n```\n\n## Metrics\n\n### Compare Strings\n\nCompare two strings using all of the available algorithms. The return value is a map of scores for each algorithm.\n\n```elixir\niex\u003e Akin.compare(\"weird\", \"wierd\")\n%{\n  bag_distance: 1.0,\n  sorensen_dice: 0.25,\n  double_metaphone: 1.0,\n  jaccard: 0.14,\n  jaro_winkler: 0.94,\n  levenshtein: 0.6,\n  metaphone: 1.0,\n  ngram: 0.25,\n  overlap: 0.25,\n  tversky: 0.14\n}\n```\n\n```elixir\niex\u003e Akin.compare(\"beginning\", \"begining\")\n%{\n  bag_distance: 0.89,\n  sorensen_dice: 0.93,\n  double_metaphone: 1.0,\n  jaccard: 0.88,\n  jaro_winkler: 0.95,\n  levenshtein: 0.89,\n  metaphone: 1.0,\n  ngram: 0.88,\n  overlap: 1.0,\n  tversky: 0.88\n}\n```\n\n### Options\n\nComparison accepts options in a Keyword list. \n\n  1. `algorithms`: algorithms to use in comparison. Accepts the name or a keyword list. Default is algorithms/0.\n      1. `metric` - algorithm metric. Default is both.\n        - \"string\": uses string algorithms\n        - \"phonetic\": uses phonetic algorithms\n      1. `unit` - algorithm unit. Default is both.\n        - \"whole\": uses algorithms best suited for whole string comparison (distance)\n        - \"partial\": uses algorithms best suited for partial string comparison (substring)\n  1. `level` - level for double phonetic matching. 
Default is \"normal\".\n      - \"strict\": both encodings for each string must match\n      - \"strong\": the primary encoding for each string must match\n      - \"normal\": the primary encoding of one string must match either encoding of the other string (default)\n      - \"weak\":   either primary or secondary encoding of one string must match one encoding of the other string\n  1. `match_at`: an algorithm score equal to or above this value is considered a match. Default is 0.9.\n  1. `ngram_size`: number of contiguous letters to split strings into. Default is 2.\n  1. `short_length`: string length that qualifies as \"short\" and receives a shortness boost. Used by Name Metric. Default is 8.\n  1. `stem`: boolean representing whether to compare the stemmed version of the strings; uses Stemmer. Default is `false`.\n\n#### Algorithms\n\nRestrict the list of algorithms by name or metric and/or unit.\n\n```elixir\niex\u003e opts = [algorithms: [\"bag_distance\", \"jaccard\", \"jaro_winkler\"]]\niex\u003e Akin.compare(\"weird\", \"wierd\", opts) \n%{\nbag_distance: 1.0, \njaccard: 0.14, \njaro_winkler: 0.94\n}\niex\u003e opts = [algorithms: [metric: \"phonetic\", unit: \"whole\"]]\niex\u003e Akin.compare(\"weird\", \"wierd\", opts)\n%{\ndouble_metaphone: 1.0, \nmetaphone: 1.0\n}\n```\n\n#### n-gram Size\n\nThe default ngram size for the algorithms is 2. You can change it by setting \na value in opts.\n\n```elixir\niex\u003e Akin.compare(\"weird\", \"wierd\", [algorithms: [\"sorensen_dice\"]])\n%{sorensen_dice: 0.25}\niex\u003e Akin.compare(\"weird\", \"wierd\", [algorithms: [\"sorensen_dice\"], ngram_size: 1])\n%{sorensen_dice: 0.8}\n```\n\n#### Match Level\n\nThe default match strictness is \"normal\". You can change it by setting \na value in opts. 
Currently it only affects the outcomes of the `substring_set` and\n`double_metaphone` algorithms.\n\n```elixir\niex\u003e left = \"Alice in Wonderland\"\niex\u003e right = \"Alice's Adventures in Wonderland\"\niex\u003e Akin.compare(left, right, [algorithms: [\"substring_set\"]])\n%{substring_set: 0.85}\niex\u003e Akin.compare(left, right, [algorithms: [\"substring_set\"], level: \"weak\"])\n%{substring_set: 0.85}\niex\u003e left = \"which way\"\niex\u003e right = \"whitch way\"\niex\u003e Akin.compare(left, right, [algorithms: [\"double_metaphone\"], level: \"weak\"])\n%{double_metaphone: 1.0}\niex\u003e Akin.compare(left, right, [algorithms: [\"double_metaphone\"], level: \"strict\"])\n%{double_metaphone: 0.0}\n```\n\n#### Stems\n\nCompare the stemmed version of two strings.\n\n```elixir\niex\u003e Akin.compare(\"write\", \"writing\", [algorithms: [\"bag_distance\", \"double_metaphone\"]])\n%{bag_distance: 0.57, double_metaphone: 0.0}\niex\u003e Akin.compare(\"write\", \"writing\", [algorithms: [\"bag_distance\", \"double_metaphone\"], stem: true])\n%{bag_distance: 1.0, double_metaphone: 1.0}\n```\n\n##### Additional Examples\n\n```elixir\niex\u003e Akin.compare(\"weird\", \"wierd\", algorithms: [\"bag_distance\", \"jaro_winkler\", \"jaccard\"])\n%{bag_distance: 1.0, jaccard: 0.14, jaro_winkler: 0.94}\n```\n\n```elixir\niex\u003e Akin.compare(\"weird\", \"wierd\", algorithms: [metric: \"string\", unit: \"whole\"], ngram_size: 1)\n%{\n  bag_distance: 1.0,\n  jaccard: 0.67,\n  jaro_winkler: 0.94,\n  levenshtein: 0.6,\n  sorensen_dice: 0.8,\n  tversky: 1.0\n}\n```\n\n### Preprocessing\n\nBefore being compared, strings are downcased and Unicode-normalized, whitespace is standardized, non-text characters (like punctuation \u0026 emojis) are replaced, and accents are converted. The string is then composed into a struct representing the corpus of data used by the comparison algorithms. 
\n\n\"Alice Liddell\" becomes\n```\n%Akin.Corpus{\n  list: [\"alice\", \"liddell\"],\n  original: \"alice liddell\",\n  set: #MapSet\u003c[\"alice\", \"liddell\"]\u003e,\n  stems: [\"alic\", \"liddel\"],\n  string: \"aliceliddell\"\n}\n```\n\n#### Accents\n\n```elixir\niex\u003e Akin.compare(\"Hubert Łępicki\", \"Hubert Lepicki\")\n%{\n  bag_distance: 0.92,\n  dice_sorensen: 0.83,\n  double_metaphone: 0.0,\n  jaccard: 0.71,\n  jaro_winkler: 0.97,\n  levenshtein: 0.92,\n  metaphone: 0.0,\n  ngram: 0.83,\n  overlap: 0.83,\n  tversky: 0.71\n}\n```\n\n### Phonemes\n\n```elixir\niex\u003e Akin.phonemes(\"virginia\") \n[\"frjn\", \"frkn\"]\niex\u003e Akin.phonemes(\"beginning\")\n[\"bjnnk\", \"pjnnk\", \"pknnk\"]\niex\u003e Akin.phonemes(\"wonderland\")\n[\"wntrlnt\", \"antrlnt\", \"fntrlnt\"]\n```\n\n### Name Disambiguation\n\n_UNDER DEVELOPMENT_\n\nIdentity is the challenge of author name disambiguation (AND). The aim of AND is to match an author's name to that author when the author appears in a list of many authors. Complexity arises from homonymity (many people with the same name) and synonymity (when one person uses different forms/spellings of their name in publications). \n\nGiven the name of an author which is divided into the given, middle, and family name parts (e.g. \"Virginia\", nil, \"Woolf\") and a list of possible matching author names, find and return the matches for the author in the list. If initials exist in the left name, a separate comparison is performed for the initials and the sets of the right string.\n\nIf the comparison metrics produce a score greater than or equal to 0.9, the names are considered a match and returned in the list.\n\n```elixir\niex\u003e Akin.match_names(\"V. Woolf\", [\"V Woolf\", \"V Woolfe\", \"Virginia Woolf\", \"V White\", \"Viginia Wolverine\", \"Virginia Woolfe\"])\n[\"v woolfe\", \"v woolf\"]\niex\u003e Akin.match_names(\"V. 
Woolf\", [\"V Woolf\", \"V Woolfe\", \"Virginia Woolf\", \"V White\", \"Viginia Wolverine\", \"Virginia Woolfe\"])\n[\"virginia woolfe\", \"v woolf\"]\n```\n\nThis may not be what you want. There are likely to be unwanted matches.\n\n```elixir\niex\u003e Akin.match_names(\"V. Woolf\", [\"Victor Woolf\", \"Virginia Woolf\", \"V White\", \"V Woolf\", \"Virginia Woolfe\"])\n[\"v woolf\", \"virginia woolf\", \"victor woolf\"]\n```\n\n---\n\n## Definitions\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eBag Distance\u003c/u\u003e\u003c/summary\u003e\n\nThe bag distance is a cheap distance measure which always returns a distance smaller than or equal to the edit distance. It's meant to be an efficient approximation of the distance between two strings to quickly rule out strings that are largely different.  \n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eDouble Metaphone\u003c/u\u003e\u003c/summary\u003e\n\nCalculates the [Double Metaphone Phonetic Algorithm](https://xlinux.nist.gov/dads/HTML/doubleMetaphone.html) metric of two strings. The return value is based on the match level: strict, strong, normal (default), or weak. \n\n  * \"strict\": both encodings for each string must match\n  * \"strong\": the primary encoding for each string must match\n  * \"normal\": the primary encoding of one string must match either encoding of the other string (default)\n  * \"weak\":   either primary or secondary encoding of one string must match one encoding of the other string\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eHamming Distance\u003c/u\u003e\u003c/summary\u003e\n\nNote: The Hamming algorithm is not used in any of the comparison functions because it requires that the strings being compared be the same length. 
It can be called directly, however, so it is still a part of this library.\n\nThe Hamming distance between two strings of equal length is the number of positions at which the corresponding letters are different. Returns the percentage of change needed to the left string when comparing one string (left) with another string (right). Returns 0.0 if the strings are not the same length. Returns 1.0 if the strings are equal.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eJaccard Similarity\u003c/u\u003e\u003c/summary\u003e\n\nCalculates the similarity of two strings as the size of the intersection divided by the size of the union. Default ngram: 2.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eJaro-Winkler Similarity\u003c/u\u003e\u003c/summary\u003e\n\nJaro-Winkler calculates the edit distance between two strings. A score of one denotes equality. Unlike the Jaro Similarity, it modifies the prefix scale to give a more favorable rating to strings that match from the beginning.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eLevenshtein Distance\u003c/u\u003e\u003c/summary\u003e\n\nCompare two strings for their Levenshtein score. The score is determined by finding the edit distance: the minimum number of single-character edits needed to change one word into the other. The distance is divided by the length of the longer of the two strings, and the result is subtracted from 1.0. \n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eMetaphone\u003c/u\u003e\u003c/summary\u003e\n\nCompares two strings by converting each to an approximate phonetic representation in ASCII and then comparing those phonetic representations. 
Returns a 1 if the phonetic representations are an exact match.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eN-Gram Similarity\u003c/u\u003e\u003c/summary\u003e\n\nCalculates the ngram distance between two strings. Default ngram: 2.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eOverlap Metric\u003c/u\u003e\u003c/summary\u003e\n\nUses the Overlap Similarity metric to compare two strings by tokenizing the strings and measuring their overlap. Default ngram: 1.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eSørensen–Dice\u003c/u\u003e\u003c/summary\u003e\n\nSørensen–Dice coefficient is calculated using bigrams. The equation is `2nt / (nx + ny)` where nx is the number of bigrams in string x, ny is the number of bigrams in string y, and nt is the number of bigrams in both strings. For example, the bigrams of `night` and `nacht` are `{ni,ig,gh,ht}` and `{na,ac,ch,ht}`. They each have four and the intersection is `ht`. \n\n``` (2 · 1) / (4 + 4) = 0.25 ```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eSubstring Double Metaphone\u003c/u\u003e\u003c/summary\u003e\n\nIterate over the cartesian product of the two lists sending each element through\nthe Double Metaphone using all strictness levels until a true value is found\nin the list of returned booleans from the Double Metaphone algorithm. Return the \npercentage of true values found. If true is never returned, return 0. Increases  \naccuracy for search terms containing more than one word.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eSubstring Set\u003c/u\u003e\u003c/summary\u003e\n\nSplits the strings on spaces, sorts, re-joins, and then determines Jaro-Winkler distance. Best when the strings contain irrelevant substrings. 
\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eSubstring Sort\u003c/u\u003e\u003c/summary\u003e\n\nSorts substrings by words, compares the sorted strings in pairs, and returns the maximum ratio. If one string is significantly longer than the other, this method will compare matching substrings only. \n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cu\u003eTversky\u003c/u\u003e\u003c/summary\u003e \n\nA generalization of Sørensen–Dice and Jaccard.\n\u003c/details\u003e\n\n## Resources\n\n* [Disambiguation Datasets](https://github.com/dhwajraj/dataset-person-name-disambiguation)\n* [Double Metaphone in python](https://github.com/oubiwann/metaphone/blob/master/metaphone/metaphone.py)\n* [Fuzzy Compare](https://github.com/patrickdet/fuzzy_compare)\n* [Python Fuzzy Wuzzy (2011)](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)\n* [ML Author Block Disambiguation](https://github.com/helenamihaljevic/ads_author_disambiguation)\n* [ML Author Name Disambiguation](https://medium.com/ai2-blog/s2and-an-improved-author-disambiguation-system-for-semantic-scholar-d09380da30e6)\n* [Record Linking](https://en.wikipedia.org/wiki/Record_linkage)\n* [The Fuzz](https://github.com/smashedtoatoms/the_fuzz)\n* [Homophones used in testing metaphone algorithms](https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/speech/database/homofonz/)\n\n## In Development\n\n* Further enhancements to name matching\n* Add Damerau-Levenshtein algorithm\n  * [Damerau-Levenshtein](https://en.wikipedia.org/wiki/Damerau-Levenshtein_distance)\n  * [Examples](https://datascience.stackexchange.com/questions/60019/damerau-levenshtein-edit-distance-in-python)\n* Add Caverphone algorithm\n  * [Caverphone](https://en.wikipedia.org/wiki/Caverphone)\n  * [Research](https://caversham.otago.ac.nz/files/working/ctp150804.pdf)\n  * 
[Example](https://gist.github.com/kastnerkyle/a697d4e762fa8f53c70eea7bc712eead)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvanessaklee%2Fakin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvanessaklee%2Fakin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvanessaklee%2Fakin/lists"}