{"id":16928308,"url":"https://github.com/maartengr/polyfuzz","last_synced_at":"2025-10-04T11:16:51.789Z","repository":{"id":40456459,"uuid":"314748658","full_name":"MaartenGr/PolyFuzz","owner":"MaartenGr","description":"Fuzzy string matching, grouping, and evaluation. ","archived":false,"fork":false,"pushed_at":"2025-02-17T08:19:37.000Z","size":4219,"stargazers_count":758,"open_issues_count":31,"forks_count":71,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-04-12T22:16:46.170Z","etag":null,"topics":["bert","edit-distance","embeddings","levenshtein-distance","string-matching","tf-idf"],"latest_commit_sha":null,"homepage":"https://maartengr.github.io/PolyFuzz/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MaartenGr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-21T06:28:09.000Z","updated_at":"2025-03-30T19:24:57.000Z","dependencies_parsed_at":"2024-06-18T15:27:19.481Z","dependency_job_id":"e78e0b04-39f4-4e89-ad0e-a827f06177b7","html_url":"https://github.com/MaartenGr/PolyFuzz","commit_stats":{"total_commits":26,"total_committers":6,"mean_commits":4.333333333333333,"dds":"0.23076923076923073","last_synced_commit":"5d0734b093ece13e3e0e95dd3a26549f014497a8"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaartenGr%2FPolyFuzz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaartenGr%2FPolyFuzz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/MaartenGr%2FPolyFuzz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaartenGr%2FPolyFuzz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MaartenGr","download_url":"https://codeload.github.com/MaartenGr/PolyFuzz/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248637787,"owners_count":21137538,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","edit-distance","embeddings","levenshtein-distance","string-matching","tf-idf"],"created_at":"2024-10-13T20:36:26.028Z","updated_at":"2025-10-04T11:16:51.764Z","avatar_url":"https://github.com/MaartenGr.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"images/logo.png\" width=\"70%\" /\u003e\n\n[![PyPI Downloads](https://static.pepy.tech/badge/polyfuzz)](https://pepy.tech/projects/polyfuzz)\n[![PyPI - Python](https://img.shields.io/badge/python-3.9+-blue.svg)](https://pypi.org/project/polyfuzz/)\n[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/polyfuzz/blob/master/LICENSE)\n[![PyPI - PyPi](https://img.shields.io/pypi/v/polyfuzz)](https://pypi.org/project/polyfuzz/)\n[![Build](https://img.shields.io/github/actions/workflow/status/MaartenGr/polyfuzz/testing.yml?branch=master)](https://github.com/MaartenGr/polyfuzz/actions)\n[![docs](https://img.shields.io/badge/docs-Passing-green.svg)](https://maartengr.github.io/PolyFuzz/)  \n**`PolyFuzz`** performs fuzzy string matching, 
string grouping, and contains extensive evaluation functions. \nPolyFuzz is meant to bring fuzzy string matching techniques together within a single framework.\n\nCurrently, methods include a variety of edit distance measures, a character-based n-gram TF-IDF, word embedding\ntechniques such as FastText and GloVe, and 🤗 transformers embeddings.  \n\nThe corresponding Medium post can be found [here](https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136?source=friends_link\u0026sk=0f765b76ceaba49363829c13dfdc9d98).\n\n\n\u003ca name=\"installation\"/\u003e\u003c/a\u003e\n## Installation\nYou can install **`PolyFuzz`** via pip:\n \n```bash\npip install polyfuzz\n```\n\nYou may want to install more depending on the transformers and language backends that you will be using. The possible installations are:\n\n```bash\npip install polyfuzz[sbert]\npip install polyfuzz[flair]\npip install polyfuzz[gensim]\npip install polyfuzz[spacy]\npip install polyfuzz[use]\n```\n\nIf you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models, \nyou can use `sparse_dot_topn` which is installed via:\n\n```bash\npip install polyfuzz[fast]\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eInstallation Issues\u003c/summary\u003e\n\nYou might run into installation issues with `sparse_dot_topn`. If so, one solution that has worked for many \nis to install it via conda before installing PolyFuzz:\n\n```bash\nconda install -c conda-forge sparse_dot_topn\n```\n\nIf that does not work, I would advise you to look through their \n[issues](https://github.com/ing-bank/sparse_dot_topn/issues) page or continue to use PolyFuzz without `sparse_dot_topn`. 
\n\n\u003c/details\u003e  \n\n\n\u003ca name=\"gettingstarted\"/\u003e\u003c/a\u003e\n## Getting Started\n\nFor an in-depth overview of the possibilities of `PolyFuzz` \nyou can check the full documentation [here](https://maartengr.github.io/PolyFuzz/) or you can follow along \nwith the notebook [here](https://github.com/MaartenGr/PolyFuzz/blob/master/notebooks/Overview.ipynb).\n\n### Quick Start\n\nThe main goal of `PolyFuzz` is to allow the user to perform different methods for matching strings. \nWe start by defining two lists, one to map from and one to map to. We are going to be using `TF-IDF` to create \nn-grams on a character level in order to compare similarity between strings. Then, we calculate the similarity \nbetween strings by calculating the cosine similarity between vector representations. \n\nWe only have to instantiate `PolyFuzz` with `TF-IDF` and match the lists:\n\n```python\nfrom polyfuzz import PolyFuzz\n\nfrom_list = [\"apple\", \"apples\", \"appl\", \"recal\", \"house\", \"similarity\"]\nto_list = [\"apple\", \"apples\", \"mouse\"]\n\nmodel = PolyFuzz(\"TF-IDF\")\nmodel.match(from_list, to_list)\n```  \n\nThe resulting matches can be accessed through `model.get_matches()`:\n\n```python\n\u003e\u003e\u003e model.get_matches()\n         From      To    Similarity\n0       apple   apple    1.000000\n1      apples  apples    1.000000\n2        appl   apple    0.783751\n3       recal    None    0.000000\n4       house   mouse    0.587927\n5  similarity    None    0.000000\n\n``` \n\n**NOTE 1**: If you want to compare distances within a single list, you can simply pass that list as such: `model.match(from_list)`\n\n**NOTE 2**: When instantiating `PolyFuzz` we also could have used \"EditDistance\" or \"Embeddings\" to quickly \naccess Levenshtein and FastText (English) respectively. \n\n### Production\nThe `.match` function allows you to quickly extract similar strings. 
However, after selecting the right models to be used, you may want to use PolyFuzz \nin production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions. \n\nLet's say that we have a list of words that we know to be correct called `train_words`. We want any incoming word to be mapped to one of the words in `train_words`. \nIn other words, we `fit` on `train_words` and we use `transform` on any incoming words:\n\n```python\nfrom polyfuzz import PolyFuzz\n\ntrain_words = [\"apple\", \"apples\", \"appl\", \"recal\", \"house\", \"similarity\"]\nunseen_words = [\"apple\", \"apples\", \"mouse\"]\n\n# Fit\nmodel = PolyFuzz(\"TF-IDF\")\nmodel.fit(train_words)\n\n# Transform\nresults = model.transform(unseen_words)\n```\n\nIn the above example, we are using `fit` on `train_words` to calculate the TF-IDF representations of those words, which are saved to be reused in `transform`. \nThis speeds up `transform` quite a bit since all TF-IDF representations are stored when applying `fit`. \n\nThen, we can save and load the model as follows for use in production:\n\n```python\n# Save the model\nmodel.save(\"my_model\")\n\n# Load the model\nloaded_model = PolyFuzz.load(\"my_model\")\n```\n\n### Group Matches\nWe can group the matches `To` as there might be significant overlap in strings in our to_list. 
\nTo do this, we calculate the similarity between the strings in to_list and then use `single linkage` to \ngroup the strings with a high similarity.\n\nWhen we extract the new matches, we can see an additional column `Group` that shows the group each `To` match was assigned to:\n\n```python\n\u003e\u003e\u003e model.group(link_min_similarity=0.75)\n\u003e\u003e\u003e model.get_matches()\n\t      From\tTo\t\tSimilarity\tGroup\n0\t     apple\tapple\t1.000000\tapples\n1\t    apples\tapples\t1.000000\tapples\n2\t      appl\tapple\t0.783751\tapples\n3\t     recal\tNone\t0.000000\tNone\n4\t     house\tmouse\t0.587927\tmouse\n5\tsimilarity\tNone\t0.000000\tNone\n```\n\nAs can be seen above, we grouped `apple` and `apples` together such that when a string is mapped to `apple` it \nwill fall in the cluster of `[apples, apple]` and will be mapped to the first instance in the cluster, which is `apples`.\n\n### Precision-Recall Curve  \nNext, we would like to see how well our model is doing on our data. We express our results as \n**`precision`** and **`recall`** where precision is defined as the minimum similarity score before a match is considered correct and \nrecall as the percentage of matches found at a certain minimum similarity score.  \n\nCreating the visualizations is as simple as:\n\n```python\nmodel.visualize_precision_recall()\n```\n\u003cimg src=\"images/tfidf.png\" width=\"100%\" /\u003e \n\n## Models\nCurrently, the following models are implemented in PolyFuzz:\n* TF-IDF\n* EditDistance (you can use any distance measure, see [documentation](https://maartengr.github.io/PolyFuzz/tutorial/models/#EditDistance))\n* FastText and GloVe\n* 🤗 Transformers\n\nWith `Flair`, we can use all 🤗 Transformers [models](https://huggingface.co/transformers/pretrained_models.html). 
\nWe simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzz.\n\nAll models listed above can be found in `polyfuzz.models` and can be used to create and compare different models:\n\n```python\nfrom polyfuzz import PolyFuzz\nfrom polyfuzz.models import EditDistance, TFIDF, Embeddings\nfrom flair.embeddings import TransformerWordEmbeddings\n\nembeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')\nbert = Embeddings(embeddings, min_similarity=0, model_id=\"BERT\")\ntfidf = TFIDF(min_similarity=0)\nedit = EditDistance()\n\nstring_models = [bert, tfidf, edit]\nmodel = PolyFuzz(string_models)\nmodel.match(from_list, to_list)\n```\n\nTo access the results, we again can call `get_matches` but since we have multiple models we get a dictionary \nof dataframes back. \n\nIn order to access the results of a specific model, call `get_matches` with the correct id: \n\n```python\n\u003e\u003e\u003e model.get_matches(\"BERT\")\n        From\t    To          Similarity\n0\tapple\t    apple\t1.000000\n1\tapples\t    apples\t1.000000\n2\tappl\t    apple\t0.928045\n3\trecal\t    apples\t0.825268\n4\thouse\t    mouse\t0.887524\n5\tsimilarity  mouse\t0.791548\n``` \n\nFinally, visualize the results to compare the models:\n\n```python\nmodel.visualize_precision_recall(kde=True)\n```\n\n\u003cimg src=\"images/multiple_models.png\" width=\"100%\" /\u003e\n\n## Custom Grouper\nWe can even use one of the `polyfuzz.models` as the grouper in case you would like to use \nsomething other than the standard TF-IDF model:\n\n```python\nmodel = PolyFuzz(\"TF-IDF\")\nmodel.match(from_list, to_list)\n\nedit_grouper = EditDistance(n_jobs=1)\nmodel.group(edit_grouper)\n```\n\n## Custom Models\nAlthough the options above are a great solution for comparing different models, what if you have developed your own? 
\nIf you follow the structure of PolyFuzz's `BaseMatcher`  \nyou can quickly implement any model you would like.\n\nBelow, we are implementing the ratio similarity measure from RapidFuzz.\n\n```python\nimport numpy as np\nimport pandas as pd\nfrom rapidfuzz import fuzz\nfrom polyfuzz import PolyFuzz\nfrom polyfuzz.models import BaseMatcher\n\n\nclass MyModel(BaseMatcher):\n    def match(self, from_list, to_list, **kwargs):\n        # Calculate distances\n        matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list] \n                    for from_string in from_list]\n        \n        # Get best matches\n        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]\n        scores = np.max(matches, axis=1)\n        \n        # Prepare dataframe\n        matches = pd.DataFrame({'From': from_list,'To': mappings, 'Similarity': scores})\n        return matches\n```\nThen, we can simply create an instance of MyModel and pass it through PolyFuzz:\n```python\ncustom_model = MyModel()\nmodel = PolyFuzz(custom_model)\n```\n\n## Citation\nTo cite PolyFuzz in your work, please use the following BibTeX reference:\n\n```bibtex\n@misc{grootendorst2020polyfuzz,\n  author       = {Maarten Grootendorst},\n  title        = {PolyFuzz: Fuzzy string matching, grouping, and evaluation.},\n  year         = 2020,\n  publisher    = {Zenodo},\n  version      = {v0.2.2},\n  doi          = {10.5281/zenodo.4461050},\n  url          = {https://doi.org/10.5281/zenodo.4461050}\n}\n```\n\n## References\nBelow, you can find several resources that were used or served as inspiration when developing PolyFuzz:  \n  \n**Edit distance algorithms**:  \nThese algorithms focus primarily on edit distance measures and can be used in `polyfuzz.models.EditDistance`:\n\n* https://github.com/jamesturk/jellyfish\n* https://github.com/ztane/python-Levenshtein\n* https://github.com/seatgeek/fuzzywuzzy\n* https://github.com/maxbachmann/rapidfuzz\n* https://github.com/roy-ht/editdistance\n\n**Other interesting 
repos**:\n\n* https://github.com/ing-bank/sparse_dot_topn\n    * Used in PolyFuzz for fast cosine similarity between sparse matrices\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaartengr%2Fpolyfuzz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaartengr%2Fpolyfuzz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaartengr%2Fpolyfuzz/lists"}