{"id":15882779,"url":"https://github.com/chris-santiago/stringcluster","last_synced_at":"2026-05-16T17:33:55.023Z","repository":{"id":133575279,"uuid":"400661117","full_name":"chris-santiago/stringcluster","owner":"chris-santiago","description":"A Scikit-Learn style deduper.","archived":false,"fork":false,"pushed_at":"2022-10-27T18:09:57.000Z","size":353,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-10-07T14:47:00.867Z","etag":null,"topics":["dedupe","deduplication","scikit-learn","text-processing","text-similarity","transformer"],"latest_commit_sha":null,"homepage":"https://chris-santiago.github.io/stringcluster/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chris-santiago.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-08-27T23:54:49.000Z","updated_at":"2022-03-30T02:36:36.000Z","dependencies_parsed_at":null,"dependency_job_id":"2a8009a5-90d5-4879-bebf-3ec8fa97640c","html_url":"https://github.com/chris-santiago/stringcluster","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/chris-santiago/stringcluster","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chris-santiago%2Fstringcluster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chris-santiago%2Fstringcluster/tags","releases_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chris-santiago%2Fstringcluster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chris-santiago%2Fstringcluster/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chris-santiago","download_url":"https://codeload.github.com/chris-santiago/stringcluster/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chris-santiago%2Fstringcluster/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279183585,"owners_count":26121431,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-16T02:00:06.019Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dedupe","deduplication","scikit-learn","text-processing","text-similarity","transformer"],"created_at":"2024-10-06T04:07:07.734Z","updated_at":"2025-10-16T11:29:57.255Z","avatar_url":"https://github.com/chris-santiago.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# string-cluster\n[![Build 
Status](https://app.travis-ci.com/chris-santiago/stringcluster.svg?branch=master)](https://app.travis-ci.com/chris-santiago/stringcluster)\n[![codecov](https://codecov.io/gh/chris-santiago/stringcluster/branch/master/graph/badge.svg?token=X2SqPEfCdZ)](https://codecov.io/gh/chris-santiago/stringcluster)\n\n## Install\n\nCreate a virtual environment with Python 3.9 and install from git:\n\n```bash\npip install git+https://github.com/chris-santiago/stringcluster.git\n```\n\n## Use\n\n### Preliminaries\n\nThis example shows how to use `StringCluster` to deduplicate a list of public company names. The example dataset is a series of company names and their respective variations.\n\n`StringCluster` uses Tf-Idf vectorization to tokenize each element in a series of strings and normalize the count of each n-gram token. It then builds a cosine similarity matrix by computing the linear kernel of the resulting vectors (the linear kernel equals cosine similarity when the vectors are L2-normalized). Each string can be compared either against the rest of the series or against a master list of strings to de-duplicate the original series.\n\n\n```python\nimport re\n\nimport pandas as pd\n\nfrom stringcluster import StringCluster\n```\n\n### Data\n\nAs mentioned, the example dataset is a series of company names (strings). 
To illustrate, we'll pull out all samples that contain the string \"FACEBOOK\"; we have 11 unique versions for this single company.\n\n\n```python\ndata = pd.read_csv('../data/companies.csv')\ndata.head(10)\n```\n\n\n\n\n\u003cdiv\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003ecompany\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003eMICROSOFT CORP\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003eAPPLE INC\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003eFACEBOOK INC\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003eISHARES TR\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003eORACLE CORP\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e5\u003c/th\u003e\n      \u003ctd\u003eALPHABET INC - A\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e6\u003c/th\u003e\n      \u003ctd\u003eJOHNSON \u0026amp; JOHNSON\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e7\u003c/th\u003e\n      \u003ctd\u003eWESTERN DIGITAL CORP\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e8\u003c/th\u003e\n      \u003ctd\u003eAMAZON.COM INC\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e9\u003c/th\u003e\n      \u003ctd\u003eVISA INC\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\ncompanies = data['company']\nmask = data['company'].str.contains('FACEBOOK')\nfacebook = 
data['company'][mask]\nprint(f'Number of unique version: {facebook.nunique()}')\nfacebook\n```\n\n    Number of unique version: 11\n\n\n\n\n\n    2                                           FACEBOOK INC\n    408     FACEBOOK INC            CLASS                  A\n    474                                    FACEBOOK INC CL A\n    998                                           FACEBOOK-A\n    1042                                FACEBOOK INC CLASS A\n    1101                                      FACEBOOK INC A\n    1448                                      FACEBOOK INC-A\n    3020                                FACEBOOK INC COM NPV\n    3626                                     FACEBOOK INC -A\n    3638                                            FACEBOOK\n    4340                                      FACEBOOK, INC.\n    Name: company, dtype: object\n\n\n\n### De-duplicating\n\nAs mentioned, `StringCluster` can be used with or without a \"master\" list of string representations, depending on the use case. A master list is provided as the `y` parameter in the `.fit_transform()` method. This is useful if users have a designated set of representations under which they wish to group each sample.\n\n#### Without a master list\n\nLet's first take a look at use **without** a master list. The `StringCluster` transformer takes three parameters:\n\n|Parameter|Type|Description|\n|---------|----|-----------|\n|`ngram_size`|int|Size of n-grams to be extracted; default 2.|\n|`threshold`|float|Cosine similarity threshold for flagging matches; must be in [0, 1]; default 0.8.|\n|`stop_tokens`|str|RegEx pattern to remove during tokenization; default `r'[\\W_]+'`|\n\nAlthough we're using Tf-Idf vectorization, and common tokens will have less effect, we can improve performance by providing a pattern of domain-specific stop tokens. 
In this case, we'll remove special characters, whitespace and any word that relates to \"corporation\", \"incorporated\", etc., prior to Tf-Idf vectorization, since these variations within a company's name are meaningless.\n\nAfter fitting the `StringCluster` object and transforming the data, we see that all 11 variations of \"Facebook\" have consolidated to \"FACEBOOK INC\".\n\n**Of note: When using `StringCluster` without a master list, the transformer will default to replacing variations of a string representation with the first variation seen; in this case, \"FACEBOOK INC\".**\n\n\n```python\nSTOP_TOKENS = r'[\\W_]+|(corporation$)|(corp.$)|(corp$)|(incorporated$)|(inc.$)|(inc$)|(company$)|(common$)|(com$)'\n\ncluster = StringCluster(ngram_size=2, threshold=0.7, stop_tokens=STOP_TOKENS)\nlabels = cluster.fit_transform(data['company'])\n```\n\n\n```python\nlabels[facebook.index]\n```\n\n\n\n\n    2       FACEBOOK INC\n    408     FACEBOOK INC\n    474     FACEBOOK INC\n    998     FACEBOOK INC\n    1042    FACEBOOK INC\n    1101    FACEBOOK INC\n    1448    FACEBOOK INC\n    3020    FACEBOOK INC\n    3626    FACEBOOK INC\n    3638    FACEBOOK INC\n    4340    FACEBOOK INC\n    Name: company, dtype: object\n\n\n\n#### With a master list\n\nNow let's take a look at use **with** a master list. As mentioned, the master list is passed as the `y` parameter in the `.fit()` and `.fit_transform()` methods.  
In this case, each string in the series is compared against the master list and replaced with the representation in the master list with which it exhibits the highest cosine similarity.\n\n\n```python\nTEST_SERIES = pd.Series(\n        ['Johnson \u0026 Johnson, Inc.', 'Johnson \u0026 Johnson Inc.', 'Johnson \u0026 Johnson Inc',\n         'Johnson \u0026 Johnson', 'Intel Corp', 'Intel Corp.', 'Intel Corporation', 'Google',\n         'Apple', 'Amazon', 'Amazon Inc', 'Comcast Inc.', 'Comcast Corp']\n    )\nMASTER = ['Johnson \u0026 Johnson', 'Intel Corp', 'Google', 'Apple Inc', 'Amazon', 'Comcast']\n\nSTOP_TOKENS = r'[\\W_]+|(corporation$)|(corp.$)|(corp$)|(incorporated$)|(inc.$)|(inc$)|(company$)|(common$)|(com$)'\n\ncluster = StringCluster(ngram_size=2, stop_tokens=STOP_TOKENS)\nlabels = cluster.fit_transform(TEST_SERIES, MASTER)\n```\n\n\n```python\nlabels\n```\n\n\n\n\n    0     Johnson \u0026 Johnson\n    1     Johnson \u0026 Johnson\n    2     Johnson \u0026 Johnson\n    3     Johnson \u0026 Johnson\n    4            Intel Corp\n    5            Intel Corp\n    6            Intel Corp\n    7                Google\n    8             Apple Inc\n    9                Amazon\n    10               Amazon\n    11              Comcast\n    12              Comcast\n    dtype: object\n\n\n\n### Trialing Different Threshold Values\n\nThe `StringCluster` transformer is sensitive to the `threshold` parameter (especially without a master list), as this controls how matches are flagged, based on their cosine similarity.  
Let's take a look at how varying levels of the `threshold` parameter affect results on our Facebook example.\n\n\n```python\nthresh = 0.7\nwhile thresh \u003c 1:\n    cluster = StringCluster(ngram_size=2, threshold=thresh, stop_tokens=STOP_TOKENS)\n    labels = cluster.fit_transform(data['company'])\n    print(f'Threshold: {thresh}')\n    print('----------------------------------------')\n    print(labels[facebook.index])\n    print('========================================')\n    thresh += 0.05\n```\n\n    Threshold: 0.7\n    ----------------------------------------\n    2       FACEBOOK INC\n    408     FACEBOOK INC\n    474     FACEBOOK INC\n    998     FACEBOOK INC\n    1042    FACEBOOK INC\n    1101    FACEBOOK INC\n    1448    FACEBOOK INC\n    3020    FACEBOOK INC\n    3626    FACEBOOK INC\n    3638    FACEBOOK INC\n    4340    FACEBOOK INC\n    Name: company, dtype: object\n    ========================================\n    Threshold: 0.75\n    ----------------------------------------\n    2               FACEBOOK INC\n    408             FACEBOOK INC\n    474             FACEBOOK INC\n    998             FACEBOOK INC\n    1042            FACEBOOK INC\n    1101            FACEBOOK INC\n    1448            FACEBOOK INC\n    3020    FACEBOOK INC COM NPV\n    3626            FACEBOOK INC\n    3638            FACEBOOK INC\n    4340            FACEBOOK INC\n    Name: company, dtype: object\n    ========================================\n    Threshold: 0.8\n    ----------------------------------------\n    2                                           FACEBOOK INC\n    408     FACEBOOK INC            CLASS                  A\n    474                                         FACEBOOK INC\n    998                                         FACEBOOK INC\n    1042    FACEBOOK INC            CLASS                  A\n    1101                                        FACEBOOK INC\n    1448                                        FACEBOOK INC\n    3020                              
  FACEBOOK INC COM NPV\n    3626                                        FACEBOOK INC\n    3638                                        FACEBOOK INC\n    4340                                        FACEBOOK INC\n    Name: company, dtype: object\n    ========================================\n    Threshold: 0.8500000000000001\n    ----------------------------------------\n    2                                           FACEBOOK INC\n    408     FACEBOOK INC            CLASS                  A\n    474     FACEBOOK INC            CLASS                  A\n    998                                         FACEBOOK INC\n    1042    FACEBOOK INC            CLASS                  A\n    1101                                        FACEBOOK INC\n    1448                                        FACEBOOK INC\n    3020                                FACEBOOK INC COM NPV\n    3626                                        FACEBOOK INC\n    3638                                        FACEBOOK INC\n    4340                                        FACEBOOK INC\n    Name: company, dtype: object\n    ========================================\n    Threshold: 0.9000000000000001\n    ----------------------------------------\n    2                                           FACEBOOK INC\n    408     FACEBOOK INC            CLASS                  A\n    474     FACEBOOK INC            CLASS                  A\n    998                                         FACEBOOK INC\n    1042    FACEBOOK INC            CLASS                  A\n    1101                                   FACEBOOK INC CL A\n    1448                                   FACEBOOK INC CL A\n    3020                                FACEBOOK INC COM NPV\n    3626                                   FACEBOOK INC CL A\n    3638                                        FACEBOOK INC\n    4340                                        FACEBOOK INC\n    Name: company, dtype: object\n    ========================================\n    Threshold: 
0.9500000000000002\n    ----------------------------------------\n    2                                           FACEBOOK INC\n    408     FACEBOOK INC            CLASS                  A\n    474                                    FACEBOOK INC CL A\n    998                                         FACEBOOK INC\n    1042    FACEBOOK INC            CLASS                  A\n    1101                                      FACEBOOK INC A\n    1448                                      FACEBOOK INC A\n    3020                                FACEBOOK INC COM NPV\n    3626                                      FACEBOOK INC A\n    3638                                        FACEBOOK INC\n    4340                                        FACEBOOK INC\n    Name: company, dtype: object\n    ========================================\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchris-santiago%2Fstringcluster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchris-santiago%2Fstringcluster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchris-santiago%2Fstringcluster/lists"}