{"id":18866805,"url":"https://github.com/ing-bank/entitymatchingmodel","last_synced_at":"2025-04-05T05:09:31.058Z","repository":{"id":211454811,"uuid":"638474636","full_name":"ing-bank/EntityMatchingModel","owner":"ing-bank","description":"Entity Matching Model solves the problem of matching company names between two possibly very large datasets.","archived":false,"fork":false,"pushed_at":"2025-02-25T07:29:32.000Z","size":297,"stargazers_count":68,"open_issues_count":2,"forks_count":8,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-29T04:12:00.750Z","etag":null,"topics":["entity-matching","pandas","spark"],"latest_commit_sha":null,"homepage":"https://entitymatchingmodel.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ing-bank.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-09T12:50:07.000Z","updated_at":"2025-03-23T12:14:07.000Z","dependencies_parsed_at":"2025-02-12T08:09:49.961Z","dependency_job_id":"a1ad17c0-cba8-4aba-a42b-c283771d4201","html_url":"https://github.com/ing-bank/EntityMatchingModel","commit_stats":null,"previous_names":["ing-bank/entitymatchingmodel"],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ing-bank%2FEntityMatchingModel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ing-bank%2FEntityMatchingModel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ing-bank%2FEntityMatchingModel/releases","m
anifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ing-bank%2FEntityMatchingModel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ing-bank","download_url":"https://codeload.github.com/ing-bank/EntityMatchingModel/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247289429,"owners_count":20914464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["entity-matching","pandas","spark"],"created_at":"2024-11-08T05:07:31.467Z","updated_at":"2025-04-05T05:09:31.029Z","avatar_url":"https://github.com/ing-bank.png","language":"Python","readme":"# Entity Matching model\n\n[![Build](https://github.com/ing-bank/EntityMatchingModel/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/ing-bank/EntityMatchingModel/actions)\n[![Latest Github release](https://img.shields.io/github/v/release/ing-bank/EntityMatchingModel)](https://github.com/ing-bank/EntityMatchingModel/releases)\n[![GitHub release date](https://img.shields.io/github/release-date/ing-bank/EntityMatchingModel)](https://github.com/ing-bank/EntityMatchingModel/releases)\n[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v1.json)](https://github.com/astral-sh/ruff)\n[![Downloads](https://static.pepy.tech/badge/emm)](https://pepy.tech/project/emm)\n\n\nEntity Matching Model (EMM) solves the problem of matching company names between two possibly very\nlarge datasets. 
EMM can match millions against millions of names with a distributed approach.\nIt uses the well-established candidate selection techniques in string matching,\nnamely: tfidf vectorization combined with cosine similarity (with significant optimization),\nboth word-based and character-based, and sorted neighbourhood indexing.\nThese so-called indexers complement each other in selecting realistic name-pair candidates.\nOn top of the indexers, EMM has a classifier with optimized string-based, rank-based, and legal-entity\nbased features to estimate how confident a company name match is.\n\nThe classifier can be trained to give a string similarity score or a probability of match.\nBoth types of score are useful, in particular when there are many good-looking matches to choose between.\nOptionally, the EMM package can also be used to match a group of company names that belong together\nto a common company name in the ground truth, for example all the different names used to address an external bank account.\nThis step aggregates the name-matching scores from the supervised layer into a single match.\n\nThe package is modular in design and works with both Pandas and Spark. 
A classifier trained with the former\ncan be used with the latter and vice versa.\n\nFor release history see [GitHub Releases](https://github.com/ing-bank/EntityMatchingModel/releases).\n\n## Notebooks\n\nFor detailed examples of the code please see the notebooks under `notebooks/`.\n\n- `01-entity-matching-pandas-version.ipynb`: Using the Pandas version of EMM for name-matching.\n- `02-entity-matching-spark-version.ipynb`: Using the Spark version of EMM for name-matching.\n- `03-entity-matching-training-pandas-version.ipynb`: Fitting the supervised model and setting a discrimination threshold (Pandas).\n- `04-entity-matching-aggregation-pandas-version.ipynb`: Using the aggregation layer and setting a discrimination threshold (Pandas).\n\n## Documentation\n\nFor documentation, design, and API see [the documentation](https://entitymatchingmodel.readthedocs.io/en/latest/).\nOr read our Medium blog [Entity Matching at Scale!](https://medium.com/p/af20429a80c7)\n\n## Check it out\n\nThe Entity matching model library requires Python \u003e= 3.7 and is pip friendly. 
To get started, simply do:\n\n```shell\npip install emm\n```\n\nor check out the code from our repository:\n\n```shell\ngit clone https://github.com/ing-bank/EntityMatchingModel.git\npip install -e EntityMatchingModel/\n```\n\nwhere in this example the code is installed in editable mode (option -e).\n\nAdditional dependencies can be installed with, e.g.:\n\n```shell\npip install \"emm[spark,dev,test]\"\n```\n\nYou can now use the package in Python with:\n\n\n```python\nimport emm\n```\n\n**Congratulations, you are now ready to use the Entity Matching model!**\n\n## Quick run\n\nAs a quick example, you can do:\n\n```python\nfrom emm import PandasEntityMatching\nfrom emm.data.create_data import create_example_noised_names\n\n# generate example ground-truth names and matching noised names, with typos and missing words.\nground_truth, noised_names = create_example_noised_names(random_seed=42)\ntrain_names, test_names = noised_names[:5000], noised_names[5000:]\n\n# two example name-pair candidate generators: character-based cosine similarity and sorted neighbourhood indexing\nindexers = [\n  {\n      'type': 'cosine_similarity',\n      'tokenizer': 'characters',   # character-based cosine similarity. alternative: 'words'\n      'ngram': 2,                  # 2-character tokens only\n      'num_candidates': 5,         # max 5 candidates per name-to-match\n      'cos_sim_lower_bound': 0.2,  # lower bound on cosine similarity\n  },\n  {'type': 'sni', 'window_length': 3}  # sorted neighbourhood indexing window of size 3.\n]\nem_params = {\n  'name_only': True,         # only consider name information for matching\n  'entity_id_col': 'Index',  # important to set both index and name columns to pick up\n  'name_col': 'Name',\n  'indexers': indexers,\n  'supervised_on': False,    # no supervised model (yet) to select best candidates\n  'with_legal_entity_forms_match': True,   # add feature that indicates match of legal entity forms (e.g. ltd != co)\n}\n# 1. 
initialize the entity matcher\np = PandasEntityMatching(em_params)\n\n# 2. fitting: prepare the indexers based on the ground truth names, e.g. fit the tfidf matrix of the first indexer.\np.fit(ground_truth)\n\n# 3. create and fit a supervised model for the PandasEntityMatching object, to pick the best match (this takes a while)\n#    input is \"positive\" names column 'Name' that are all supposed to match to the ground truth,\n#    and an id column 'Index' to check which candidate name-pairs are matching and which are not.\n#    A fraction of these names may be turned into negative names (= no match to the ground truth).\n#    (internally, candidate name-pairs are automatically generated, these are the input to the classification)\np.fit_classifier(train_names, create_negative_sample_fraction=0.5)\n\n# 4. scoring: generate pandas dataframe of all name-pair candidates.\n#    The classifier-based probability of match is provided in the column 'nm_score'.\n#    Note: can also call p.transform() without training the classifier first.\ncandidates_scored_pd = p.transform(test_names)\n\n# 5. scoring: for each name-to-match, select the best ground-truth candidate.\nbest_candidates = candidates_scored_pd[candidates_scored_pd.best_match]\nbest_candidates.head()\n```\n\nFor Spark, you can use the class `SparkEntityMatching` instead, with the same API as the Pandas version.\nFor all available examples, please see the tutorial notebooks under `notebooks/`.\n\n## Project contributors\n\nThis package was authored by ING Analytics Wholesale Banking.\n\n## Contact and support\n\nContact the WBAA team via GitHub issues.\nPlease note that INGA-WB provides support only on a best-effort basis.\n\n## License\n\nCopyright ING WBAA 2023. 
Entity Matching Model is completely free, open-source and licensed under the [MIT license](https://en.wikipedia.org/wiki/MIT_License).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fing-bank%2Fentitymatchingmodel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fing-bank%2Fentitymatchingmodel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fing-bank%2Fentitymatchingmodel/lists"}