{"id":21037431,"url":"https://github.com/vintasoftware/entity-embed","last_synced_at":"2025-10-08T17:23:53.790Z","repository":{"id":51083627,"uuid":"333867421","full_name":"vintasoftware/entity-embed","owner":"vintasoftware","description":"PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.","archived":false,"fork":false,"pushed_at":"2022-11-18T11:36:08.000Z","size":11981,"stargazers_count":151,"open_issues_count":4,"forks_count":16,"subscribers_count":19,"default_branch":"main","last_synced_at":"2025-04-02T10:44:12.815Z","etag":null,"topics":["approximate-nearest-neighbors","data-matching","deduplication","deep-learning","embeddings","entity-matching","entity-resolution","python","pytorch","record-linkage","representation-learning"],"latest_commit_sha":null,"homepage":"https://entity-embed.readthedocs.io/en/latest/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vintasoftware.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-01-28T19:07:11.000Z","updated_at":"2025-02-25T01:08:21.000Z","dependencies_parsed_at":"2022-09-19T06:00:31.691Z","dependency_job_id":null,"html_url":"https://github.com/vintasoftware/entity-embed","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vintasoftware%2Fentity-embed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vintasoftware%2Fentity-embed/tags","r
eleases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vintasoftware%2Fentity-embed/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vintasoftware%2Fentity-embed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vintasoftware","download_url":"https://codeload.github.com/vintasoftware/entity-embed/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248076317,"owners_count":21043748,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["approximate-nearest-neighbors","data-matching","deduplication","deep-learning","embeddings","entity-matching","entity-resolution","python","pytorch","record-linkage","representation-learning"],"created_at":"2024-11-19T13:26:13.367Z","updated_at":"2025-10-08T17:23:48.752Z","avatar_url":"https://github.com/vintasoftware.png","language":"Jupyter Notebook","funding_links":[],"categories":["Open-Source Software"],"sub_categories":["Embeddings (for pairwise comparison)"],"readme":"# Entity Embed\n\n[![PyPi version](https://img.shields.io/pypi/v/entity-embed.svg)](https://pypi.python.org/pypi/entity-embed)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/entity-embed)](https://pypi.org/project/entity-embed/)\n[![CI](https://github.com/vintasoftware/entity-embed/actions/workflows/ci.yml/badge.svg)](https://github.com/vintasoftware/entity-embed/actions/workflows/ci.yml)\n[![Documentation 
Status](https://readthedocs.org/projects/entity-embed/badge/?version=latest)](https://entity-embed.readthedocs.io/en/latest/?badge=latest)\n[![Coverage Status](https://coveralls.io/repos/github/vintasoftware/entity-embed/badge.svg?branch=main)](https://coveralls.io/github/vintasoftware/entity-embed?branch=main)\n[![License: MIT](https://img.shields.io/github/license/vintasoftware/django-react-boilerplate.svg)](LICENSE.txt)\n\nEntity Embed allows you to transform entities like companies, products, etc. into vectors to support **scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors**.\n\nUsing Entity Embed, you can train a deep learning model to transform records into vectors in an N-dimensional embedding space. Thanks to a contrastive loss, those vectors are organized to keep similar records close and dissimilar records far apart in this embedding space. Embedding records enables [scalable ANN search](http://ann-benchmarks.com/index.html), which means finding thousands of candidate duplicate pairs of records per second per CPU.\n\nEntity Embed achieves Recall of ~0.99 with Pair-Entity ratio below 100 on a variety of datasets. **Entity Embed aims for high recall at the expense of precision. Therefore, this library is suited for the Blocking/Indexing stage of an Entity Resolution pipeline.** A scalable and noise-tolerant Blocking procedure is often the main bottleneck for performance and quality on Entity Resolution pipelines, so this library aims to solve that. 
Note that the ANN search on embedded records returns several candidate pairs that must be filtered to find the best matching pairs, possibly with a pairwise classifier (an [example](#Examples) of that is available).\n\nEntity Embed is based on, and is a special case of, the [AutoBlock model described by Amazon](https://www.amazon.science/publications/autoblock-a-hands-off-blocking-framework-for-entity-matching).\n\n**⚠️ Warning: this project is under heavy development.**\n\n![Embedding Space Example](https://user-images.githubusercontent.com/397989/113318040-689a2d00-92e6-11eb-8373-29477d57d29e.png)\n\n## Documentation\n\nhttps://entity-embed.readthedocs.io\n\n## Requirements\n\n### System\n\n- macOS or Linux (tested on latest macOS and Ubuntu via GitHub Actions).\n- Entity Embed can train and run on a powerful laptop. Tested on a system with 32 GB of RAM, RTX 2070 Mobile (8 GB VRAM), i7-10750H (12 threads). With batch sizes smaller than 32 and few field types, it's possible to train and run even with 2 GB of VRAM.\n\n### Libraries\n\n- **Python**: \u003e= 3.6\n- **[NumPy](https://numpy.org/)**: \u003e= 1.19.0\n- **[PyTorch](https://pytorch.org/)**: \u003e= 1.7.1, \u003c 1.9\n- **[PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/latest/)**: \u003e= 1.1.6, \u003c 1.3\n- **[N2](https://github.com/kakao/n2/)**: \u003e= 0.1.7, \u003c 1.2\n\nAnd others, see [requirements.txt](/requirements.txt).\n\n## Installation\n\n```\npip install entity-embed\n```\n\n### For Conda users\n\nIf you're using Conda, you must install PyTorch beforehand to have proper CUDA support. 
Inside the Conda environment, please run the following command **before** installing Entity Embed using `pip`:\n\n```\nconda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge\n```\n\n## Examples\n\nRun:\n\n```\npip install -r requirements-examples.txt\n```\n\nThen check the example Jupyter Notebooks:\n\n- Deduplication, when you have a single dirty dataset with duplicates: [notebooks/Deduplication-Example.ipynb](/notebooks/Deduplication-Example.ipynb)\n- Record Linkage, when you have multiple clean datasets you need to link: [notebooks/Record-Linkage-Example.ipynb](/notebooks/Record-Linkage-Example.ipynb)\n- After you run the notebooks/Record-Linkage-Example.ipynb, you can check the [notebooks/End-to-End-Matching-Example.ipynb](/notebooks/End-to-End-Matching-Example.ipynb) to learn how to integrate Entity Embed with a pairwise classifier.\n\n### Colab\n\nPlease check [notebooks/google-colab/](https://github.com/vintasoftware/entity-embed/tree/main/notebooks/google-colab/).\n\n## Releases\n\nSee [CHANGELOG.md](/CHANGELOG.md).\n\n## Credits\n\nThis project is maintained by [open-source contributors](/AUTHORS.rst) and [Vinta Software](https://www.vintasoftware.com/).\n\nThis package was created with [Cookiecutter](https://github.com/audreyr/cookiecutter) and the [`audreyr/cookiecutter-pypackage`](https://github.com/audreyr/cookiecutter-pypackage) project template.\n\n## Commercial Support\n[![alt text](https://avatars2.githubusercontent.com/u/5529080?s=80\u0026v=4 \"Vinta Logo\")](https://www.vinta.com.br/)\n\n[Vinta Software](https://www.vinta.com.br/) is always looking for exciting work, so if you need any commercial support, feel free to get in touch: contact@vinta.com.br\n\n## References\n\n- Zhang, W., Wei, H., Sisman, B., Dong, X. L., Faloutsos, C., \u0026 Page, D. (2020, January). AutoBlock: A hands-off blocking framework for entity matching. In *Proceedings of the 13th International Conference on Web Search and Data Mining* (pp. 
744-752). [(pdf)](https://www.amazon.science/publications/autoblock-a-hands-off-blocking-framework-for-entity-matching)\n- Dai, X., Yan, X., Zhou, K., Wang, Y., Yang, H., \u0026 Cheng, J. (2020, July). Convolutional Embedding for Edit Distance. In *Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval* (pp. 599-608). [(pdf)](https://arxiv.org/abs/2001.11692) [(code)](https://github.com/xinyandai/string-embed/)\n\n## Citations\n\nIf you use Entity Embed in your research, please consider citing it.\n\nBibTeX entry:\n\n```\n@software{entity-embed,\n  title = {{Entity Embed}: Scalable Entity Resolution using Approximate Nearest Neighbors.},\n  author = {Juvenal, Flávio and Vieira, Renato},\n  url = {https://github.com/vintasoftware/entity-embed},\n  version = {0.0.6},\n  date = {2021-07-16},\n  year = {2021}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvintasoftware%2Fentity-embed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvintasoftware%2Fentity-embed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvintasoftware%2Fentity-embed/lists"}