https://github.com/imvladikon/deduplicator
Simple entity deduplication package
deduplication entity-resolution
- Host: GitHub
- URL: https://github.com/imvladikon/deduplicator
- Owner: imvladikon
- Created: 2023-12-11T20:15:21.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-11T20:30:19.000Z (about 1 year ago)
- Last Synced: 2024-11-09T00:58:57.506Z (3 months ago)
- Topics: deduplication, entity-resolution
- Language: Python
- Homepage:
- Size: 43.9 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
### Deduplication and consolidation package
A simple package that deduplicates records based on specified attributes and consolidates the duplicates.
### Deduplication usage
Usage from Python:
```python
from deduplicator import Deduplicator
from deduplicator.matching import NameSimilarity

records = [{"name": ..., "phone": ...}, ...]

deduplicator = Deduplicator(
    comparators=[("name", NameSimilarity)],  # list of tuples (attribute, comparator)
    aggregation_strategy="mean",
    # attributes to block on; nested attributes can be specified with a dot
    blocking_attributes=["phone"],
    clust_kwargs={"eps": 0.1, "min_samples": 2, "metric": "precomputed"},
)
```

Optionally, you can specify a custom `blocking_rule` instead of `blocking_attributes`:
```python
from deduplicator import Deduplicator
from deduplicator.blockings import SortedNeighbourhoodBlockSplitter
from deduplicator.blockings.rules import (
    ExactGroupBy,
    FirstNCharsGroupBy,
    NLetterAbbreviationGroupBy,
    PhoneticGroupBy,
)
from deduplicator.matching import NameSimilarity

records = [{"contact_name": ..., "phone": ..., "user_id": ...}, ...]

# Rules combine with & (both conditions) and | (either condition)
rule1 = PhoneticGroupBy("contact_name") & ExactGroupBy("phone")
rule2 = FirstNCharsGroupBy("contact_name") & ExactGroupBy("phone")
rule3 = NLetterAbbreviationGroupBy("contact_name", n_letters=2)
blocking_rule = rule1 | rule2 | rule3

deduplicator = Deduplicator(
    comparators=[("contact_name", NameSimilarity())],
    aggregation_strategy="mean",
    blocking_rule=blocking_rule,
    blocking_splitter=SortedNeighbourhoodBlockSplitter(
        fields=["phone", "user_id"],
        max_block_size=128,
    ),
    clust_kwargs={"eps": 0.1, "min_samples": 2, "metric": "precomputed"},
)

# Each result is a cluster id together with the records grouped as duplicates
for cluster_id, duplicates in deduplicator(records, similarity_threshold=0.7):
    print("-" * 100)
    print(cluster_id)
    for duplicate in duplicates:
        print(duplicate["contact_name"])
```
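
To make the block, compare, cluster flow above more concrete, here is a minimal standard-library sketch of the general idea. It is not this package's implementation: `name_similarity`, `block_by`, and `find_duplicates` are hypothetical helpers, `difflib.SequenceMatcher` stands in for `NameSimilarity`, and a simple threshold plus union-find grouping stands in for the clustering step that `clust_kwargs` configures.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Stand-in comparator: case-insensitive difflib ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_by(records, attribute):
    """Exact blocking: group record indices that share the same attribute value."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[record[attribute]].append(idx)
    return list(blocks.values())


def find_duplicates(records, attribute, blocking_attribute, similarity_threshold=0.7):
    """Return groups of record indices (size >= 2) linked by similar attribute values."""
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Only records inside the same block are compared, which avoids
    # scoring every possible pair in the dataset.
    for block in block_by(records, blocking_attribute):
        for i, j in combinations(block, 2):
            score = name_similarity(records[i][attribute], records[j][attribute])
            if score >= similarity_threshold:
                parent[find(i)] = find(j)  # merge the two groups

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return [members for members in clusters.values() if len(members) > 1]


# Made-up sample data for illustration only
records = [
    {"name": "John Smith", "phone": "555-0100"},
    {"name": "Jon Smith", "phone": "555-0100"},
    {"name": "Alice Jones", "phone": "555-0100"},
    {"name": "John Smith", "phone": "555-0199"},
]

for cluster_id, members in enumerate(find_duplicates(records, "name", "phone")):
    print(cluster_id, [records[i]["name"] for i in members])
```

In this sketch the last record is never compared with the first two because it falls into a different `phone` block, which illustrates the trade-off of blocking: far fewer comparisons, at the cost of missing duplicates whose blocking key differs. Combining several rules with `|`, as in the example above, is one way to reduce that risk.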