{"id":29042212,"url":"https://github.com/usc-isi-i2/ppjoin","last_synced_at":"2025-06-26T15:07:08.374Z","repository":{"id":141865283,"uuid":"257420478","full_name":"usc-isi-i2/ppjoin","owner":"usc-isi-i2","description":"PPJoin and P4Join Python 3 implementation","archived":false,"fork":false,"pushed_at":"2020-08-18T00:15:31.000Z","size":176,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":9,"default_branch":"master","last_synced_at":"2023-10-20T23:37:03.715Z","etag":null,"topics":["deduplication","jaccard","jaccard-similarity","join","p4join","pper","ppjoin","privacy-preserving-record-linkage","recordlinkage","string-similarity"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/usc-isi-i2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-04-20T22:40:19.000Z","updated_at":"2023-10-20T23:37:04.612Z","dependencies_parsed_at":null,"dependency_job_id":"4a5c848d-118d-4383-93b8-0d463235be01","html_url":"https://github.com/usc-isi-i2/ppjoin","commit_stats":null,"previous_names":[],"tags_count":2,"template":null,"template_full_name":null,"purl":"pkg:github/usc-isi-i2/ppjoin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/usc-isi-i2%2Fppjoin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/usc-isi-i2%2Fppjoin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/usc-isi-i2%2Fppjoin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/usc-isi-i2%2Fppjoin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/usc-isi-i2","download_url":"https://codeload.github.com/usc-isi-i2/ppjoin/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/usc-isi-i2%2Fppjoin/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262090336,"owners_count":23257127,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deduplication","jaccard","jaccard-similarity","join","p4join","pper","ppjoin","privacy-preserving-record-linkage","recordlinkage","string-similarity"],"created_at":"2025-06-26T15:07:07.512Z","updated_at":"2025-06-26T15:07:08.366Z","avatar_url":"https://github.com/usc-isi-i2.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PPJoin [![doi](https://zenodo.org/badge/DOI/10.5281/zenodo.3924703.svg)](https://doi.org/10.5281/zenodo.3924703)\n\nPPJoin and P4Join Python 3 implementation.\n\n## PPJoin\n\nPPJoin stands for Position Prefix Join which is an efficient set similarity join algorithm using the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) with several filtering techniques.\n\n\nPPJoin is introduced in\n\n\u003e Xiao, Chuan, et al. \"Efficient similarity joins for near-duplicate detection.\" ACM Transactions on Database Systems (TODS) 36.3 (2011): 1-41.\n\n\u003e This implementation is based on https://github.com/teh/ppjoin.\n\n`join` function takes a list of datasets from different parties and a threshold `t` as input. 
\nEach dataset is a list of records, and each record is a list of tokens.\n\n```\nppjoin.join(datasets: List[List[List[str]]], t: float) -\u003e Set[Tuple[Tuple]]\n```\n\nThe return value is a set of tuples, each containing two inner tuples:\n\n```\n((dataset1 index, record index), (dataset2 index, record index))\n```\n\nExample:\n\n```\nfrom ppjoin import ppjoin\n\ndef tokenizer(record):\n    return set(ppjoin.whitespace_tokenizer(record.lower()))\n\n\nds0 = ['a b d', 'a b c', 'h k']\nds1 = ['a b k', 'a b', 'h k', 'a c h']\nds2 = ['a c h']\nds = [\n    [tokenizer(w) for w in ds0],\n    [tokenizer(w) for w in ds1],\n    [tokenizer(w) for w in ds2]\n]\n\n\nresult = ppjoin.join(ds, t=0.5)\n\nfor r in result:\n    ds1_id, r1id = r[0]\n    ds2_id, r2id = r[1]\n    print('Found pair: {} from dataset {}, {} from dataset {}'.format(\n        ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id\n    ))\n```\n\nOutput:\n\n```\nFound pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b', 'k'] from dataset 1\nFound pair: ['h', 'k'] from dataset 0, ['h', 'k'] from dataset 1\nFound pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 2\nFound pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b', 'k'] from dataset 1\nFound pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b'] from dataset 1\nFound pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 1\nFound pair: ['a', 'c', 'h'] from dataset 1, ['a', 'c', 'h'] from dataset 2\nFound pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b'] from dataset 1\n```\n\n## P4Join\n\nP4Join (Privacy-Preserving Prefix Position Join) adapts PPJoin with bit operations to solve the privacy-preserving record linkage problem. \nIt supports a length filter, a prefix filter, and an optimized position filter.\n\n\nP4Join is introduced in\n\n\u003e Sehili, Ziad, et al. \"Privacy preserving record linkage with PPJoin.\" Datenbanksysteme für Business, Technologie und Web (BTW 2015) (2015).\n\n\nThe first step of P4Join is to encode each original record into a bit vector. 
\n`encode_record` takes four arguments: a list of records (each a list of tokens), an HMAC key, the length of the bit vector, and `k`, the number of rounds of combined hash functions to apply.\nIt returns a list of encoded record vectors.\n\n```\np4join.encode_record(record: List[List[str]], hmac_key: str, vec_len: int, k: int = 2) -\u003e List[int]\n```\n\n\nP4Join's `join` function is similar to PPJoin's but takes encoded datasets as input. The return format is identical to PPJoin's.\n\n```\np4join.join(datasets: List[List[int]], t: float = 0, vec_len: int = 0) -\u003e Set[Tuple[Tuple]]\n```\n\nExample:\n\n```\nfrom ppjoin import ppjoin, p4join\n\n\ndef tokenizer(record):\n    return set(ppjoin.whitespace_tokenizer(record.lower()))\n\n\nhash_key = 'key'\nvec_len = 40\nk = 2\n\nds0 = ['a b d', 'a b c', 'h k']\nds1 = ['a b k', 'a b', 'h k', 'a c h']\nds2 = ['a c h']\nds = [\n    [tokenizer(w) for w in ds0],\n    [tokenizer(w) for w in ds1],\n    [tokenizer(w) for w in ds2]\n]\n\nds_encoded = [\n    [p4join.encode_record(w, hash_key, vec_len, k) for w in d] for d in ds\n]\n\n\nresult = p4join.join(ds_encoded, t=0.5, vec_len=vec_len)\n\nfor r in result:\n    ds1_id, r1id = r[0]\n    ds2_id, r2id = r[1]\n    print('Found pair: {} from dataset {}, {} from dataset {}'.format(\n        ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id\n    ))\n```\n\n## Installation\n\n```\npip install -e .\n```\n\n## Test\n\nTo run all unit tests:\n\n```\npython -m unittest discover ppjoin/tests\n```\n\n\u003e The real-world Abt-Buy dataset used in the tests is from [DBGroup of Leipzig](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fusc-isi-i2%2Fppjoin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fusc-isi-i2%2Fppjoin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fusc-isi-i2%2Fppjoin/lists"}