https://github.com/usc-isi-i2/ppjoin
PPJoin and P4Join Python 3 implementation
https://github.com/usc-isi-i2/ppjoin
deduplication jaccard jaccard-similarity join p4join pper ppjoin privacy-preserving-record-linkage recordlinkage string-similarity
Last synced: 3 months ago
JSON representation
PPJoin and P4Join Python 3 implementation
- Host: GitHub
- URL: https://github.com/usc-isi-i2/ppjoin
- Owner: usc-isi-i2
- License: mit
- Created: 2020-04-20T22:40:19.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-08-18T00:15:31.000Z (about 5 years ago)
- Last Synced: 2023-10-20T23:37:03.715Z (almost 2 years ago)
- Topics: deduplication, jaccard, jaccard-similarity, join, p4join, pper, ppjoin, privacy-preserving-record-linkage, recordlinkage, string-similarity
- Language: Python
- Homepage:
- Size: 172 KB
- Stars: 5
- Watchers: 9
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# PPJoin [](https://doi.org/10.5281/zenodo.3924703)
PPJoin and P4Join Python 3 implementation.
## PPJoin
PPJoin stands for Position Prefix Join which is an efficient set similarity join algorithm using the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) with several filtering techniques.
PPJoin is introduced in
> Xiao, Chuan, et al. "Efficient similarity joins for near-duplicate detection." ACM Transactions on Database Systems (TODS) 36.3 (2011): 1-41.
> This implementation is based on https://github.com/teh/ppjoin.
`join` function takes a list of datasets from different parties and a threshold `t` as input.
Each dataset is a list of records and each record is formed by list of tokens.```
ppjoin.join(datasets: List[List[List[str]]], t: float) -> Set[Tuple[Tuple]]
```The return will be a set of tuples and each tuple contains two inner tuples:
```
((dataset1 index, record index), (dataset2 index, record index))
```Example:
```
from ppjoin import ppjoindef tokenizer(record):
return set(ppjoin.whitespace_tokenizer(record.lower()))ds0 = ['a b d', 'a b c', 'h k']
ds1 = ['a b k', 'a b', 'h k', 'a c h']
ds2 = ['a c h']
ds = [
[tokenizer(w) for w in ds0],
[tokenizer(w) for w in ds1],
[tokenizer(w) for w in ds2]
]result = ppjoin.join(ds, t=0.5)
for r in result:
ds1_id, r1id = r[0]
ds2_id, r2id = r[1]
print('Found pair: {} from dataset {}, {} from dataset {}'.format(
ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id
))
```Output:
```
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b', 'k'] from dataset 1
Found pair: ['h', 'k'] from dataset 0, ['h', 'k'] from dataset 1
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 2
Found pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b', 'k'] from dataset 1
Found pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b'] from dataset 1
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 1
Found pair: ['a', 'c', 'h'] from dataset 1, ['a', 'c', 'h'] from dataset 2
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b'] from dataset 1
```## P4Join
P4Join (Privacy-Preserving Prefix Position Join) adapts PPJoin with bit operations to solve privacy-preserving record linkage problem.
It supports length, prefix and an optimized position filter.This is introduced in
> Sehili, Ziad, et al. "Privacy preserving record linkage with PPJoin." Datenbanksysteme für Business, Technologie und Web (BTW 2015) (2015).
First step of P4Join is to encode original record into bit vector.
`encode_record` takes a list of records each contains a list of tokens, a HMAC key, the length of the bit vector and k which indicates applying how many rounds of combined hash functions.
The return of it is a list of encoded record vectors.```
p4join.encode_record(record: List[List[str]], hmac_key: str, vec_len: int, k: int = 2) -> List[int]
```P4Join's `join` function is similar to PPJoin's but takes encoded datasets as input. The return format is also identical to PPJoin.
```
p4join.join(datasets: List[List[int]], t: float = 0, vec_len: int = 0) -> Set[Tuple[Tuple]]
```Example:
```
from ppjoin import ppjoin, p4joindef tokenizer(record):
return set(ppjoin.whitespace_tokenizer(record.lower()))hash_key = 'key'
vec_len = 40
k = 2ds0 = ['a b d', 'a b c', 'h k']
ds1 = ['a b k', 'a b', 'h k', 'a c h']
ds2 = ['a c h']
ds = [
[tokenizer(w) for w in ds0],
[tokenizer(w) for w in ds1],
[tokenizer(w) for w in ds2]
]ds_encoded = [
[p4join.encode_record(w, hash_key, vec_len, k) for w in d] for d in ds
]result = p4join.join(ds_encoded, t=0.5, vec_len=vec_len)
for r in result:
ds1_id, r1id = r[0]
ds2_id, r2id = r[1]
print('Found pair: {} from dataset {}, {} from dataset {}'.format(
ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id
))
```## Installation
```
pip install -e .
```## Test
To run all unit tests:
```
python -m unittest discover ppjoin/tests
```> Tests on real world dataset Abt-Buy is from [DBGroup of Leipzig](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).