https://github.com/usc-isi-i2/ppjoin

PPJoin and P4Join Python 3 implementation
https://github.com/usc-isi-i2/ppjoin

deduplication jaccard jaccard-similarity join p4join pper ppjoin privacy-preserving-record-linkage recordlinkage string-similarity

Host: GitHub
URL: https://github.com/usc-isi-i2/ppjoin
Owner: usc-isi-i2
License: mit
Created: 2020-04-20T22:40:19.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2020-08-18T00:15:31.000Z (about 5 years ago)
Last Synced: 2023-10-20T23:37:03.715Z (almost 2 years ago)
Topics: deduplication, jaccard, jaccard-similarity, join, p4join, pper, ppjoin, privacy-preserving-record-linkage, recordlinkage, string-similarity
Language: Python
Homepage:
Size: 172 KB
Stars: 5
Watchers: 9
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# PPJoin [![doi](https://zenodo.org/badge/DOI/10.5281/zenodo.3924703.svg)](https://doi.org/10.5281/zenodo.3924703)

PPJoin and P4Join Python 3 implementation.

## PPJoin

PPJoin (Position Prefix Join) is an efficient set similarity join algorithm that uses the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) together with several filtering techniques.
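For reference, the Jaccard similarity of two token sets is the size of their intersection divided by the size of their union. The helper below is a minimal sketch for illustration; the name `jaccard` and the convention for two empty records are assumptions, not part of this package.

```python
def jaccard(x, y):
    """Jaccard similarity of two token collections: |x & y| / |x | y|."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # convention chosen for this sketch: two empty records are identical
    return len(x & y) / len(x | y)


# 2 shared tokens ('a', 'b') out of 4 distinct tokens
print(jaccard(['a', 'b', 'd'], ['a', 'b', 'k']))  # 0.5
```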

PPJoin was introduced in:

> Xiao, Chuan, et al. "Efficient similarity joins for near-duplicate detection." ACM Transactions on Database Systems (TODS) 36.3 (2011): 1-41.

> This implementation is based on https://github.com/teh/ppjoin.
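Two of the filtering techniques described in the paper, the length filter and the prefix filter, can be sketched as follows. This is an illustration of the idea, not the code used in this package: the function names are hypothetical, and alphabetical order stands in for the global token order (the paper orders tokens by increasing frequency).

```python
import math


def length_filter(x_len, y_len, t):
    """Jaccard >= t requires the smaller record to have at least
    t * (size of the larger record) tokens."""
    small, large = sorted((x_len, y_len))
    return small >= t * large


def prefix_length(n, t):
    """Number of leading tokens to inspect for a record of n sorted tokens."""
    return n - math.ceil(t * n) + 1


def prefix_filter(x, y, t):
    """x and y are token lists sorted by the same global order.
    If Jaccard(x, y) >= t, their prefixes share at least one token,
    so pairs with disjoint prefixes can be skipped."""
    px = x[:prefix_length(len(x), t)]
    py = y[:prefix_length(len(y), t)]
    return bool(set(px) & set(py))


print(prefix_length(3, 0.5))                            # 2
print(prefix_filter(['a', 'b', 'd'], ['a', 'b', 'k'], 0.5))  # True: candidate pair
print(prefix_filter(['a', 'b', 'd'], ['h', 'k'], 0.5))       # False: pruned
```

Both filters only discard pairs that cannot reach the threshold; surviving candidates still need an exact similarity check.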

The `join` function takes a list of datasets from different parties and a threshold `t` as input.

Each dataset is a list of records, and each record is a list of tokens.

```
ppjoin.join(datasets: List[List[List[str]]], t: float) -> Set[Tuple[Tuple]]
```

It returns a set of tuples, each containing two inner tuples:

```
((dataset1 index, record index), (dataset2 index, record index))
```

Example:

```
from ppjoin import ppjoin

def tokenizer(record):
    # Lowercase the record and split it into whitespace-separated tokens
    return set(ppjoin.whitespace_tokenizer(record.lower()))

# Three datasets, one per party
ds0 = ['a b d', 'a b c', 'h k']
ds1 = ['a b k', 'a b', 'h k', 'a c h']
ds2 = ['a c h']

ds = [
    [tokenizer(w) for w in ds0],
    [tokenizer(w) for w in ds1],
    [tokenizer(w) for w in ds2]
]

# Find record pairs whose Jaccard similarity reaches the threshold
result = ppjoin.join(ds, t=0.5)

for r in result:
    # Each result is ((dataset index, record index), (dataset index, record index))
    ds1_id, r1id = r[0]
    ds2_id, r2id = r[1]
    print('Found pair: {} from dataset {}, {} from dataset {}'.format(
        ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id
    ))
```

Output (the result is a set, so the order of the pairs is not guaranteed):

```
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b', 'k'] from dataset 1
Found pair: ['h', 'k'] from dataset 0, ['h', 'k'] from dataset 1
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 2
Found pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b', 'k'] from dataset 1
Found pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b'] from dataset 1
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 1
Found pair: ['a', 'c', 'h'] from dataset 1, ['a', 'c', 'h'] from dataset 2
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b'] from dataset 1
```
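To make the semantics of the output above concrete, here is a brute-force reference join that does not use this package. It is a sketch under one assumption inferred from the sample output: only records from different datasets are compared (no listed pair has both records in the same dataset). The name `brute_force_join` is hypothetical.

```python
from itertools import combinations


def brute_force_join(datasets, t):
    """Compare every pair of records from different datasets (assumption:
    records within the same dataset are never paired) and keep the pairs
    whose Jaccard similarity is at least t."""
    pairs = set()
    for (i, d1), (j, d2) in combinations(enumerate(datasets), 2):
        for r1, x in enumerate(d1):
            for r2, y in enumerate(d2):
                x_set, y_set = set(x), set(y)
                union = x_set | y_set
                if union and len(x_set & y_set) / len(union) >= t:
                    pairs.add(((i, r1), (j, r2)))
    return pairs


ds = [
    [s.split() for s in ['a b d', 'a b c', 'h k']],
    [s.split() for s in ['a b k', 'a b', 'h k', 'a c h']],
    [s.split() for s in ['a c h']],
]
expected = brute_force_join(ds, t=0.5)
print(len(expected))  # 8, the same number of pairs as listed above
```

A quadratic comparison like this is only practical for small inputs, but it can serve as a cross-check when experimenting with thresholds.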

## P4Join

P4Join (Privacy-Preserving Prefix Position Join) adapts PPJoin with bit operations to solve the privacy-preserving record linkage problem.

It supports the length filter, the prefix filter, and an optimized position filter.

P4Join was introduced in:

> Sehili, Ziad, et al. "Privacy preserving record linkage with PPJoin." Datenbanksysteme für Business, Technologie und Web (BTW 2015) (2015).
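As a rough illustration of how set operations carry over to bit vectors, the sketch below computes a Jaccard similarity and a length filter with bitwise AND/OR and bit counts. It assumes that each encoded record is a Python `int` used as a bit vector; the function names are hypothetical and this is not the package's implementation.

```python
def popcount(v):
    """Number of set bits in a non-negative integer bit vector."""
    return bin(v).count('1')


def bit_jaccard(x, y):
    """Jaccard similarity on bit vectors: |x AND y| / |x OR y|."""
    union = popcount(x | y)
    return popcount(x & y) / union if union else 1.0


def bit_length_filter(x, y, t):
    """The vector with fewer set bits must have at least
    t * (set bits of the other vector) to reach similarity t."""
    cx, cy = popcount(x), popcount(y)
    return min(cx, cy) >= t * max(cx, cy)


x = 0b101101
y = 0b100111
print(bit_jaccard(x, y))             # 3 common bits / 5 bits in the union = 0.6
print(bit_length_filter(x, y, 0.5))  # True: 4 set bits vs 4 set bits
```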

The first step of P4Join is to encode each original record into a bit vector.

`encode_record` takes four arguments: a list of records (each a list of tokens), an HMAC key, the length of the bit vector, and `k`, the number of rounds of combined hash functions to apply.

It returns a list of encoded record vectors.

```
p4join.encode_record(record: List[List[str]], hmac_key: str, vec_len: int, k: int = 2) -> List[int]
```
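This README does not show how `encode_record` works internally. As an illustration of the general idea only, the sketch below encodes one record's tokens into a Bloom-filter style bit vector using two keyed HMACs combined by double hashing. Everything here is an assumption: the function name `encode_tokens_sketch`, the choice of SHA-1 and SHA-256, and the use of a Python `int` as the bit vector.

```python
import hashlib
import hmac


def encode_tokens_sketch(tokens, hmac_key, vec_len, k=2):
    """Hypothetical encoding: set up to k bits per token in a vec_len-bit
    vector, at positions (h1 + i * h2) mod vec_len for i in 0..k-1."""
    key = hmac_key.encode('utf-8')
    vec = 0
    for token in tokens:
        msg = token.encode('utf-8')
        # Two keyed hashes of the same token, interpreted as integers
        h1 = int.from_bytes(hmac.new(key, msg, hashlib.sha1).digest(), 'big')
        h2 = int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), 'big')
        for i in range(k):
            vec |= 1 << ((h1 + i * h2) % vec_len)
    return vec


v = encode_tokens_sketch(['a', 'b', 'c'], 'key', vec_len=40, k=2)
print(format(v, '040b'))  # 40-bit vector with at most 6 bits set
```

Because the hashes are keyed, parties that share the key produce comparable vectors without exchanging the original tokens.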

P4Join's `join` function is similar to PPJoin's but takes encoded datasets as input. The return format is identical to PPJoin's.

```
p4join.join(datasets: List[List[int]], t: float = 0, vec_len: int = 0) -> Set[Tuple[Tuple]]
```

Example:

```
from ppjoin import ppjoin, p4join

def tokenizer(record):
    # Lowercase the record and split it into whitespace-separated tokens
    return set(ppjoin.whitespace_tokenizer(record.lower()))

# Encoding parameters shared by all parties
hash_key = 'key'
vec_len = 40
k = 2

ds0 = ['a b d', 'a b c', 'h k']
ds1 = ['a b k', 'a b', 'h k', 'a c h']
ds2 = ['a c h']

ds = [
    [tokenizer(w) for w in ds0],
    [tokenizer(w) for w in ds1],
    [tokenizer(w) for w in ds2]
]

# Encode every record of every dataset into a bit vector
ds_encoded = [
    [p4join.encode_record(w, hash_key, vec_len, k) for w in d] for d in ds
]

result = p4join.join(ds_encoded, t=0.5, vec_len=vec_len)

for r in result:
    # Indices refer back to the original (unencoded) datasets
    ds1_id, r1id = r[0]
    ds2_id, r2id = r[1]
    print('Found pair: {} from dataset {}, {} from dataset {}'.format(
        ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id
    ))
```

## Installation

```
pip install -e .
```

## Test

To run all unit tests:

```
python -m unittest discover ppjoin/tests
```

> The real-world Abt-Buy dataset used in the tests is from the [DBGroup of Leipzig](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).
