https://github.com/joogswastaken/bf-pprl-attack

Implementation of a frequency-based re-identification attack on Bloom filters in PPRL protocols
https://github.com/joogswastaken/bf-pprl-attack

cybersec infosec poc privacy python security

Last synced: 10 months ago
JSON representation

Implementation of a frequency-based re-identification attack on Bloom filters in PPRL protocols

Host: GitHub
URL: https://github.com/joogswastaken/bf-pprl-attack
Owner: JoogsWasTaken
License: mit
Created: 2022-12-11T18:34:56.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-12-11T19:17:11.000Z (over 3 years ago)
Last Synced: 2025-05-29T00:07:48.018Z (about 1 year ago)
Topics: cybersec, infosec, poc, privacy, python, security
Language: Python
Homepage: https://eulenbu.de/posts/bf-pprl-attacks/
Size: 18.6 KB
Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Frequency-based attack on Bloom filters in PPRL

This repository contains an implementation of a frequency-based re-identification attack on Bloom filters in privacy-preserving record linkage protocols.
The attack was first described by Christen et al.[^1] and discussed on my personal website in a series dedicated to Bloom-filter-based PPRL.

## How to use

You will need a frequency table of values you want to mask using Bloom filters.
You can find an example in the [data directory](./data/) using the top 1k first names in Germany[^2].
The first column must contain values and the second column their respective absolute frequencies.
The table must be CSV-formatted.

Using this project assumes you have [Poetry](https://python-poetry.org/) installed.
Run `poetry install` in the root of this repository, then drop into a virtual environment using `poetry shell`.

To perform the attack the same way the authors did, you will need to compute the amount of hash values *k*.
Choose a filter size *m* (e.g. 256) and token size *q* (e.g. 2) and run the following script.

```
$ python compute_optimal_k.py data/german-names.csv -m 256 -q 2
24.19163983958364
```

In this example, *k* should be 24.
Next, generate a list of CLKs based on the frequency information of your word list.
It's advisable that you create an output directory first, e.g. using `mkdir -p out`.
Select an amount of CLKs to generate, e.g. 1m, then run the following script with your previously selected value *k*.

```
$ python generate_bf.py data/german-names.csv out/german-names-masked.csv -n 1000000 -q 2 -m 256 -k 24
```

Finally, run the attack with the following script.
You can enable CSV output with the `--stdout-csv` flag which will print the amount of exact matches, potential matches, false matches and no matches as comma-separated values.
The output file contains the detailed guesses for each CLK.

```
$ python perform_attack.py data/german-names.csv out/german-names-masked.csv out/german-names-guess.csv -q 2
TOTAL WORD COUNT: 1000
Exact matches: 3
Potential matches: 0
False matches: 81
No matches: 916
```

## References

[^1]: Christen, Peter, et al. "Efficient cryptanalysis of bloom filters for privacy-preserving record linkage." Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Cham, 2017.
[^2]: Taken from Forebears' "Most Common Last Names In Germany" ([URL](https://forebears.io/germany/surnames), [Archive](https://web.archive.org/web/20220922090455/https://forebears.io/germany/surnames))

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/joogswastaken/bf-pprl-attack

Awesome Lists containing this project

README