Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/city-of-helsinki/tabular-anonymizer
Anonymization and pseudonymization tools for tabular data.
- Host: GitHub
- URL: https://github.com/city-of-helsinki/tabular-anonymizer
- Owner: City-of-Helsinki
- License: MIT
- Created: 2021-12-07T12:30:28.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-11-04T11:02:47.000Z (about 2 years ago)
- Last Synced: 2024-04-16T07:26:04.521Z (8 months ago)
- Language: Python
- Size: 1.77 MB
- Stars: 5
- Watchers: 2
- Forks: 3
- Open Issues: 1
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# Tabular Anonymizer
Anonymization and pseudonymization tools for tabular data.
This library provides tools and methods for anonymization and privacy protection of data in Pandas DataFrame format.
## Installation
### Using pip-tools
To install this package using [pip-tools](https://pypi.org/project/pip-tools/1.8.0/):
Add `-e git+https://github.com/Datahel/tabular-anonymizer.git#egg=tabular_anonymizer` to your `requirements.in`.

Run:

    $ pip-compile --generate-hashes --allow-unsafe -o requirements.txt requirements.in
    $ pip-sync requirements.txt

### Using pip
To install this package using [pip](https://pip.pypa.io/en/stable/), run:

    $ pip install git+https://github.com/Datahel/tabular-anonymizer.git
### Git clone + pip (if you want to inspect the examples)

Alternatively, you can clone this repository and install the library from the local folder with pip using the `-e` flag:

    $ git clone https://github.com/Datahel/tabular-anonymizer.git
    $ pip install -e tabular-anonymizer

You can then try out the examples found under the `examples/` folder.
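After installing with any of the methods above, a quick import check confirms the package is available in your environment. This is a minimal sketch; it only uses the `DataFrameAnonymizer` import shown in the usage examples below.

    # Minimal post-install sanity check (uses only imports documented in this README)
    import pandas as pd
    from tabular_anonymizer import DataFrameAnonymizer

    print("pandas", pd.__version__)
    print("imported:", DataFrameAnonymizer.__name__)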
## Usage
### Anonymization
DataFrameAnonymizer supports K-anonymity, either alone or together with L-diversity or T-closeness, using the Mondrian algorithm.

#### K-Anonymity in practice
In this simplified example there is a mock dataset of 20 persons with their age, salary and education.
We will anonymize it using the Mondrian algorithm with K=5.

![Dataframe before anonymization](documents/mondrian_data.png?raw=true "Dataframe")

After the Mondrian partitioning process (with K=5), the data is divided into groups of at least 5 rows, using age and salary as dimensions.

![Values after mondrian partitioning](documents/mondrian_plot.png?raw=true "Partitioned data")

In the anonymization process a new dataframe is constructed and the groups are split into separate rows by the sensitive attribute (education).

![Anonymized dataset](documents/mondrian_anonymized.png?raw=true "Anonymized data with K=5")

You can test this in practice with `examples/plot_partitions.py`.
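To reproduce the scenario above end to end, the sketch below builds a mock dataset of 20 persons and applies K-anonymity with K=5 through `DataFrameAnonymizer`, the API documented in the examples that follow. The column names and random values are illustrative, not taken from the repository.

    # Sketch: K-anonymity (k=5) on a mock dataset of 20 persons, mirroring the
    # figures above. Column names and values are made up for illustration.
    import random

    import pandas as pd
    from tabular_anonymizer import DataFrameAnonymizer

    random.seed(42)
    df = pd.DataFrame({
        "age": [random.randint(20, 65) for _ in range(20)],
        "salary": [random.randint(2000, 7000) for _ in range(20)],
        "education": [random.choice(["basic", "secondary", "higher"]) for _ in range(20)],
    })

    # "education" is the sensitive attribute; age and salary are the
    # quasi-identifier dimensions that Mondrian partitions on.
    p = DataFrameAnonymizer(["education"])
    df_anonymized = p.anonymize_k_anonymity(df, k=5)
    print(df_anonymized)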
#### Example: K-Anonymity using DataFrameAnonymizer
    import pandas as pd
    from tabular_anonymizer import DataFrameAnonymizer

    # Setup dataframe
    df = pd.read_csv("./adult.csv", sep=",")

    # Define sensitive attributes
    sensitive_columns = ['label']

    # Anonymize dataframe with k=10
    p = DataFrameAnonymizer(sensitive_columns)
    df_anonymized = p.anonymize_k_anonymity(df, k=10)

#### Example: K-Anonymity with L-diversity using DataFrameAnonymizer
    import pandas as pd
    from tabular_anonymizer import DataFrameAnonymizer

    # Setup dataframe
    df = pd.read_csv("./adult.csv", sep=",")

    # Define sensitive attributes
    sensitive_columns = ['label']

    # Anonymize dataframe with k=10 and l=2
    p = DataFrameAnonymizer(sensitive_columns)
    df_anonymized = p.anonymize_l_diversity(df, k=10, l=2)

### Pseudonymization
The pseudonymization tool is intended for combining data from multiple sources. Both datasets share an identifier column, and the function `combine_and_pseudonymize` replaces the identifier with a hash.
![Dataframe before pseudonymization](documents/pseudonymization_before.png?raw=true "Dataframe")

![Dataframe after pseudonymization](documents/pseudonymization_after.png?raw=true "Dataframe after pseudonymization of education column")

![Encryption process](documents/pseudonymization_encryption.png?raw=true "Pseudonymization and encryption process")
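As a concrete illustration of that flow, the sketch below creates two toy dataframes that share an `id` column and merges them with `utils.combine_and_pseudonymize`, the call shown later in this README. The column names and values are invented for the example.

    # Sketch: two toy dataframes sharing a direct identifier column "id".
    # combine_and_pseudonymize merges them and replaces "id" with a hash,
    # as described above. The data here is illustrative only.
    import pandas as pd
    from tabular_anonymizer import utils

    df1 = pd.DataFrame({"id": ["0401234567", "0507654321"], "age": [34, 51]})
    df2 = pd.DataFrame({"id": ["0401234567", "0507654321"], "education": ["higher", "basic"]})

    df_c = utils.combine_and_pseudonymize(df1, df2, "id")
    print(df_c)  # the "id" values are now hashes instead of phone numbers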
#### Example: Pseudonymization of dataframe column with generated secret key
    import pandas as pd
    from tabular_anonymizer import utils

    file1 = "examples/adult.csv"
    df = pd.read_csv(file1, sep=",", index_col=0)

    # Simple way
    utils.pseudonymize(df, 'column_name', generate_nonce=True)

#### Example: Pseudonymization of multiple dataframes and columns with a shared secret key
    # Let's assume we have two dataframes, df1 and df2.
    # Both dataframes contain a common identifier in columns column_name1 and column_name2, for example a birth date.
    # If you want to merge these datasets, you can encrypt both columns using a shared salt first.
    from tabular_anonymizer import utils

    # Generate nonces to be used as salt
    nonce1 = utils.create_nonce()  # Generated random salt #1
    nonce2 = utils.create_nonce()  # Generated random salt #2

    # Pseudonymize the given columns using sha3_224 with two salts
    utils.pseudonymize(df1, 'column_name1', nonce1, nonce2)
    utils.pseudonymize(df2, 'column_name2', nonce1, nonce2)

#### Example: Combining two datasets with a shared common column using pseudonymization
    # Let's assume that dataframes df1 and df2 have equal size and a common column "id", which is a direct identifier
    # (such as a phone number). We can combine (merge) these two datasets and pseudonymize the values in the ID column
    # so it is no longer sensitive information.
    from tabular_anonymizer import utils

    # Combine (merge) two datasets with a common index column and pseudonymize
    df_c = utils.combine_and_pseudonymize(df1, df2, 'id')

#### Example: Post-processing & partial masking
    # Convert intervals to partially masked values: ['20220', '20210'] => '202**'
    generalize(df, 'zip', generalize_partial_masking)

    # Original table
    # id | zip
    # 1  | '20220'
    # 2  | '20210'

    # Anonymized table (K=2)
    # zip
    # ['20220', '20210']

    # After partial masking
    # zip
    # '202**'
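Independently of the `generalize` helpers used above, the masking step itself can be illustrated in plain Python. This is only a sketch of the idea, not the library's implementation:

    # Illustrative only: a plain-Python version of the partial-masking idea shown
    # above. It is NOT the library's generalize_partial_masking implementation.
    import os

    def partial_mask(values):
        """Keep the longest common prefix and mask the remaining characters with '*'."""
        prefix = os.path.commonprefix(values)
        width = max(len(v) for v in values)
        return prefix + "*" * (width - len(prefix))

    print(partial_mask(["20220", "20210"]))  # -> '202**'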
### Example Jupyter notebooks

Besides the example scripts, Jupyter notebooks can be found in the `examples/` folder for testing purposes.

    examples/sample_notebook.ipynb   # Example of how to use the tabular anonymizer
    examples/check_anonymity.ipynb   # Example of validating anonymizer results

#### Run notebooks in GitHub Codespaces
If you use GitHub Codespaces, you can execute the example scripts directly in the VS Code browser interface. The required plugins are included in the Codespaces container configuration.
#### Running JupyterLab as a server in Codespaces

Codespaces allows you to run notebooks directly in the web interface. However, if you need to run a JupyterLab server in Codespaces, follow these instructions:

1. Start the JupyterLab server in the Codespaces terminal using the following command:

        jupyter-lab --ip 0.0.0.0 --config .devcontainer/jupyter-server-config.py --no-browser

2. Observe the JupyterLab server log and click the link pointing to 127.0.0.1, e.g. http://127.0.0.1:8888/lab?token=... A small popup with a link titled "Follow link using forwarded port" appears. Click the link and Codespaces will redirect you to the JupyterLab user interface.
![Jupyter_server_codespaces](documents/jupyterlab_codespaces.png?raw=true "JupyterLab in codespaces")
#### Run examples and JupyterLab in a local Docker environment

You can run JupyterLab and experiment with the tabular anonymizer in a Docker container:

    docker build . -t tabular-anonymizer && docker run --rm -it -p 8888:8888 tabular-anonymizer

Open http://127.0.0.1:8888 in your web browser and navigate to `examples/sample_notebook.ipynb`.

Hit Ctrl+C to quit the container.
## Acknowledgements
The Mondrian algorithm of this library is based on the [glassonion1/AnonyPy](https://github.com/glassonion1/anonypy) Mondrian implementation.

The visualization example (`plot_partitions.py`) is based on the [Nuclearstar/K-Anonymity](https://github.com/Nuclearstar/K-Anonymity) plot implementation.