Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/city-of-helsinki/tabular-anonymizer
Anonymization and pseudonymization tools for tabular data.
- Host: GitHub
- URL: https://github.com/city-of-helsinki/tabular-anonymizer
- Owner: City-of-Helsinki
- License: MIT
- Created: 2021-12-07T12:30:28.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-11-04T11:02:47.000Z (about 2 years ago)
- Last Synced: 2024-04-16T07:26:04.521Z (8 months ago)
- Language: Python
- Size: 1.77 MB
- Stars: 5
- Watchers: 2
- Forks: 3
- Open Issues: 1
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# Tabular Anonymizer
Anonymization and pseudonymization tools for tabular data.
This library provides tools and methods for anonymization and privacy protection of data in Pandas DataFrame format.
## Installation
### Using pip-tools
To install this package using [pip-tools](https://pypi.org/project/pip-tools/1.8.0/):
Add `-e git+https://github.com/Datahel/tabular-anonymizer.git#egg=tabular_anonymizer` to your `requirements.in`.

Run:

    $ pip-compile --generate-hashes --allow-unsafe -o requirements.txt requirements.in
    $ pip-sync requirements.txt

### Using pip
To install this package using [pip](https://pip.pypa.io/en/stable/), run:

    $ pip install git+https://github.com/Datahel/tabular-anonymizer.git
### Git clone + pip (if you want to inspect the examples)

Alternatively, you can clone this repository and install the library from the local folder with pip using the `-e` flag:

    $ git clone https://github.com/Datahel/tabular-anonymizer.git
    $ pip install -e tabular-anonymizer

You can then try out the examples found under the `examples/` folder.
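After installing with any of the methods above, a quick import check confirms the package is available in your environment. This is a minimal sketch; it only uses the `DataFrameAnonymizer` import shown in the usage examples below.

    # Minimal post-install sanity check (uses only imports documented in this README)
    import pandas as pd
    from tabular_anonymizer import DataFrameAnonymizer

    print("pandas", pd.__version__)
    print("imported:", DataFrameAnonymizer.__name__)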
## Usage
### Anonymization
DataFrameAnonymizer supports K-anonymity, either alone or together with L-diversity or T-closeness, using the Mondrian algorithm.

#### K-Anonymity in practice
In this simplified example there is a mock dataset of 20 persons with their age, salary and education.
We will anonymize it using the Mondrian algorithm with K=5.

![Dataframe before anonymization](documents/mondrian_data.png?raw=true "Dataframe")

After the Mondrian partitioning process (with K=5), the data is divided into groups of at least 5 rows, using age and salary as dimensions.

![Values after mondrian partitioning](documents/mondrian_plot.png?raw=true "Partitioned data")

In the anonymization process a new dataframe is constructed and the groups are split into separate rows by the sensitive attribute (education).

![Anonymized dataset](documents/mondrian_anonymized.png?raw=true "Anonymized data with K=5")

You can test this in practice with `examples/plot_partitions.py`.
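To reproduce the scenario above end to end, the sketch below builds a mock dataset of 20 persons and applies K-anonymity with K=5 through `DataFrameAnonymizer`, the API documented in the examples that follow. The column names and random values are illustrative, not taken from the repository.

    # Sketch: K-anonymity (k=5) on a mock dataset of 20 persons, mirroring the
    # figures above. Column names and values are made up for illustration.
    import random

    import pandas as pd
    from tabular_anonymizer import DataFrameAnonymizer

    random.seed(42)
    df = pd.DataFrame({
        "age": [random.randint(20, 65) for _ in range(20)],
        "salary": [random.randint(2000, 7000) for _ in range(20)],
        "education": [random.choice(["basic", "secondary", "higher"]) for _ in range(20)],
    })

    # "education" is the sensitive attribute; age and salary are the
    # quasi-identifier dimensions that Mondrian partitions on.
    p = DataFrameAnonymizer(["education"])
    df_anonymized = p.anonymize_k_anonymity(df, k=5)
    print(df_anonymized)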
#### Example: K-Anonymity using DataFrameAnonymizer
    import pandas as pd
    from tabular_anonymizer import DataFrameAnonymizer

    # Setup dataframe
    df = pd.read_csv("./adult.csv", sep=",")

    # Define sensitive attributes
    sensitive_columns = ['label']

    # Anonymize dataframe with k=10
    p = DataFrameAnonymizer(sensitive_columns)
    df_anonymized = p.anonymize_k_anonymity(df, k=10)

#### Example: K-Anonymity with L-diversity using DataFrameAnonymizer
    import pandas as pd
    from tabular_anonymizer import DataFrameAnonymizer

    # Setup dataframe
    df = pd.read_csv("./adult.csv", sep=",")

    # Define sensitive attributes
    sensitive_columns = ['label']

    # Anonymize dataframe with k=10 and l=2
    p = DataFrameAnonymizer(sensitive_columns)
    df_anonymized = p.anonymize_l_diversity(df, k=10, l=2)

### Pseudonymization
The pseudonymization tool is intended for combining data from multiple sources. Both datasets share an identifier column, and the function `combine_and_pseudonymize` replaces the identifier with a hash.
![Dataframe before pseudonymization](documents/pseudonymization_before.png?raw=true "Dataframe")

![Dataframe after pseudonymization](documents/pseudonymization_after.png?raw=true "Dataframe after pseudonymization of education column")

![Encryption process](documents/pseudonymization_encryption.png?raw=true "Pseudonymization and encryption process")
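As a concrete illustration of that flow, the sketch below creates two toy dataframes that share an `id` column and merges them with `utils.combine_and_pseudonymize`, the call shown later in this README. The column names and values are invented for the example.

    # Sketch: two toy dataframes sharing a direct identifier column "id".
    # combine_and_pseudonymize merges them and replaces "id" with a hash,
    # as described above. The data here is illustrative only.
    import pandas as pd
    from tabular_anonymizer import utils

    df1 = pd.DataFrame({"id": ["0401234567", "0507654321"], "age": [34, 51]})
    df2 = pd.DataFrame({"id": ["0401234567", "0507654321"], "education": ["higher", "basic"]})

    df_c = utils.combine_and_pseudonymize(df1, df2, "id")
    print(df_c)  # the "id" values are now hashes instead of phone numbers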
#### Example: Pseudonymization of dataframe column with generated secret key
    import pandas as pd
    from tabular_anonymizer import utils

    file1 = "examples/adult.csv"
    df = pd.read_csv(file1, sep=",", index_col=0)

    # Simple way
    utils.pseudonymize(df, 'column_name', generate_nonce=True)

#### Example: Pseudonymization of multiple dataframes and columns with a shared secret key
    # Let's assume we have two dataframes, df1 and df2.
    # Both dataframes contain a common identifier in columns column_name1 and column_name2, for example a birth date.
    # If you want to merge these datasets, you can encrypt both columns using a shared salt first.
    from tabular_anonymizer import utils

    # Generate nonces to be used as salt
    nonce1 = utils.create_nonce()  # Generated random salt #1
    nonce2 = utils.create_nonce()  # Generated random salt #2

    # Pseudonymize the given columns using sha3_224 with two salts
    utils.pseudonymize(df1, 'column_name1', nonce1, nonce2)
    utils.pseudonymize(df2, 'column_name2', nonce1, nonce2)

#### Example: Combining two datasets with a shared common column using pseudonymization
    # Let's assume that dataframes df1 and df2 have equal size and a common column "id", which is a direct identifier
    # (such as a phone number). We can combine (merge) these two datasets and pseudonymize the values in the ID column
    # so it is no longer sensitive information.
    from tabular_anonymizer import utils

    # Combine (merge) two datasets with a common index column and pseudonymize
    df_c = utils.combine_and_pseudonymize(df1, df2, 'id')

#### Example: Post-processing & partial masking
    # Convert intervals to partially masked values: ['20220', '20210'] => '202**'
    generalize(df, 'zip', generalize_partial_masking)

    # Original table
    # id | zip
    # 1  | '20220'
    # 2  | '20210'

    # Anonymized table (K=2)
    # zip
    # ['20220', '20210']

    # After partial masking
    # zip
    # '202**'
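Independently of the `generalize` helpers used above, the masking step itself can be illustrated in plain Python. This is only a sketch of the idea, not the library's implementation:

    # Illustrative only: a plain-Python version of the partial-masking idea shown
    # above. It is NOT the library's generalize_partial_masking implementation.
    import os

    def partial_mask(values):
        """Keep the longest common prefix and mask the remaining characters with '*'."""
        prefix = os.path.commonprefix(values)
        width = max(len(v) for v in values)
        return prefix + "*" * (width - len(prefix))

    print(partial_mask(["20220", "20210"]))  # -> '202**'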
### Example Jupyter notebooks

Besides the example scripts, Jupyter notebooks can be found in the `examples/` folder for testing purposes.

    examples/sample_notebook.ipynb   # Example of how to use the tabular anonymizer
    examples/check_anonymity.ipynb   # Example of validating anonymizer results

#### Run notebooks in GitHub Codespaces
If you use GitHub Codespaces, you can execute the example scripts directly in the VS Code browser interface. The required plugins are included in the Codespaces container configuration.
#### Running JupyterLab as a server in Codespaces

Codespaces allows you to run notebooks directly in the web interface. However, if you need to run a JupyterLab server in Codespaces, follow these instructions:

1. Start the JupyterLab server in the Codespaces terminal using the following command:

        jupyter-lab --ip 0.0.0.0 --config .devcontainer/jupyter-server-config.py --no-browser

2. Observe the JupyterLab server log and click the link pointing to 127.0.0.1, e.g. http://127.0.0.1:8888/lab?token=... A small popup with a link titled "Follow link using forwarded port" appears. Click the link and Codespaces will redirect you to the JupyterLab user interface.
![Jupyter_server_codespaces](documents/jupyterlab_codespaces.png?raw=true "JupyterLab in codespaces")
#### Run examples and JupyterLab in a local Docker environment

You can run JupyterLab and experiment with the tabular anonymizer in a Docker container:

    docker build . -t tabular-anonymizer && docker run --rm -it -p 8888:8888 tabular-anonymizer

Open http://127.0.0.1:8888 in your web browser and navigate to `examples/sample_notebook.ipynb`.

Hit Ctrl+C to quit the container.
## Acknowledgements
The Mondrian algorithm of this library is based on the [glassonion1/AnonyPy](https://github.com/glassonion1/anonypy) Mondrian implementation.

The visualization example (`plot_partitions.py`) is based on the [Nuclearstar/K-Anonymity](https://github.com/Nuclearstar/K-Anonymity) plot implementation.