https://github.com/stefanrmmr/differentially_private_synthetic_data
Differentially Private Synthetic Data Generation [DP-SDG] - Experimental Setups & Knowledge Base - WORK IN PROGRESS
data-analysis data-anonymity data-anonymization differential-privacy differentially-private dpgan dpsdg dpwgan dsgvokonform gdpr pategan privacy privacy-enhancing-technologies privacy-preserving-machine-learning privacy-preserving-synthetic-data quasi-identifiers sensitive-data-security synthetic-data synthetic-data-generation synthetic-dataset-generation
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/stefanrmmr/differentially_private_synthetic_data
- Owner: stefanrmmr
- Created: 2022-02-21T15:15:12.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-07-26T16:10:37.000Z (almost 3 years ago)
- Last Synced: 2025-02-28T16:08:44.437Z (3 months ago)
- Topics: data-analysis, data-anonymity, data-anonymization, differential-privacy, differentially-private, dpgan, dpsdg, dpwgan, dsgvokonform, gdpr, pategan, privacy, privacy-enhancing-technologies, privacy-preserving-machine-learning, privacy-preserving-synthetic-data, quasi-identifiers, sensitive-data-security, synthetic-data, synthetic-data-generation, synthetic-dataset-generation
- Language: Jupyter Notebook
- Homepage:
- Size: 5.23 MB
- Stars: 11
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Experimental Implementation of DP-WGAN
Differentially Private Synthetic Data Generation
For **Continuous Data with binary Targets** using the Differentially Private Wasserstein GAN

1) DP-WGAN **Synthetic Data** for "Health care: Heart attack possibility" [Kaggle Dataset](https://www.kaggle.com/datasets/nareshbhat/health-care-data-set-on-heart-attack-possibility?select=heart.csv) --> [view Notebook](https://github.com/stefanrmmr/differentially_private_synthetic_data/blob/main/dpwgan_borealis_heart_disease.ipynb)
2) DP-WGAN **Synthetic Data** for "BankNote Authentication UCI" [Kaggle Dataset](https://www.kaggle.com/datasets/shantanuss/banknote-authentication-uci) --> [view Notebook](https://github.com/stefanrmmr/differentially_private_synthetic_data/blob/main/dpwgan_borealis_banknote.ipynb)

___
### Metrics achieved for DP-WGAN on the Heart Disease Dataset
*Results obtained after multiple attempts using normalized input data, with epsilon ≈ 3.4 and delta = 1e-5.*

___
### Process Steps & Key Concepts
- The data must be in CSV format and must be partitioned into train and test sets before it is fed to the models.
- Missing values are not supported and need to be replaced appropriately by the user before use.
- If the data has both continuous and categorical attributes, it needs to be pre-processed
(discretization for continuous values / encoding for categorical attributes).
- The generative GAN-based ML models are trained on the training dataset.
- The generative model is used to create a synthetic version of the training dataset.
- To compensate for irregularities, multiple GAN generator models are trained.
- To compensate for irregularities, multiple synthetic datasets are generated;
the best-performing dataset, i.e. the one that yields the maximum AUC, is selected.
- **Logistic Regression Classifiers** are trained on the real data as well as on the synthetically generated dataset.
- Both classifiers are evaluated on the held-out real test dataset (preserved for evaluation).
- Relevant metrics (mainly AUC) and visualizations of the correlation matrices of the synthetic datasets were generated.

___
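The evaluation steps listed above (train one classifier per dataset, score each on the held-out real test set, keep the synthetic dataset with the highest AUC) can be sketched as follows. This is a minimal, dependency-free illustration and not the notebook's code: the helper names (`fit_logistic_regression`, `roc_auc`, `auc_on_real_test`, `make_toy_data`) are hypothetical, the logistic regression is a plain gradient-descent stand-in, and the "synthetic" candidates are just noisier toy samples rather than DP-WGAN output. In practice a library such as scikit-learn (`LogisticRegression`, `roc_auc_score`) would normally fill these roles.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model (z is clamped for numerical safety)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(X, y, lr=0.1, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_on_real_test(train, test):
    """Train on `train` (real or synthetic) and report AUC on the real test set."""
    X_train, y_train = train
    X_test, y_test = test
    w, b = fit_logistic_regression(X_train, y_train)
    return roc_auc(y_test, [predict_proba(w, b, x) for x in X_test])


def make_toy_data(n, rng, noise=1.0):
    """Toy data (assumption): two Gaussian blobs with alternating binary labels."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 1.0 if label else -1.0
        X.append([rng.gauss(center, noise), rng.gauss(center, noise)])
        y.append(label)
    return X, y


rng = random.Random(0)
real_train = make_toy_data(200, rng)
real_test = make_toy_data(100, rng)  # held out, used only for evaluation

# Stand-ins for several generated datasets; real candidates would come from the GAN.
synthetic_candidates = [make_toy_data(200, rng, noise=s) for s in (1.5, 2.5, 4.0)]

real_auc = auc_on_real_test(real_train, real_test)
synthetic_aucs = [auc_on_real_test(c, real_test) for c in synthetic_candidates]
best_index = max(range(len(synthetic_aucs)), key=lambda i: synthetic_aucs[i])

print("AUC (trained on real data):", round(real_auc, 3))
print("AUC per synthetic candidate:", [round(a, 3) for a in synthetic_aucs])
print("Selected synthetic dataset index:", best_index)
```

Comparing `real_auc` with the selected candidate's AUC gives the kind of real-versus-synthetic utility gap that the README reports as its main metric.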
### Acknowledgements & Sources
Major parts of this summary notebook were extracted from the [BOREALIS Private Data Generation](https://github.com/BorealisAI/private-data-generation) GitHub repository by BorealisAI. Note that this Jupyter notebook covers only one generative model (DP-WGAN) out of the various possible datasets and generative models for differentially private synthetic data generation. The analysis approaches described above yielded the results shown in this README, as extracted from the original notebook. For more information regarding the **differential privacy parameters Delta & Epsilon**, please refer to this [info-page by Microsoft](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/dwork.pdf).
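
For orientation, the commonly used definition of (epsilon, delta)-differential privacy can be written as below. The notation is introduced here for illustration and does not appear in the notebook: a randomized mechanism $\mathcal{M}$, two datasets $D$ and $D'$ that differ in a single record, and an arbitrary set of outputs $S$.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

Smaller values of epsilon and delta mean a stronger guarantee. For an epsilon of approximately 3.4, the multiplicative factor $e^{\varepsilon}$ is roughly 30, and delta = 1e-5 is the additive slack on top of that bound.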