https://github.com/shuyib/data-privacy-pres

A repo that takes you through some principles about data privacy based on the Kenya Data Protection Act and General Data Protection Regulation. Useful for a data person.
data-ethics data-privacy-course differential-privacy federated-learning hacker-statistics k-anonymity machine-learning

Host: GitHub
URL: https://github.com/shuyib/data-privacy-pres
Owner: Shuyib
License: cc0-1.0
Created: 2022-09-29T11:41:10.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2025-04-27T11:03:16.000Z (7 months ago)
Last Synced: 2025-06-21T18:08:15.008Z (5 months ago)
Topics: data-ethics, data-privacy-course, differential-privacy, federated-learning, hacker-statistics, k-anonymity, machine-learning
Language: HTML
Homepage:
Size: 8.07 MB
Stars: 7
Watchers: 1
Forks: 7
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/Shuyib%2Fdata-privacy-pres/HEAD?urlpath=%2Fdoc%2Ftree%2Fpresentation.ipynb)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Shuyib/data-privacy-pres/blob/master/presentation.ipynb)

This is a presentation about data privacy and anonymization. It is aimed at data practitioners, by which I mean people who work with data and people who work alongside them. You can simulate data to build the insurance dataset; see the folder layout below to learn how.

Folders:
.
├── codebook - this folder describes the simulated dataset, in particular what each column of the dataframe means.
│   ├── Insurance_data_ke.txt - created with the csvstat command from [CSVkit](https://csvkit.readthedocs.io/en/latest/index.html).
│   └── insurance_report.html - generated by the [pandas profiling](https://pandas-profiling.ydata.ai/docs/master/index.html) library, a shortcut for fast exploratory data analysis.
├── data - directory where the simulated data should be placed. Run utils/dataloader.py to generate it.
│   ├── feature_engineered_insurance2.csv - data which has undergone feature engineering used in the demo.
│   ├── feature_engineered_insurance.csv - data which was created for the same problem but has issues. Create a new one.
│   ├── Insurance_data_ke.csv - The insurance dataset created by running `python utils/dataloader.py`
│   ├── Insurance_data_ke_featureeng.csv - Insurance dataset created as an intermediate step for feature engineering.
│   └── Organs.csv - Data for a single patient recovering from surgery for heart disease. Contains only their vitals, taken with a thermometer and a pulse oximeter.
├── Dockerfile - a blueprint to run the project in a reproducible way; see "How to run the docker image".
├── environment.yml - a conda virtual environment file.
├── Kenya Data Protection Act - Quick Guide 2021.pdf - a demo for privacy engineering strategy at Deloitte.
├── Makefile - workflow orchestrator. Helps automate code formatting and repetitive tasks.
├── presentation - this directory has the presentations that were used live.
│   ├── presentation.pdf - HTML to PDF using LaTeX.
│   ├── presentation.slides.html - reveal.js presentation. Open with your browser.
│   ├── presentation2.html - Quarto version of the presentation.
│   └── presentation2.pdf - PDF version of the presentation.
├── presentation.ipynb - jupyter notebook with jupyter notebook extensions and reveal.js extension.
├── README.md - the file you are reading.
├── requirements.txt - what packages were used.
├── Screenshot from 2022-09-10 07-03-38.png - demo of PCA using the iris dataset.
└── utils - Scripts used to generate the simulated data
    ├── codebook.sh - the bash script used to create the codebook Insurance_data_ke.txt
    ├── dataloader.py - data generator that uses methods from the faker library and numpy.
    └── Feature_engineering.ipynb - a feature engineering workflow that I use for making the insurance dataset ready for statistical modeling aka machine learning.
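
To give a feel for what simulating such a dataset involves, here is a minimal sketch using only the Python standard library. It is not the code in utils/dataloader.py, which uses faker and numpy; the column names, county list, and premium formula below are assumptions made up for illustration, and the real columns are described in the codebook.

```python
import csv
import io
import random

# Hypothetical columns, for illustration only; see the codebook for the real ones.
FIELDS = ["customer_id", "age", "county", "annual_premium_kes", "made_claim"]
COUNTIES = ["Nairobi", "Mombasa", "Kisumu", "Nakuru"]


def simulate_rows(n_rows, seed=42):
    """Return n_rows fake insurance records as a list of dicts."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    rows = []
    for i in range(n_rows):
        age = rng.randint(18, 80)
        # Assumed relationship: premium rises with age, plus Gaussian noise.
        premium = round(20_000 + 500 * age + rng.gauss(0, 2_000), 2)
        rows.append(
            {
                "customer_id": f"C{i:05d}",
                "age": age,
                "county": rng.choice(COUNTIES),
                "annual_premium_kes": premium,
                "made_claim": int(rng.random() < 0.2),
            }
        )
    return rows


def rows_to_csv(rows):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(simulate_rows(5)))
```

Because no real person is behind any row, data like this can be shared and used in demos without the privacy concerns the presentation discusses.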

# How to make the conda environment locally

If you have Anaconda or Miniconda installed, complete the following steps in the data-privacy-pres directory.

1. Create the virtual environment

```bash
conda env create -f data-privacy-env.yml
```

2. This will create an environment called *data-privacy-env*. You can activate it like this.

```bash
# on conda versions older than 4.4, use: source activate data-privacy-env
conda activate data-privacy-env
```
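
Once the environment is active, a quick way to confirm it is usable is to check that the main packages can be imported. The sketch below assumes numpy, pandas, and faker are among the dependencies (they are mentioned in the folder layout above); requirements.txt remains the authoritative list, and the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Assumed package list; requirements.txt in the repository is authoritative.
EXPECTED = ["numpy", "pandas", "faker"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(EXPECTED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All expected packages are importable.")
```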

# How to run the docker image

Build the docker image
```bash
sudo docker build -t data-privacy-env:v1 .
```

Run the docker image
```bash
sudo docker run -p 9999:9999 data-privacy-env:v1
```
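
If the container starts but nothing seems to load, it can help to check whether anything is listening on the published port. The sketch below assumes the service inside the container listens on port 9999, as the `-p 9999:9999` mapping suggests; `port_is_open` is a hypothetical helper, not part of the repository.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 9999 matches the port published by the docker run command above.
    if port_is_open("127.0.0.1", 9999):
        print("Port 9999 is reachable.")
    else:
        print("Port 9999 is not reachable yet.")
```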

# References

* https://www.manning.com/books/data-privacy
* https://ethics.fast.ai/
* https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf
* https://www.manning.com/books/privacy-preserving-machine-learning
* https://www.manning.com/books/grokking-deep-learning
* https://www.manning.com/books/build-a-career-in-data-science
* https://www.datacamp.com/courses/data-privacy-and-anonymization-in-python
