[![Project Status: Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.](https://www.repostatus.org/badges/latest/concept.svg)](https://www.repostatus.org/#concept)
![Build status](https://github.com/gmodena/tinydp/workflows/build/badge.svg)

# tinydp
Sprinkle some differential privacy on sklearn pipelines.

# Getting started

This code is a proof of concept. Expect it to break when the data presents degenerate cases. YMMV.

The package and its development deps can be installed with:
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python setup.py install
```

# Private Aggregation of Teacher Ensembles

`PrivateClassifier` implements the Private Aggregation of Teacher Ensembles (PATE) framework to learn from private data.
PATE assumes two main components: a private labelled dataset, used to train an ensemble of "teacher" models, and a public unlabelled dataset, used to train a "student" model.

The private data is partitioned into non-overlapping training sets.
An ensemble of "teachers" is trained independently on these partitions (with no privacy guarantee).
The "teacher" models are then scored on the unlabelled, public "student" dataset. Their predictions are aggregated and perturbed with random noise. A "student" model can then be trained on the public data labelled by the ensemble, instead of on the original, private dataset.

[This blog post](https://nowave.it/course-notes-on-differential-privacy.html) contains some details and references on how
(and why) this works.

# Example

Currently only classification tasks are supported.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from dp.ensemble import PrivateClassifier
from sklearn.metrics import classification_report

X, y = make_classification(n_features=5, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1, n_samples=10000)

# We train teachers on private data, and use them to label public data.
# The public, labelled, dataset can be shared and used to train student models.
X_private, X_public, y_private, y_public = train_test_split(X, y, test_size=0.33, random_state=42)

# PrivateClassifier implements a PATE ensemble of teachers.
# It behaves like a regular sklearn Classifier. The epsilon
# parameter governs the amount of noise added to teachers' predictions.
clf = PrivateClassifier(n_estimators=10, epsilon=0.1, random_state=1)
clf.fit(X_private, y_private)
y_pred = clf.predict(X_public)

print(classification_report(y_public, y_pred))
```
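
Because `PrivateClassifier` behaves like a regular sklearn classifier, it should also compose with sklearn pipelines. A sketch, assuming the `fit`/`predict` interface shown above and reusing the split from the example:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from dp.ensemble import PrivateClassifier

# Standardize features before the PATE ensemble; the classifier is the
# final step of an otherwise ordinary sklearn pipeline.
pipe = make_pipeline(StandardScaler(),
                     PrivateClassifier(n_estimators=10, epsilon=0.1, random_state=1))
pipe.fit(X_private, y_private)
y_pred = pipe.predict(X_public)
```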

# Evaluation

A posteriori analysis can be performed on the aggregated teacher predictions to determine whether the model satisfies
the desired epsilon budget. OpenMined's [pysyft](https://github.com/OpenMined/PySyft) provides some utilities for this type of analysis.
```python
from syft.frameworks.torch.dp import pate

data_dep_eps, data_ind_eps = pate.perform_analysis(teacher_preds=clf.teacher_preds,
                                                    indices=y_public,
                                                    noise_eps=0.1, delta=1e-5)
print("Data Independent Epsilon:", data_ind_eps)
print("Data Dependent Epsilon:", data_dep_eps)
```

If the teachers largely agree, the data-dependent epsilon should come out tighter (smaller) than the data-independent bound.