https://github.com/sodascience/metasyn-disclosure-control
Plugin for metasyn that prevents data from leaking.
https://github.com/sodascience/metasyn-disclosure-control
disclosure-control metasyn plugin privacy-protection synthetic-data
Last synced: 9 months ago
JSON representation
Plugin for metasyn that prevents data from leaking.
- Host: GitHub
- URL: https://github.com/sodascience/metasyn-disclosure-control
- Owner: sodascience
- License: mit
- Created: 2022-07-29T08:45:50.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2025-07-17T12:42:39.000Z (11 months ago)
- Last Synced: 2025-08-31T20:41:20.894Z (10 months ago)
- Topics: disclosure-control, metasyn, plugin, privacy-protection, synthetic-data
- Language: Python
- Homepage:
- Size: 4.2 MB
- Stars: 2
- Watchers: 3
- Forks: 1
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Metasyn disclosure control
[](https://github.com/sodascience/metasyn)
[](https://github.com/sodascience/metasyn-disclosure-control/actions/workflows/python-package.yml)
[](https://www.repostatus.org/#wip)
A privacy plugin for [metasyn](https://github.com/sodascience/metasyn), based on statistical disclosure control (SDC) rules of thumb as found in the following documents:
- The [SDC handbook](https://securedatagroup.org/guides-and-resources/sdc-handbook/) of the Secure Data group in the UK
- The Data Without Boundaries document [Guidelines for output checking](https://wayback.archive-it.org/12090/*/https:/cros-legacy.ec.europa.eu/system/files/dwb_standalone-document_output-checking-guidelines.pdf) (pdf)
- Statistics Netherlands' output guidelines
Producing synthetic data with [metasyn](https://github.com/sodascience/metasyn) is already a great first step towards protecting privacy, but it doesn't adhere to official standards. For example, fitting a uniform distribution will disclose the lowest and highest values in the dataset, which may be a privacy issue in particularly sensitive data. This plugin solves these kinds of problems.
> [!WARNING]
> Currently, the disclosure control plugin is work in progress. Especially in light of this, we disclaim
any responsibility as a result of using this plugin.
## Installing the plugin
To install the package with pip, run the following:
```sh
pip install metasyn-disclosure
```
For the development, installed the package directly through git with the following command:
```sh
pip install git+https://github.com/sodascience/metasyn-disclosure-control.git
```
## Usage
Basic usage for our built-in titanic dataset is as follows:
```py
from metasyncontrib.disclosure import DisclosurePrivacy
from metasyncontrib.disclosure.string import DisclosureFaker
from metasyn import MetaFrame, VarSpec, demo_dataframe
df = demo_dataframe("titanic")
spec = [
VarSpec(name="PassengerId", unique=True),
VarSpec(name="Name", distribution=DisclosureFaker("name")),
]
mf = MetaFrame.fit_dataframe(
df=df,
var_specs=spec,
privacy=DisclosurePrivacy(),
)
mf.synthesize(5)
```
```
shape: (5, 13)
┌─────────────┬────────────────────┬────────┬──────┬───┬────────────┬────────────┬─────────────────────┬────────┐
│ PassengerId ┆ Name ┆ Sex ┆ Age ┆ … ┆ Birthday ┆ Board time ┆ Married since ┆ all_NA │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ cat ┆ i64 ┆ ┆ date ┆ time ┆ datetime[μs] ┆ f32 │
╞═════════════╪════════════════════╪════════╪══════╪═══╪════════════╪════════════╪═════════════════════╪════════╡
│ 0 ┆ Benjamin Cox ┆ female ┆ 27 ┆ … ┆ 1931-12-01 ┆ 14:33:06 ┆ 2022-07-30 02:16:37 ┆ null │
│ 1 ┆ Mr. David Robinson ┆ female ┆ null ┆ … ┆ 1906-02-18 ┆ null ┆ 2022-08-03 13:09:19 ┆ null │
│ 2 ┆ Randy Mosley ┆ male ┆ 24 ┆ … ┆ 1933-01-06 ┆ 15:52:54 ┆ 2022-07-18 18:52:05 ┆ null │
│ 3 ┆ Vincent Maddox ┆ female ┆ 24 ┆ … ┆ 1937-02-10 ┆ 16:58:30 ┆ 2022-07-23 20:29:49 ┆ null │
│ 4 ┆ Kristin Holland ┆ male ┆ 17 ┆ … ┆ 1939-12-09 ┆ 18:07:45 ┆ 2022-08-05 02:41:51 ┆ null │
└─────────────┴────────────────────┴────────┴──────┴───┴────────────┴────────────┴─────────────────────┴────────┘
```
## Implementation details
The rules of thumb, roughly, are:
- at least 10 units
- at least 10 degrees of freedom
- no group disclosure
- no dominance
For most distributions, we implemented micro-aggregation. This technique pre-averages a sorted version of the data, which then supplied to the original fitting mechanism. The idea is that during this pre-averaging step, we ensure that the rules of thumb are followed, so that the fitting method doesn't need to do anything in particular. While from a statistical point of view, we are losing more information than we probably need, it should ensure the safety of the data.
## Contributing
You can contribute to this metasyn plugin by giving feedback in the "Issues" tab, or by creating a pull request.
To create a pull request:
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## Contact
This is a project by the [ODISSEI Social Data Science (SoDa)](https://odissei-data.nl/nl/soda/) team. Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact [Raoul Schram](https://github.com/qubixes) or [Erik-Jan van Kesteren](https://github.com/vankesteren).