https://github.com/alfurka/synloc

A Python Package to Create Synthetic Tabular Data
https://github.com/alfurka/synloc

clustering constrained-clustering copulas data-augmentation distributions k-means knn local-sampling machine-learning multivariate-distributions nonparametric-distribution oversampling python resampling sampling semi-parametric-modeling statistics synthetic synthetic-data synthetic-dataset-generation

Last synced: 5 months ago
JSON representation

A Python Package to Create Synthetic Tabular Data

Host: GitHub
URL: https://github.com/alfurka/synloc
Owner: alfurka
License: mit
Created: 2022-08-30T04:20:27.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2026-01-10T09:26:18.000Z (5 months ago)
Last Synced: 2026-01-11T02:57:43.823Z (5 months ago)
Topics: clustering, constrained-clustering, copulas, data-augmentation, distributions, k-means, knn, local-sampling, machine-learning, multivariate-distributions, nonparametric-distribution, oversampling, python, resampling, sampling, semi-parametric-modeling, statistics, synthetic, synthetic-data, synthetic-dataset-generation
Language: Jupyter Notebook
Homepage:
Size: 41.8 MB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

Awesome Lists containing this project

README

          


# synloc: An Algorithm to Create Synthetic Tabular Data



[Overview](#overview) | [Installation](#installation) | [A Quick Example](#a-quick-example) | [Documentation](https://alfurka.github.io/synloc/) | [How to cite?](#how-to-cite) | [Replication](#replication)

[![PyPI](https://img.shields.io/pypi/v/synloc)](https://pypi.org/project/synloc) [![Python](https://img.shields.io/pypi/pyversions/synloc)](https://pypi.org/project/synloc) [![Downloads](https://static.pepy.tech/badge/synloc)](https://pepy.tech/project/synloc)



## Overview

`synloc` is an open-source Python package implementing the **Local Resampler (LR)** algorithm for generating synthetic tabular data while safeguarding privacy. It provides a computationally efficient and flexible approach to synthetic data generation, enabling researchers to work with privacy-preserving datasets that maintain statistical utility.

### Two Subsampling Strategies

Both approaches provide effective disclosure control. Choose based on your priorities:

| Approach | Best for | Key advantage |

|----------|----------|---------------|

| **k-Nearest Neighbors (k-NN)** | Stronger disclosure control | Naturally underrepresents outliers, reducing privacy risks |

| **Clustering-based** | Efficiency & accuracy | Better data utility and computational performance |

**Key features:**

- Natural disclosure risk reduction by underrepresenting outliers (k-NN variant)

- Accurate replication of complex distributions, including multimodal and non-convex-support data

- Flexible trade-off between data utility and privacy protection

- Compatible with parametric and nonparametric distributions

This implementation aligns with statistical agencies' safe data regulations, including the **k-anonymity** criterion and the **Five Safes** framework adopted by organizations such as the Australian Bureau of Statistics. For the full methodology and theoretical foundations, see the [paper referenced below](#how-to-cite).

## Installation

`synloc` can be installed through [PyPI](https://pypi.org/):

```

pip install synloc

```

## A Quick Example

Assume that we have a sample with three variables with the following distributions:

$$x \sim Beta(0.1,\,0.1)$$

$$y \sim Beta(0.1,\, 0.5)$$

$$z \sim 10 y + Normal(0,\,1)$$

The distribution can be generated by `tools` module in `synloc`:

```python

from synloc.tools import sample_trivariate_xyz

data = sample_trivariate_xyz() # Generates a sample with size 1000 by default. 

```

Initializing the resampler:

```python

from synloc import LocalCov

resampler = LocalCov(data = data, K = 30)

```

**Subsample** size is defined as `K=30`. Now, we locally estimate the multivariate normal distribution and from each estimated distributions we draw "synthetic values."

```python

syn_data = resampler.fit() 

```

    100%|██████████| 1000/1000 [00:01<00:00, 687.53it/s]

    

`syn_data` is a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) where all variables are synthesized. Comparing the original sample using a 3-D Scatter:

```python

resampler.comparePlots(['x','y','z'])

```    

![](https://raw.githubusercontent.com/alfurka/synloc/main/assets/README_7_0.png)

## How to cite?

If you use `synloc` in your research, please cite the following paper:

```bibtex

@article{kalay2025generating,

  author    = {Kalay, Ali Furkan},

  title     = {Generating Synthetic Data With Locally Estimated Distributions for Disclosure Control},

  journal   = {Australian \& New Zealand Journal of Statistics},

  year      = {2025},

  volume    = {n/a},

  number    = {n/a},

  keywords  = {clustering algorithms, computational statistics, k-nearest neighbours, statistical disclosure control, synthetic data},

  doi       = {10.1111/anzs.70032},

  url       = {https://onlinelibrary.wiley.com/doi/abs/10.1111/anzs.70032}

}

```

## Replication

For replication materials of the paper, see the [replication folder](replication/).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/alfurka/synloc

Awesome Lists containing this project

README