{"id":37086014,"url":"https://github.com/alfurka/synloc","last_synced_at":"2026-01-14T10:34:57.394Z","repository":{"id":59053719,"uuid":"530493680","full_name":"alfurka/synloc","owner":"alfurka","description":"A Python Package to Create Synthetic Tabular Data","archived":false,"fork":false,"pushed_at":"2026-01-10T09:26:18.000Z","size":43836,"stargazers_count":3,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-11T02:57:43.823Z","etag":null,"topics":["clustering","constrained-clustering","copulas","data-augmentation","distributions","k-means","knn","local-sampling","machine-learning","multivariate-distributions","nonparametric-distribution","oversampling","python","resampling","sampling","semi-parametric-modeling","statistics","synthetic","synthetic-data","synthetic-dataset-generation"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alfurka.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-08-30T04:20:27.000Z","updated_at":"2026-01-10T09:26:23.000Z","dependencies_parsed_at":"2025-10-07T02:29:32.305Z","dependency_job_id":"5dc99ce7-e1bb-4fea-b6bb-b554e61b3942","html_url":"https://github.com/alfurka/synloc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/alfurka/synloc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/alfurka%2Fsynloc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alfurka%2Fsynloc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alfurka%2Fsynloc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alfurka%2Fsynloc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alfurka","download_url":"https://codeload.github.com/alfurka/synloc/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alfurka%2Fsynloc/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28417661,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T10:25:19.714Z","status":"ssl_error","status_checked_at":"2026-01-14T10:22:49.371Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","constrained-clustering","copulas","data-augmentation","distributions","k-means","knn","local-sampling","machine-learning","multivariate-distributions","nonparametric-distribution","oversampling","python","resampling","sampling","semi-parametric-modeling","statistics","synthetic","synthetic-data","synthetic-dataset-generation"],"created_at":"2026-01-14T10:34:56.748Z","updated_at":"2026-01-14T10:34:57.389Z","avatar_url":"https://github.com/alfurka.png","language":"Jupyter 
Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\r\n\r\n# synloc: An Algorithm to Create Synthetic Tabular Data\r\n\r\n\u003cimg src=\"https://raw.githubusercontent.com/alfurka/synloc/main/assets/logo_white_bc.png\" alt = 'synloc'\u003e\r\n\r\n[Overview](#overview) | [Installation](#installation) | [A Quick Example](#a-quick-example) | [Documentation](https://alfurka.github.io/synloc/) | [How to cite?](#how-to-cite) | [Replication](#replication)\r\n\r\n[![PyPI](https://img.shields.io/pypi/v/synloc)](https://pypi.org/project/synloc) [![Python](https://img.shields.io/pypi/pyversions/synloc)](https://pypi.org/project/synloc) [![Downloads](https://static.pepy.tech/badge/synloc)](https://pepy.tech/project/synloc)\r\n\r\n\u003c/div\u003e\r\n\r\n## Overview\r\n\r\n`synloc` is an open-source Python package implementing the **Local Resampler (LR)** algorithm for generating synthetic tabular data while safeguarding privacy. It provides a computationally efficient and flexible approach to synthetic data generation, enabling researchers to work with privacy-preserving datasets that maintain statistical utility.\r\n\r\n### Two Subsampling Strategies\r\n\r\nBoth approaches provide effective disclosure control. 
Choose based on your priorities:\r\n\r\n| Approach | Best for | Key advantage |\r\n|----------|----------|---------------|\r\n| **k-Nearest Neighbors (k-NN)** | Stronger disclosure control | Naturally underrepresents outliers, reducing privacy risks |\r\n| **Clustering-based** | Efficiency \u0026 accuracy | Better data utility and computational performance |\r\n\r\n**Key features:**\r\n- Natural disclosure risk reduction by underrepresenting outliers (k-NN variant)\r\n- Accurate replication of complex distributions, including multimodal and non-convex-support data\r\n- Flexible trade-off between data utility and privacy protection\r\n- Compatible with parametric and nonparametric distributions\r\n\r\nThis implementation aligns with statistical agencies' safe data regulations, including the **k-anonymity** criterion and the **Five Safes** framework adopted by organizations such as the Australian Bureau of Statistics. For the full methodology and theoretical foundations, see the [paper referenced below](#how-to-cite).\r\n\r\n## Installation\r\n\r\n`synloc` can be installed through [PyPI](https://pypi.org/):\r\n\r\n```\r\npip install synloc\r\n```\r\n\r\n## A Quick Example\r\n\r\nAssume that we have a sample with three variables with the following distributions:\r\n\r\n$$x \\sim Beta(0.1,\\,0.1)$$\r\n\r\n$$y \\sim Beta(0.1,\\, 0.5)$$\r\n\r\n$$z \\sim 10 y + Normal(0,\\,1)$$\r\n\r\nThe distribution can be generated by `tools` module in `synloc`:\r\n\r\n\r\n```python\r\nfrom synloc.tools import sample_trivariate_xyz\r\ndata = sample_trivariate_xyz() # Generates a sample with size 1000 by default. \r\n```\r\n\r\nInitializing the resampler:\r\n\r\n\r\n```python\r\nfrom synloc import LocalCov\r\nresampler = LocalCov(data = data, K = 30)\r\n```\r\n\r\n**Subsample** size is defined as `K=30`. 
Now, we locally estimate multivariate normal distributions, and from each estimated distribution we draw \"synthetic values.\"\r\n\r\n\r\n```python\r\nsyn_data = resampler.fit() \r\n```\r\n\r\n    100%|██████████| 1000/1000 [00:01\u003c00:00, 687.53it/s]\r\n    \r\n\r\n`syn_data` is a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) where all variables are synthesized. Compare the synthetic data with the original sample using a 3-D scatter plot:\r\n\r\n\r\n```python\r\nresampler.comparePlots(['x','y','z'])\r\n```    \r\n![](https://raw.githubusercontent.com/alfurka/synloc/main/assets/README_7_0.png)\r\n\r\n## How to cite?\r\n\r\nIf you use `synloc` in your research, please cite the following paper:\r\n\r\n```bibtex\r\n@article{kalay2025generating,\r\n  author    = {Kalay, Ali Furkan},\r\n  title     = {Generating Synthetic Data With Locally Estimated Distributions for Disclosure Control},\r\n  journal   = {Australian \\\u0026 New Zealand Journal of Statistics},\r\n  year      = {2025},\r\n  volume    = {n/a},\r\n  number    = {n/a},\r\n  keywords  = {clustering algorithms, computational statistics, k-nearest neighbours, statistical disclosure control, synthetic data},\r\n  doi       = {10.1111/anzs.70032},\r\n  url       = {https://onlinelibrary.wiley.com/doi/abs/10.1111/anzs.70032}\r\n}\r\n```\r\n\r\n## Replication\r\n\r\nFor replication materials of the paper, see the [replication folder](replication/).\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falfurka%2Fsynloc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falfurka%2Fsynloc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falfurka%2Fsynloc/lists"}