Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-data-synthesis

A curated list of awesome resources for creating synthetic data
https://github.com/joofio/awesome-data-synthesis


  • Data-driven methods

    • Tabular

      • CTGAN - CTGAN is a GAN-based data synthesizer that can generate synthetic tabular data with high fidelity. - [Paper](https://arxiv.org/pdf/1907.00503.pdf)
      • TGAN - Outdated and superseded by **CTGAN**
      • gretel - create fake, synthetic datasets with enhanced privacy guarantees
      • On the Generation and Evaluation of Synthetic Tabular Data using GANs - Proposes using the WGAN-GP architecture for training the GAN, which suffers less from mode collapse and has a more meaningful loss.
      • DataSynthesizer - DataSynthesizer generates synthetic data that simulates a given dataset. It applies Differential Privacy techniques to achieve a strong privacy guarantee.
      • MedGAN - medGAN is a generative adversarial network for generating multi-label discrete patient records. It can generate both binary and count variables (i.e. medical codes such as diagnosis codes, medication codes or procedure codes) - [Paper](https://arxiv.org/abs/1703.06490)
      • MC-MedGAN - Multi-Categorical GANs - [Paper](https://arxiv.org/pdf/1807.01202.pdf)
      • tableGAN - tableGAN is a synthetic data generation technique based on the DCGAN (Generative Adversarial Network) architecture, from the paper "Data Synthesis based on Generative Adversarial Networks". - [Paper](http://www.vldb.org/pvldb/vol11/p1071-park.pdf)
      • VEEGAN - Reducing Mode Collapse in GANs using Implicit Variational Learning - [Paper](https://arxiv.org/abs/1705.07761)
      • DP-GAN - Differentially private release of semantic rich data - [Paper](https://arxiv.org/abs/1801.01594)
      • DP-GAN 2 - Source code of paper "Differentially Private Generative Adversarial Network" - [Paper](https://arxiv.org/abs/1802.06739)
      • CLGP - categorical latent Gaussian process is a generative model for multivariate categorical data - [Paper](http://proceedings.mlr.press/v37/gala15.html)
      • COR-GAN - Correlation-Capturing Convolutional Neural Networks for Generating Synthetic Healthcare Records - [Paper](https://arxiv.org/pdf/2001.09346v2.pdf)
      • synergetr - An R package to generate synthetic data with empirical probability distributions
      • SynC - SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula
      • NIST-PSCR - Code and Data for NIST PSCR Differential Privacy Synthetic Data Challenge
      • Python synthpop - Python implementation of the R package synthpop.
      • Repo on generating synthetic data using GAN
      • synthia - 📈 🐍 Multidimensional synthetic data generation in Python
      • QUIPP - Privacy preserving synthetic data generation workflows
      • MSFT synthetic data showcase - Generates synthetic data and user interfaces for privacy-preserving data sharing and analysis.
      • extended-MedGan - Synthetic patient data using generative adversarial networks.
      • Synthesizing quality open data - Synthesizing Quality Open Data Assets from Private Health Research Studies
      • bayesian-synthetic-generator - Repository of a software system for generating synthetic personal data based on the Bayesian network block structure
      • synthetic health data
      • Synthetic data Copula
      • HoloClean - A Machine Learning System for Data Enrichment.
      • SYNDATA - Generation and evaluation of synthetic patient data - [Paper](https://bmcmedresmethodol.biomedcentral.com/track/pdf/10.1186/s12874-020-00977-1.pdf)
      • FakeR - Generates fake data from a dataset of different variable types
      • Synthpop - A tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis.
      • PateGAN
      • bnomics - Synthetic data generation with probabilistic Bayesian Networks - [Paper](https://www.biorxiv.org/content/10.1101/2020.06.14.151084v1.full.pdf)
      • MPoM
      • DPautoGAN - Code for the paper "Differentially Private Mixed-Type Data Generation for Unsupervised Learning"
      • Bn-learn Latent Model - Generating High-Fidelity Synthetic Patient Data for Assessing Machine Learning Healthcare Software - [Paper](https://www.nature.com/articles/s41746-020-00353-9)
      • SAP Security research sample - SAP Security research sample code and tutorials for generating differentially private synthetic datasets using generative deep learning models
      • Synthetic_Data_System - The Alpha Build of the SDS for ideas gathering, testing and commentary
      • Generating-Synthetic-data-using-GANs - How can we safely and efficiently share encrypted data that is also useful? This project applies the GAN mechanism used to generate fake images to generating synthetic tabular data.
      • PrivBayes
      • pategan
      • UCLANesl - NIST Differential Privacy Challenge (Match 3)
    • Time Series

    • Sensor data

  • Process-driven methods

  • Metrics and dataset evaluation

    • Tabular

      • datagene
      • SDMetrics
      • table-evaluator
      • SDGym - Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators for tabular data. SDGym is a project of the Data to AI Laboratory at MIT.
      • virtualdatalab - Benchmarking synthetic data generators for sequential data in terms of accuracy and privacy.
      • Statistical-Similarity-Measurement - A methodology designed to validate the statistical similarity of synthetic data generated by GAN models. The metrics contain Auto-encoder, PCA, t-SNE, KL-divergence, Clustering, and Cosine Similarity.
      • SDV evaluation functions
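
The Statistical-Similarity-Measurement entry above names KL-divergence and cosine similarity among its metrics. As a rough illustration of what such a comparison can look like for a single numeric column, here is a minimal sketch using only NumPy. It is not taken from any of the listed projects: the function names (`histogram_probs`, `kl_divergence`, `cosine_similarity`), the shared-bin histogram approach, and the smoothing constant are all assumptions made for this example.

```python
import numpy as np


def histogram_probs(real, synthetic, bins=10, eps=1e-9):
    """Bin both samples on shared edges and return two probability vectors.

    eps is a small smoothing constant (an assumption of this sketch) that
    keeps empty bins from producing log(0) or division by zero.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    # Shared edges so that bin i means the same value range in both samples.
    edges = np.histogram_bin_edges(np.concatenate([real, synthetic]), bins=bins)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synthetic, bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return p, q


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return float(np.sum(p * np.log(p / q)))


def cosine_similarity(p, q):
    """Cosine of the angle between the two probability vectors."""
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real_col = rng.normal(loc=0.0, scale=1.0, size=1000)       # stand-in for a real column
    synthetic_col = rng.normal(loc=0.1, scale=1.1, size=1000)  # stand-in for a synthetic column
    p, q = histogram_probs(real_col, synthetic_col, bins=20)
    print("KL divergence:", kl_divergence(p, q))
    print("Cosine similarity:", cosine_similarity(p, q))
```

A KL divergence near 0 and a cosine similarity near 1 suggest the two column distributions are close under this binning. Both numbers depend on the bin count, so they are best compared across generators with the binning held fixed.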