Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-data-synthesis
A curated list of awesome resources for creating synthetic data
https://github.com/joofio/awesome-data-synthesis
Data-driven methods
Tabular
- CTGAN - CTGAN is a GAN-based data synthesizer that can generate synthetic tabular data with high fidelity. - [Paper](https://arxiv.org/pdf/1907.00503.pdf)
- TGAN - Outdated and superseded by **CTGAN**
- gretel - Create fake, synthetic datasets with enhanced privacy guarantees
- On the Generation and Evaluation of Synthetic Tabular Data using GANs - Proposes training the GAN with the WGAN-GP architecture, which suffers less from mode collapse and has a more meaningful loss.
- DataSynthesizer - DataSynthesizer generates synthetic data that simulates a given dataset. It applies differential privacy techniques to achieve a strong privacy guarantee.
- MedGAN - medGAN is a generative adversarial network for generating multi-label discrete patient records. It can generate both binary and count variables (i.e. medical codes such as diagnosis codes, medication codes or procedure codes) - [Paper](https://arxiv.org/abs/1703.06490)
- MC-MedGAN - Multi-Categorical GANs - [Paper](https://arxiv.org/pdf/1807.01202.pdf)
- tableGAN - tableGAN is a synthetic data generation technique based on the Generative Adversarial Network architecture (DCGAN), introduced in the paper "Data Synthesis based on Generative Adversarial Networks". - [Paper](http://www.vldb.org/pvldb/vol11/p1071-park.pdf)
- VEEGAN - Reducing Mode Collapse in GANs using Implicit Variational Learning - [Paper](https://arxiv.org/abs/1705.07761)
- DP-GAN - Differentially private release of semantic rich data - [Paper](https://arxiv.org/abs/1801.01594)
- DP-GAN 2 - Source code of paper "Differentially Private Generative Adversarial Network" - [Paper](https://arxiv.org/abs/1802.06739)
- CLGP - categorical latent Gaussian process is a generative model for multivariate categorical data - [Paper](http://proceedings.mlr.press/v37/gala15.html)
- COR-GAN - Correlation-Capturing Convolutional Neural Networks for Generating Synthetic Healthcare Records - [Paper](https://arxiv.org/pdf/2001.09346v2.pdf)
- synergetr - An R package to generate synthetic data with empirical probability distributions
- SynC - SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula
- NIST-PSCR - Code and Data for NIST PSCR Differential Privacy Synthetic Data Challenge
- Python synthpop - Python implementation of the R package synthpop.
- Repo on generating synthetic data using GAN
- synthia - 📈 🐍 Multidimensional synthetic data generation in Python
- QUIPP - Privacy preserving synthetic data generation workflows
- MSFT synthetic data showcase - Generates synthetic data and user interfaces for privacy-preserving data sharing and analysis.
- extended-MedGan - Synthetic patient data using generative adversarial networks.
- Synthesizing quality open data - Synthesizing Quality Open Data Assets from Private Health Research Studies
- bayesian-synthetic-generator - Repository of a software system for generating synthetic personal data based on the Bayesian network block structure
- synthetic health data
- Synthetic data Copula
- HoloClean - A Machine Learning System for Data Enrichment.
- SYNDATA - Generation and evaluation of synthetic patient data - [Paper](https://bmcmedresmethodol.biomedcentral.com/track/pdf/10.1186/s12874-020-00977-1.pdf)
- FakeR - Generates fake data from a dataset of different variable types
- Synthpop - A tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis.
- PateGAN
- bnomics - Synthetic data generation with probabilistic Bayesian Networks - [Paper](https://www.biorxiv.org/content/10.1101/2020.06.14.151084v1.full.pdf)
- MPoM
- DPautoGAN - Code for the paper "Differentially Private Mixed-Type Data Generation for Unsupervised Learning"
- Bn-learn Latent Model - Generating High-Fidelity Synthetic Patient Data for Assessing Machine Learning Healthcare Software - [Paper](https://www.nature.com/articles/s41746-020-00353-9)
- SAP Security research sample - SAP Security research sample code and tutorials for generating differentially private synthetic datasets using generative deep learning models
- Synthetic_Data_System - The Alpha Build of the SDS for ideas gathering, testing and commentary
- Generating-Synthetic-data-using-GANs - How can we safely and efficiently share encrypted data that is also useful? This repo applies the GAN mechanism used for generating fake images to the generation of synthetic tabular data.
- PrivBayes
- pategan
- UCLANesl - NIST Differential Privacy Challenge (Match 3)
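
Several entries above (DataSynthesizer, DP-GAN, PrivBayes, DPautoGAN) build on differential privacy. As a rough illustration of the general idea only, and not the algorithm of any listed project, the sketch below adds Laplace noise to the counts of a single categorical column and samples synthetic values from the noisy histogram. The function names, the `epsilon` value, and the toy data are assumptions made for this example.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram_synthesize(values, domain, epsilon, n_samples, seed=0):
    """Sample synthetic values from a Laplace-noised histogram (toy sketch)."""
    rng = random.Random(seed)
    counts = Counter(values)
    # Adding or removing one record changes one count by 1, so the
    # noise scale is 1 / epsilon.
    noisy = [
        max(counts.get(v, 0) + laplace_noise(1.0 / epsilon, rng), 0.0)
        for v in domain
    ]
    # Fall back to uniform weights if every noisy count was clipped to zero.
    weights = noisy if sum(noisy) > 0 else [1.0] * len(domain)
    return rng.choices(domain, weights=weights, k=n_samples)


# Hypothetical toy column of diagnosis categories.
real = ["flu", "flu", "asthma", "flu", "diabetes"]
synthetic = dp_histogram_synthesize(
    real, ["flu", "asthma", "diabetes"], epsilon=1.0, n_samples=10
)
print(synthetic)
```

A single-column histogram ignores correlations between columns; the listed tools model joint structure (for example with Bayesian networks or GANs) and should be preferred for real use.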
Time Series
- mtss-gan
- data-generator
- RGAN
- tsBNgen - [Paper](https://arxiv.org/pdf/2009.04595.pdf)
- Synthetic Data Generation - Material for QuantUniversity talk on Synthetic Data Generation for Finance.
- LSTM GAN model - The LSTM GAN model can be used for generation of synthetic multi-dimension time series data.
- Machine-learning for trading Ch21
Sensor data
- synsys - create sensor data
Process-driven methods
Tabular
- plaitpy
- pySyntheticDatasetGenerator - Generate relational fictive dataset from a simple yaml description
- datasynthR
- synner - Generating Realistic Synthetic Data
- OpenSDPsynthR
- genstar
- conjurer - R Package to generate synthetic data.
- synthea
- BadMedicine
- bindata
- GenOrd
- MultiOrd
- PoisBinOrdNonNor
- SimMultiCorrData
- SimPop - Tools and methods to simulate populations for surveys based on auxiliary data. The tools include model-based methods, calibration and combinatorial optimization algorithms.
- charlatan
- fabricatr
- synthetico
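
Process-driven tools such as those above generate data from rules or a declarative description rather than from a fitted model. A minimal sketch of that pattern, assuming a made-up spec format (the `SPEC` keys and rule types below are hypothetical and do not match the configuration syntax of any listed package):

```python
import random

# Hypothetical column rules; each listed tool defines its own format.
SPEC = {
    "age": {"type": "int", "min": 18, "max": 90},
    "sex": {"type": "choice", "values": ["F", "M"]},
    "bmi": {"type": "normal", "mean": 26.0, "std": 4.0},
}


def generate_rows(spec, n_rows, seed=0):
    """Generate rows by applying each column's rule independently."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, rule in spec.items():
            if rule["type"] == "int":
                # randint is inclusive on both ends.
                row[name] = rng.randint(rule["min"], rule["max"])
            elif rule["type"] == "choice":
                row[name] = rng.choice(rule["values"])
            elif rule["type"] == "normal":
                row[name] = round(rng.gauss(rule["mean"], rule["std"]), 1)
            else:
                raise ValueError(f"unknown rule type: {rule['type']}")
        rows.append(row)
    return rows


for row in generate_rows(SPEC, n_rows=3):
    print(row)
```

Columns are drawn independently here; tools that simulate populations or patient histories add dependencies between fields on top of this basic idea.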
Metrics and dataset evaluation
Tabular
- datagene
- SDMetrics
- table-evaluator
- SDGym - Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators for tabular data. SDGym is a project of the Data to AI Laboratory at MIT.
- virtualdatalab - Benchmarking synthetic data generators for sequential data in terms of accuracy and privacy.
- Statistical-Similarity-Measurement - A methodology designed to validate the statistical similarity of synthetic data generated by GAN models. The metrics include autoencoder, PCA, t-SNE, KL divergence, clustering, and cosine similarity.
- SDV evaluation functions
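
One of the metrics named above, KL divergence, can be sketched for a single categorical column as follows. This is an illustrative implementation under simple assumptions (additive smoothing with a hypothetical `smoothing` parameter, toy data) and is not taken from any of the listed libraries.

```python
import math
from collections import Counter


def kl_divergence(real, synthetic, smoothing=1e-9):
    """KL(P_real || P_synthetic) over the categories seen in either sample."""
    categories = sorted(set(real) | set(synthetic))
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    k = len(categories)
    # Additive smoothing keeps every probability strictly positive,
    # so the logarithm below is always defined.
    p = [(real_counts[c] + smoothing) / (len(real) + smoothing * k) for c in categories]
    q = [(synth_counts[c] + smoothing) / (len(synthetic) + smoothing * k) for c in categories]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical toy columns.
real = ["A", "A", "B", "C"]
close = ["A", "B", "A", "C"]   # same category frequencies as `real`
far = ["C", "C", "C", "B"]     # very different frequencies

print(kl_divergence(real, close))
print(kl_divergence(real, far))
```

A lower value indicates that the synthetic column's category frequencies are closer to the real ones. KL divergence is asymmetric, so the order of the arguments matters.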
Keywords
- synthetic-data (22)
- generative-adversarial-network (7)
- machine-learning (5)
- synthetic-dataset-generation (5)
- synthetic-data-generation (4)
- tabular-data (4)
- finance (4)
- data-generator (4)
- statistics (3)
- time-series (3)
- simulation (3)
- similarity-measures (3)
- deep-learning (3)
- data-generation (3)
- privacy (3)
- gan (3)
- gans (2)
- data-science (2)
- differential-privacy (2)
- synthetic (2)
- r (2)
- pytorch (2)
- multivariate-timeseries (2)
- generative-ai (2)
- data-evaluation (2)
- stress-test (1)
- model-validation (1)
- timeseries-data (1)
- adverserial (1)
- timeseriesclassification (1)
- arxiv (1)
- clinical-research (1)
- medical (1)
- multivariate-data (1)
- mnist (1)
- paper (1)
- rnn (1)
- algorithmic-trading (1)
- declarative (1)
- modeling (1)
- database (1)
- faker (1)
- generator (1)
- angularjs (1)
- chi (1)
- d3 (1)
- gui (1)
- hci (1)
- synthesizing-tabular-data (1)
- artificial-intelligence (1)