https://github.com/scai-bio/syndat
Synthetic data quality evaluation & visualization
- Host: GitHub
- URL: https://github.com/scai-bio/syndat
- Owner: SCAI-BIO
- License: mit
- Created: 2023-12-21T15:05:08.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-15T19:08:32.000Z (10 months ago)
- Last Synced: 2024-07-25T10:28:03.162Z (10 months ago)
- Language: Python
- Homepage: https://syndat.readthedocs.io
- Size: 57.6 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Syndat
Syndat is a software package that provides basic functionality for evaluating and visualizing synthetic data. Quality scores can be computed on three base metrics (Discrimination, Correlation and Distribution), and data can be visualized to inspect correlation structures or statistical distributions.
# Installation
Install via pip:
```bash
pip install syndat
```
# Usage
## Quality metrics
Compute data quality metrics by comparing real and synthetic data in terms of their separation complexity, distribution similarity or pairwise feature correlations:
```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# How similar are the statistical distributions of real and synthetic features
distribution_similarity_score = syndat.scores.distribution(real, synthetic)

# How hard is it for a classifier to discriminate real and synthetic data
discrimination_score = syndat.scores.discrimination(real, synthetic)

# How well are pairwise feature correlations preserved
correlation_score = syndat.scores.correlation(real, synthetic)
```

Scores are defined in a range of 0-100, with a higher score corresponding to better data fidelity.
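Because all three scores share the same 0-100 scale, they can be compared directly. The following is a minimal sketch of one way to summarize them, assuming the scores have already been computed as above. The helper `summarize_scores`, the averaging, and the threshold of 70 are illustrative assumptions and not part of syndat; the numbers are placeholders, not real results.

```python
def summarize_scores(scores, minimum_acceptable=70.0):
    """Validate 0-100 scores and report the weakest metric.

    Hypothetical helper for illustration; not part of syndat.
    `minimum_acceptable` is an arbitrary threshold, not a syndat default.
    """
    for name, value in scores.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} score {value} is outside the 0-100 range")
    # The metric with the lowest score is the one most worth inspecting
    weakest = min(scores, key=scores.get)
    return {
        "mean": sum(scores.values()) / len(scores),
        "weakest": weakest,
        "passed": scores[weakest] >= minimum_acceptable,
    }


# Placeholder values standing in for the results of the syndat.scores.* calls
example_scores = {"distribution": 82.0, "discrimination": 64.0, "correlation": 91.0}
summary = summarize_scores(example_scores)
print(summary)
```

Reporting the weakest metric rather than only the mean keeps a single poor score from being hidden by two good ones.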
## Visualization
Visualize real vs. synthetic data distributions, summary statistics and discriminating features:
```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# plot *all* feature distributions and store image files
syndat.visualization.plot_distributions(real, synthetic, store_destination="results/plots")
syndat.visualization.plot_correlations(real, synthetic, store_destination="results/plots")

# plot and display the distribution of a specific feature
syndat.visualization.plot_numerical_feature("feature_xy", real, synthetic)

# plot a SHAP plot of the features that differentiate real and synthetic data
syndat.visualization.plot_shap_discrimination(real, synthetic)
```
## Postprocessing
Postprocess synthetic data to improve data fidelity:
```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# postprocess synthetic data
synthetic_post = syndat.postprocessing.assert_minmax(real, synthetic)
# pass the result of the first step on, so that both steps take effect
synthetic_post = syndat.postprocessing.normalize_float_precision(real, synthetic_post)
```
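To give an intuition for range-based postprocessing, here is a minimal sketch for a single numeric column, assuming that out-of-range synthetic values are clamped to the minimum and maximum observed in the real data. The helper `clip_to_real_range` is hypothetical and uses plain Python lists; it does not show how `assert_minmax` is implemented in syndat, and the values are made up.

```python
def clip_to_real_range(real_values, synthetic_values):
    """Clamp each synthetic value into [min(real), max(real)].

    Hypothetical helper for illustration; not part of syndat.
    """
    lower, upper = min(real_values), max(real_values)
    return [min(max(value, lower), upper) for value in synthetic_values]


# Made-up example column
real_ages = [23.0, 35.0, 41.0, 58.0]
synthetic_ages = [19.5, 30.2, 44.8, 63.1]

clipped_ages = clip_to_real_range(real_ages, synthetic_ages)
print(clipped_ages)  # [23.0, 30.2, 44.8, 58.0]
```

Values already inside the real range are left unchanged; only 19.5 and 63.1 are moved to the nearest bound.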