Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sdv-dev/SDMetrics
Metrics to evaluate quality and efficacy of synthetic datasets.
- Host: GitHub
- URL: https://github.com/sdv-dev/SDMetrics
- Owner: sdv-dev
- License: mit
- Created: 2020-03-20T14:15:48.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2024-11-12T14:10:15.000Z (3 months ago)
- Last Synced: 2024-11-12T15:20:49.626Z (3 months ago)
- Topics: metrics, quality, synthetic-data
- Language: Python
- Homepage: https://docs.sdv.dev/sdmetrics
- Size: 2.49 MB
- Stars: 212
- Watchers: 13
- Forks: 45
- Open Issues: 72
Metadata Files:
- Readme: README.md
- Changelog: HISTORY.md
- Contributing: CONTRIBUTING.rst
- License: LICENSE
- Codeowners: .github/CODEOWNERS
- Authors: AUTHORS.rst
Awesome Lists containing this project
- awesome-data-synthesis - SDMetrics
README
This repository is part of The Synthetic Data Vault Project, a project from DataCebo.
# Overview
The SDMetrics library evaluates synthetic data by comparing it to the real data that you're trying to mimic. It includes a variety of metrics that capture different aspects of the data, such as **quality and privacy**. It also includes reports that you can run to generate insights, visualize data, and share the results with your team.
The SDMetrics library is **model-agnostic**, meaning you can use any synthetic data. The library does not need to know how you created the data.
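To make the model-agnostic idea concrete, here is a minimal sketch in plain Python. It is not the SDMetrics API: `boundary_score` is a hypothetical, simplified stand-in for a metric, and both "generators" are made up for illustration. The point is only that the scoring function receives values and never needs to know how they were produced.

```python
import random


def boundary_score(real_values, synthetic_values):
    """Hypothetical helper: fraction of synthetic values inside the real min/max range."""
    low, high = min(real_values), max(real_values)
    inside = sum(low <= value <= high for value in synthetic_values)
    return inside / len(synthetic_values)


real_ages = [23, 35, 41, 52, 67]

# Two unrelated ways of producing "synthetic" data (both invented for this sketch).
rng = random.Random(0)
sampled_ages = [rng.randint(18, 80) for _ in range(100)]  # a random sampler
handwritten_ages = [25, 30, 70, 45]                       # values typed in by hand

# The same function scores both datasets; it only ever sees the values.
for name, synthetic in [("sampled", sampled_ages), ("handwritten", handwritten_ages)]:
    print(name, round(boundary_score(real_ages, synthetic), 2))
```

For the handwritten list, three of the four values fall inside the real range of 23 to 67, so its score is 0.75.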
# Install
Install SDMetrics using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.
```bash
pip install sdmetrics
```
```bash
conda install -c conda-forge sdmetrics
```
For more information about using SDMetrics, visit the [SDMetrics Documentation](https://docs.sdv.dev/sdmetrics).
# Usage
Get started with **SDMetrics Reports** using some demo data:
```python
from sdmetrics import load_demo
from sdmetrics.reports.single_table import QualityReport

real_data, synthetic_data, metadata = load_demo(modality='single_table')
my_report = QualityReport()
my_report.generate(real_data, synthetic_data, metadata)
```
```
Creating report: 100%|██████████| 4/4 [00:00<00:00, 5.22it/s]

Overall Quality Score: 82.84%
Properties:
Column Shapes: 82.78%
Column Pair Trends: 82.9%
```
Once you generate the report, you can drill down on the details and visualize the results.
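One detail that helps when reading the details: the overall score printed above is consistent with a plain average of the two property scores. The short check below uses only the numbers from that output. The averaging rule is an assumption inferred from those (rounded) numbers, so treat it as a sanity check rather than a specification.

```python
# Property scores copied from the report output above (percentages).
property_scores = {
    "Column Shapes": 82.78,
    "Column Pair Trends": 82.9,
}

# Assumption: the overall score is the unweighted mean of the property scores.
overall = sum(property_scores.values()) / len(property_scores)
print(f"Overall Quality Score: {overall:.2f}%")  # Overall Quality Score: 82.84%
```

The report's own call for visualizing a single property follows.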
```python
my_report.get_visualization(property_name='Column Pair Trends')
```
Save the report and share it with your team.
```python
my_report.save(filepath='demo_data_quality_report.pkl')

# load it at any point in the future
my_report = QualityReport.load(filepath='demo_data_quality_report.pkl')
```
**Want more metrics?** You can also manually apply any of the metrics in this library to your data.
```python
# calculate whether the synthetic data respects the min/max bounds
# set by the real data
from sdmetrics.single_column import BoundaryAdherence

BoundaryAdherence.compute(
    real_data['start_date'],
    synthetic_data['start_date']
)
```
```
0.8503937007874016
```
```python
# calculate whether the synthetic data is new or whether it's an exact copy of the real data
from sdmetrics.single_table import NewRowSynthesis

NewRowSynthesis.compute(
    real_data,
    synthetic_data,
    metadata
)
```
```
1.0
```
# What's next?
To learn more about the reports and metrics, visit the [SDMetrics Documentation](https://docs.sdv.dev/sdmetrics).
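In the meantime, for a rough intuition about the `NewRowSynthesis` score shown in the Usage section, the sketch below counts how many synthetic rows are not exact copies of a real row. It is a simplified illustration, not the library's implementation: `new_row_score` and the sample rows are hypothetical, and it uses exact matching only, whereas the real metric may apply its own matching rules.

```python
def new_row_score(real_rows, synthetic_rows):
    """Hypothetical helper: fraction of synthetic rows that match no real row exactly."""
    # Turn each row (a dict) into a hashable, order-independent key.
    real_keys = {tuple(sorted(row.items())) for row in real_rows}
    new_rows = sum(
        tuple(sorted(row.items())) not in real_keys for row in synthetic_rows
    )
    return new_rows / len(synthetic_rows)


# Made-up sample data for illustration.
real_rows = [
    {"name": "Ana", "age": 31},
    {"name": "Ben", "age": 45},
]
synthetic_rows = [
    {"name": "Ana", "age": 31},   # exact copy of a real row
    {"name": "Cleo", "age": 29},  # new row
    {"name": "Ben", "age": 44},   # differs in one column, so it counts as new
    {"name": "Dev", "age": 45},   # new row
]

print(new_row_score(real_rows, synthetic_rows))  # 0.75
```

A score of 1.0, as in the Usage example, would mean that none of the synthetic rows duplicate a real row under this kind of comparison.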
---
[The Synthetic Data Vault Project](https://sdv.dev) was first created at MIT's [Data to AI Lab](https://dai.lids.mit.edu/) in 2016. After 4 years of research and traction with enterprise, we
created [DataCebo](https://datacebo.com) in 2020 with the goal of growing the project.
Today, DataCebo is the proud developer of SDV, the largest ecosystem for
synthetic data generation & evaluation. It is home to multiple libraries that support synthetic
data, including:

* 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
* 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular,
multi table and time series data.
* 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data
generation models.

[Get started using the SDV package](https://sdv.dev/SDV/getting_started/install.html) -- a fully
integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries
for specific needs.