https://github.com/martinjurkovic/syntherela
A package for benchmarking synthetic relational data generation methods
benchmark deep-learning graph-neural-networks machine-learning pytorch pytorch-geometric relational-data relational-deep-learning synthetic-data tabular-data
- Host: GitHub
- URL: https://github.com/martinjurkovic/syntherela
- Owner: martinjurkovic
- License: mit
- Created: 2024-02-05T14:23:07.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2026-03-25T11:56:31.000Z (3 months ago)
- Last Synced: 2026-03-26T14:22:51.617Z (3 months ago)
- Topics: benchmark, deep-learning, graph-neural-networks, machine-learning, pytorch, pytorch-geometric, relational-data, relational-deep-learning, synthetic-data, tabular-data
- Language: Python
- Homepage:
- Size: 6.89 MB
- Stars: 61
- Watchers: 3
- Forks: 1
- Open Issues: 8
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# SyntheRela - Synthetic Relational Data Generation Benchmark
## About SyntheRela
SyntheRela is a comprehensive benchmark designed to evaluate and compare synthetic relational database generation methods. It provides a standardized framework for assessing both the fidelity and utility of synthetic data across multiple real-world databases. The benchmark includes novel evaluation metrics, particularly for relational data, and supports various open-source and commercial synthetic data generation methods.
SyntheRela is highly extensible, allowing users to benchmark on their own custom datasets and implement new evaluation metrics to suit specific use cases.
Our research on SyntheRela is presented in the paper **"SyntheRela: A Benchmark For Synthetic Relational Database Generation"** at the ICLR 2025 Workshop "Will Synthetic Data Finally Solve the Data Access Problem?", available on [OpenReview](https://openreview.net/forum?id=ZfQofWYn6n).
We maintain a [public leaderboard on Hugging Face](https://huggingface.co/spaces/SyntheRela/leaderboard) where you can compare the performance of different synthetic data generation methods.
## Installation
To install only the benchmark package, run the following command:
```bash
pip install syntherela
```
## Using SyntheRela
To evaluate your synthetic relational data, configure the `Benchmark` class with your desired metrics and run the evaluation pipeline:
```python
from syntherela.benchmark import Benchmark
from syntherela.metrics.single_column.statistical import ChiSquareTest
from syntherela.metrics.single_table.distance import MaximumMeanDiscrepancy
from syntherela.metrics.multi_table.statistical import CardinalityShapeSimilarity
from syntherela.metrics.multi_table.detection import AggregationDetection
from xgboost import XGBClassifier

# Initialize the benchmark with specific metrics
benchmark = Benchmark(
    real_data_dir="path/to/real_data",
    synthetic_data_dir="path/to/synthetic_data",
    results_dir="results",
    single_column_metrics=[ChiSquareTest()],
    single_table_metrics=[MaximumMeanDiscrepancy()],
    multi_table_metrics=[
        CardinalityShapeSimilarity(),
        AggregationDetection(classifier_cls=XGBClassifier, random_state=42),
    ],
    datasets=["your_dataset_name"],
    methods=["your_method_name"],
)

# Execute evaluation
benchmark.run()
```
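For intuition about what a cardinality-based multi-table metric such as `CardinalityShapeSimilarity` measures, the stand-alone sketch below counts how many child rows each parent has in the real and in the synthetic data, then compares the two count distributions with the complement of a two-sample Kolmogorov-Smirnov statistic. This assumes the metric follows that common approach; the helper names `child_counts` and `ks_complement` and the toy data are hypothetical, and the code is not SyntheRela's implementation, which may differ in detail.

```python
from collections import Counter


def child_counts(parent_keys, child_foreign_keys):
    """Return the number of child rows per parent (0 for childless parents)."""
    counts = Counter(child_foreign_keys)
    return [counts.get(key, 0) for key in parent_keys]


def ks_complement(sample_a, sample_b):
    """Return 1 minus the two-sample KS statistic (1.0 means identical ECDFs)."""
    points = sorted(set(sample_a) | set(sample_b))
    max_gap = 0.0
    for x in points:
        # Empirical CDF of each sample evaluated at x
        cdf_a = sum(v <= x for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= x for v in sample_b) / len(sample_b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return 1.0 - max_gap


# Toy data (hypothetical): parent primary keys and the child table's foreign keys
real_parents = [1, 2, 3, 4]
real_child_fk = [1, 1, 2, 3, 3, 3]
synth_parents = [10, 11, 12, 13]
synth_child_fk = [10, 11, 11, 12, 12, 12]

real_cardinality = child_counts(real_parents, real_child_fk)
synth_cardinality = child_counts(synth_parents, synth_child_fk)
score = ks_complement(real_cardinality, synth_cardinality)
print(f"Cardinality similarity: {score:.2f}")
```

A score close to 1 indicates that the synthetic data reproduces the real distribution of children per parent; a generator that attaches every child row to a single parent would score noticeably lower.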
## Examples
We provide example notebooks to help you get started with `syntherela` in the [examples/](examples/) directory.
- [Evaluating Rossmann Subsampled Dataset](examples/evaluate_rossmann_subsampled.ipynb): A step-by-step guide to evaluating a subsampled version of the Rossmann dataset using various metrics.
## Replicating the paper's results
For detailed instructions on how to replicate the paper's results, please refer to [docs/REPLICATING_RESULTS.md](/docs/REPLICATING_RESULTS.md).
## Adding a new metric
The documentation for adding a new metric can be found in [docs/ADDING_A_METRIC.md](/docs/ADDING_A_METRIC.md).
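As a rough illustration of the kind of computation a custom single-column metric might perform, the sketch below measures the total variation distance between the category frequencies of a real and a synthetic column. The class name `CategoryFrequencyDistance`, the `compute` signature, and the returned dictionary are all hypothetical and do not use SyntheRela's base classes; the linked documentation describes the interface a metric actually has to implement.

```python
from collections import Counter


class CategoryFrequencyDistance:
    """Hypothetical metric: total variation distance between category frequencies."""

    name = "CategoryFrequencyDistance"

    def compute(self, real_column, synthetic_column):
        real_freq = Counter(real_column)
        synth_freq = Counter(synthetic_column)
        categories = set(real_freq) | set(synth_freq)
        # Half the sum of absolute differences between relative frequencies
        distance = 0.5 * sum(
            abs(
                real_freq[c] / len(real_column)
                - synth_freq[c] / len(synthetic_column)
            )
            for c in categories
        )
        return {"value": distance}


metric = CategoryFrequencyDistance()
result = metric.compute(["a", "a", "b", "c"], ["a", "b", "b", "c"])
print(result["value"])
```

The distance is 0 when both columns have identical category frequencies and 1 when they share no categories.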
\* Denotes the method does not have a public implementation available.
## 🏆 Leaderboard Submission
We maintain an official leaderboard to benchmark synthetic relational data generation methods. To ensure fairness and reproducibility, **all evaluations are performed by the SyntheRela maintainers** on standardized hardware.
### Evaluation Overview
| Feature | Specification |
| :--- | :--- |
| **Compute** | Single NVIDIA H100 (80GB) |
| **Time Limit** | 48 hours execution time **per dataset** |
| **Submission Frequency** | 1 submission per 30-day period |
| **Capacity** | Up to 2 model variants/checkpoints per submission |
### How to Submit
1. **Prepare your code:** Ensure your method is reproducible and includes a clear `README` and `requirements.txt`.
2. **Open an Issue:** Create a new [GitHub Issue](https://github.com/martinjurkovic/syntherela/issues) using the title prefix `[Model Submission]`.
For the complete requirements regarding environment setup, logging, and our privacy/confidentiality policy, please refer to our **[Full Submission Guidelines](https://docs.google.com/document/d/1ae16L_vvT5PFt2OeN7FJauA_ayd_A6xCkhVJFoYcx04)**.
## Conflicts of Interest
The authors declare no conflict of interest and are not associated with any of the evaluated commercial synthetic data providers.
## Citation
If you use SyntheRela in your work, please cite our paper:
```bibtex
@inproceedings{
  iclrsyntheticdata2025syntherela,
  title={SyntheRela: A Benchmark For Synthetic Relational Database Generation},
  author={Martin Jurkovic and Valter Hudovernik and Erik {\v{S}}trumbelj},
  booktitle={Will Synthetic Data Finally Solve the Data Access Problem?},
  year={2025},
  url={https://openreview.net/forum?id=ZfQofWYn6n}
}
```
## License
This project is licensed under the [MIT License](/LICENSE).