{"id":19015224,"url":"https://github.com/sdv-dev/sdgym","last_synced_at":"2026-02-28T12:46:25.181Z","repository":{"id":38340669,"uuid":"173002255","full_name":"sdv-dev/SDGym","owner":"sdv-dev","description":"Benchmarking synthetic data generation methods.","archived":false,"fork":false,"pushed_at":"2025-04-28T16:03:22.000Z","size":3198,"stargazers_count":273,"open_issues_count":18,"forks_count":63,"subscribers_count":15,"default_branch":"main","last_synced_at":"2025-05-02T21:43:17.082Z","etag":null,"topics":["benchmark","deep-learning","generative-adversarial-network","generative-ai","generative-models","sdgym-synthesizers","synthetic-data","synthetic-data-vault","tabular-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sdv-dev.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.md","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.rst","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-02-27T22:46:19.000Z","updated_at":"2025-04-28T16:03:24.000Z","dependencies_parsed_at":"2023-02-17T09:16:08.124Z","dependency_job_id":"d5130ed9-62b9-4f20-8651-3030ea8ae9e8","html_url":"https://github.com/sdv-dev/SDGym","commit_stats":{"total_commits":298,"total_committers":20,"mean_commits":14.9,"dds":0.6711409395973154,"last_synced_commit":"581e1e790f06aeaec8c47fede8dcb070eea21087"},"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sdv-dev%2FSDGym","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sdv-dev%
2FSDGym/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sdv-dev%2FSDGym/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sdv-dev%2FSDGym/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sdv-dev","download_url":"https://codeload.github.com/sdv-dev/SDGym/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254319715,"owners_count":22051072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","deep-learning","generative-adversarial-network","generative-ai","generative-models","sdgym-synthesizers","synthetic-data","synthetic-data-vault","tabular-data"],"created_at":"2024-11-08T19:36:14.017Z","updated_at":"2026-02-28T12:46:25.146Z","avatar_url":"https://github.com/sdv-dev.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n\u003cbr/\u003e\n\u003cp align=\"center\"\u003e\n    \u003ci\u003eThis repository is part of \u003ca href=\"https://sdv.dev\"\u003eThe Synthetic Data Vault Project\u003c/a\u003e, a project from \u003ca href=\"https://datacebo.com\"\u003eDataCebo\u003c/a\u003e.\u003c/i\u003e\n\u003c/p\u003e\n\n[![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)\n[![Travis](https://travis-ci.org/sdv-dev/SDGym.svg?branch=main)](https://travis-ci.org/sdv-dev/SDGym)\n[![PyPi 
Shield](https://img.shields.io/pypi/v/sdgym.svg)](https://pypi.python.org/pypi/sdgym)\n[![Downloads](https://pepy.tech/badge/sdgym)](https://pepy.tech/project/sdgym)\n[![Slack](https://img.shields.io/badge/Community-Slack-blue?style=plastic\u0026logo=slack)](https://bit.ly/sdv-slack-invite)\n\n\u003cdiv align=\"left\"\u003e\n\u003cbr/\u003e\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://github.com/sdv-dev/SDGym\"\u003e\n\u003cimg align=\"center\" width=40% src=\"https://github.com/sdv-dev/SDV/blob/stable/docs/images/SDGym-DataCebo.png\"\u003e\u003c/img\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\u003c/div\u003e\n\n\u003c/div\u003e\n\n# Overview\n\nThe Synthetic Data Gym (SDGym) is a benchmarking framework for modeling and generating\nsynthetic data. Measure performance and memory usage across different synthetic data modeling\ntechniques – classical statistics, deep learning and more!\n\n\u003cimg align=\"center\" src=\"docs/images/SDGym_Results.png\"\u003e\u003c/img\u003e\n\nThe SDGym library integrates with the Synthetic Data Vault ecosystem. You can use any of its\nsynthesizers, datasets or metrics for benchmarking. You can also customize the process to include\nyour own work.\n\n* **Datasets**: Select any of the publicly available datasets from the SDV project, or input your own data.\n* **Synthesizers**: Choose from any of the SDV synthesizers and baselines. Or write your own custom\nmachine learning model.\n* **Evaluation**: In addition to performance and memory usage, you can also measure synthetic data\nquality and privacy through a variety of metrics.\n\n# Install\n\nInstall SDGym using pip or conda. 
We recommend using a virtual environment to avoid conflicts with other software on your device.\n\n```bash\npip install sdgym\n```\n\n```bash\nconda install -c pytorch -c conda-forge sdgym\n```\n\nFor more information about using SDGym, visit the [SDGym Documentation](https://docs.sdv.dev/sdgym).\n\n# Usage\n\nLet's benchmark synthetic data generation for single tables. First, let's define which modeling\ntechniques we want to use. Let's choose a few synthesizers from the SDV library and a few others\nto use as baselines.\n\n```python\n# these synthesizers come from the SDV library\n# each one uses different modeling techniques\nsdv_synthesizers = ['GaussianCopulaSynthesizer', 'CTGANSynthesizer']\n\n# these basic synthesizers are available in SDGym\n# as baselines\nbaseline_synthesizers = ['UniformSynthesizer']\n```\n\nNow, we can benchmark the different techniques:\n```python\nimport sdgym\n\nsdgym.benchmark_single_table(\n    synthesizers=(sdv_synthesizers + baseline_synthesizers)\n)\n```\n\nThe result is a detailed performance, memory and quality evaluation across the synthesizers\non a variety of publicly available datasets.\n\n## Supplying a custom synthesizer\n\nBenchmark your own synthetic data generation techniques. 
Define your synthesizer by\nspecifying the training logic (using machine learning) and the sampling logic.\n\n```python\ndef my_training_logic(data, metadata):\n    # create an object to represent your synthesizer\n    # train it using the data\n    return synthesizer\n\ndef my_sampling_logic(trained_synthesizer, num_rows):\n    # use the trained synthesizer to create\n    # num_rows of synthetic data\n    return synthetic_data\n```\n\nLearn more in the [Custom Synthesizers Guide](https://docs.sdv.dev/sdgym/customization/synthesizers/custom-synthesizers).\n\n## Customizing your datasets\n\nThe SDGym library includes many publicly available datasets that you can include right away.\nList these using the ``get_available_datasets`` feature.\n\n```python\nsdgym.get_available_datasets()\n```\n\n```\ndataset_name   size_MB     num_tables\nKRK_v1         0.072128    1\nadult          3.907448    1\nalarm          4.520128    1\nasia           1.280128    1\n...\n```\n\nYou can also include any custom, private datasets that are stored on your computer or on an\nAmazon S3 bucket.\n\n```\nmy_datasets_folder = 's3://my-datasets-bucket'\n```\n\nFor more information, see the docs for [Customized Datasets](https://docs.sdv.dev/sdgym/customization/datasets).\n\n# What's next?\n\nVisit the [SDGym Documentation](https://docs.sdv.dev/sdgym) to learn more!\n\n---\n\n\n\u003cdiv align=\"center\"\u003e\n\u003ca href=\"https://datacebo.com\"\u003e\u003cimg align=\"center\" width=40% src=\"https://github.com/sdv-dev/SDV/blob/stable/docs/images/DataCebo.png\"\u003e\u003c/img\u003e\u003c/a\u003e\n\u003c/div\u003e\n\u003cbr/\u003e\n\u003cbr/\u003e\n\n[The Synthetic Data Vault Project](https://sdv.dev) was first created at MIT's [Data to AI Lab](\nhttps://dai.lids.mit.edu/) in 2016. 
After 4 years of research and traction with enterprise, we\ncreated [DataCebo](https://datacebo.com) in 2020 with the goal of growing the project.\nToday, DataCebo is the proud developer of SDV, the largest ecosystem for\nsynthetic data generation \u0026 evaluation. It is home to multiple libraries that support synthetic\ndata, including:\n\n* 🔄 Data discovery \u0026 transformation. Reverse the transforms to reproduce realistic data.\n* 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular,\n  multi table and time series data.\n* 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data\n  generation models.\n\n[Get started using the SDV package](https://sdv.dev/SDV/getting_started/install.html) -- a fully\nintegrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries\nfor specific needs.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsdv-dev%2Fsdgym","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsdv-dev%2Fsdgym","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsdv-dev%2Fsdgym/lists"}