{"id":15037411,"url":"https://github.com/hitsz-ids/synthetic-data-generator","last_synced_at":"2025-05-13T20:09:42.113Z","repository":{"id":212208576,"uuid":"676793232","full_name":"hitsz-ids/synthetic-data-generator","owner":"hitsz-ids","description":"SDG is a specialized framework designed to generate high-quality structured tabular data.","archived":false,"fork":false,"pushed_at":"2025-03-06T05:54:45.000Z","size":4395,"stargazers_count":2350,"open_issues_count":21,"forks_count":379,"subscribers_count":308,"default_branch":"main","last_synced_at":"2025-04-28T10:55:30.266Z","etag":null,"topics":["agent","data-generator","deep-learning","gan","generative-ai","llm","machine-learning","privacy","synthetic-data","tabular-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hitsz-ids.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-08-10T03:08:07.000Z","updated_at":"2025-04-28T09:47:57.000Z","dependencies_parsed_at":"2023-12-19T05:02:55.307Z","dependency_job_id":"8a24ea84-8f7c-4d11-aac8-d80c674041a3","html_url":"https://github.com/hitsz-ids/synthetic-data-generator","commit_stats":{"total_commits":152,"total_committers":15,"mean_commits":"10.133333333333333","dds":0.7302631578947368,"last_synced_commit":"fc5201e25733b35a6f8460671c4ae1e7f7453da9"},"previous_names":["hitsz-ids/synthetic-data-generator"],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/hitsz-ids%2Fsynthetic-data-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitsz-ids%2Fsynthetic-data-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitsz-ids%2Fsynthetic-data-generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitsz-ids%2Fsynthetic-data-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hitsz-ids","download_url":"https://codeload.github.com/hitsz-ids/synthetic-data-generator/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254020606,"owners_count":22000753,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","data-generator","deep-learning","gan","generative-ai","llm","machine-learning","privacy","synthetic-data","tabular-data"],"created_at":"2024-09-24T20:34:34.242Z","updated_at":"2025-05-13T20:09:42.088Z","avatar_url":"https://github.com/hitsz-ids.png","language":"Python","funding_links":[],"categories":["7. 
Training \u0026 Fine-tuning Ecosystem","Data Processing \u0026 ETL Agents"],"sub_categories":["NL AI Frameworks"],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/sdg_logo.png\" width=\"400\" \u003e\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\n\u003cp align=\"center\"\u003e\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://github.com/hitsz-ids/synthetic-data-generator/actions\"\u003e\u003cimg alt=\"Actions Status\" src=\"https://github.com/hitsz-ids/synthetic-data-generator/actions/workflows/ci-test-python-package.yml/badge.svg\"\u003e\u003c/a\u003e\n\u003ca href='https://synthetic-data-generator.readthedocs.io/en/latest/?badge=latest'\u003e\u003cimg src='https://readthedocs.org/projects/synthetic-data-generator/badge/?version=latest' alt='Documentation Status' /\u003e\u003c/a\u003e\n\u003ca href=\"https://results.pre-commit.ci/latest/github/hitsz-ids/synthetic-data-generator/main\"\u003e\u003cimg alt=\"pre-commit.ci status\" src=\"https://results.pre-commit.ci/badge/github/hitsz-ids/synthetic-data-generator/main.svg\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE\"\u003e\u003cimg alt=\"LICENSE\" src=\"https://img.shields.io/github/license/hitsz-ids/synthetic-data-generator\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/hitsz-ids/synthetic-data-generator/releases/\"\u003e\u003cimg alt=\"Releases\" src=\"https://img.shields.io/github/v/release/hitsz-ids/synthetic-data-generator\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/hitsz-ids/synthetic-data-generator/releases/\"\u003e\u003cimg alt=\"Pre Releases\" src=\"https://img.shields.io/github/v/release/hitsz-ids/synthetic-data-generator?include_prereleases\u0026label=pre-release\u0026logo=github\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/hitsz-ids/synthetic-data-generator\"\u003e\u003cimg alt=\"Last Commit\" 
src=\"https://img.shields.io/github/last-commit/hitsz-ids/synthetic-data-generator\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/hitsz-ids/synthetic-data-generator\"\u003e\u003cimg alt=\"Python version\" src=\"https://img.shields.io/pypi/pyversions/sdgx\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/hitsz-ids/synthetic-data-generator/contributors\"\u003e\u003cimg alt=\"contributors\" src=\"https://img.shields.io/github/all-contributors/hitsz-ids/synthetic-data-generator?color=ee8449\u0026style=flat-square\"\u003e\u003c/a\u003e\n\u003ca href=\"https://join.slack.com/t/hitsz-ids/shared_invite/zt-2395mt6x2-dwf0j_423QkAgGvlNA5E1g\"\u003e\u003cimg alt=\"slack\" src=\"https://img.shields.io/badge/slack-join%20chat-ff69b4.svg?style=flat-square\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n# 🚀 Synthetic Data Generator\n\n\u003cp style=\"font-size: small;\"\u003eSwitch Language:\n    \u003ca href=\"https://github.com/hitsz-ids/synthetic-data-generator/blob/main/README_ZH_CN.md\" target=\"_blank\"\u003e简体中文\u003c/a\u003e \u0026nbsp;| \u0026nbsp;\n    Latest \u003ca href=\"https://synthetic-data-generator.readthedocs.io/en/latest/\" target=\"value\"\u003eAPI Docs\u003c/a\u003e \u0026nbsp;| \u0026nbsp;\n    \u003ca href=\"ROADMAP.md\" target=\"value\"\u003eRoadmap\u003c/a\u003e \u0026nbsp;| \u0026nbsp;\n    Join \u003ca href=\"assets/live_QR_code.jpg\" target=\"value\"\u003eWechat Group\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp style=\"font-size: small;\"\u003e\n    Colab Examples:\u0026nbsp;\n    \u003ca href=\"https://colab.research.google.com/drive/1VFnP59q3eoVtMJ1PvcYjmuXtx9N8C7o0?usp=sharing\" target=\"value\"\u003e LLM: Data Synthesis\u003c/a\u003e\n    \u0026nbsp;| \u0026nbsp;\n    \u003ca href=\"https://colab.research.google.com/drive/1_chuTVZECpj5fklj-RAp7ZVrew8weLW_?usp=sharing\" target=\"value\"\u003e LLM: Off-Table Inference\u003c/a\u003e\n    \u0026nbsp;| \u0026nbsp;\n    \u003ca 
href=\"https://colab.research.google.com/drive/1cMB336jN3kb-m_pr1aJjshnNep_6bhsf?usp=sharing\" target=\"value\"\u003e Billion-Level-Data supported CTGAN\u003c/a\u003e\n\u003c/p\u003e\n\n\u003c/p\u003e\n\u003c/div\u003e\n\nThe Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data.\n\nSynthetic data does not contain any sensitive information, yet it retains the essential characteristics of the original data, making it exempt from privacy regulations such as GDPR and ADPPA.\n\nHigh-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc.\n\nWe are excited to have you here and look forward to your contributions. Get started with the project through this [Contributing Overview Guide](CONTRIBUTING.md)!\n\n## 💥 News\n\nOur current key achievements and timelines are as follows:\n\n🔥 Nov 21, 2024: 1) Model Integration - We've integrated the `GaussianCopula` model into our Data Processor System. Check out the code example in this [PR](https://github.com/hitsz-ids/synthetic-data-generator/pull/241); 2) Synthetic Quality - We implemented automatic detection of data column relationships and allowed for relationship specification, which improves the quality of synthetic data ([Code Example](https://synthetic-data-generator.readthedocs.io/en/latest/user_guides/single_table_column_combinations.html)); 3) Performance Enhancement - We significantly reduced the memory usage of GaussianCopula when handling discrete data, enabling training on thousands of categorical data entries with a `2C4G` setup!\n\n🔥 May 30, 2024: The Data Processor module was officially merged. 
This module will: 1) help SDG convert the format of some data columns (such as datetime columns) before they are fed into the model (so that they are not treated as discrete types), and convert the model-generated data back into the original format; 2) perform more customized pre-processing and post-processing on various data types; 3) easily deal with problems such as null values in the original data; 4) support the plug-in system.\n\n🔥 Feb 20, 2024: A single-table data synthesis model based on LLMs was added. See the Colab examples: \u003ca href=\"https://colab.research.google.com/drive/1VFnP59q3eoVtMJ1PvcYjmuXtx9N8C7o0?usp=sharing\" target=\"value\"\u003e LLM: Data Synthesis\u003c/a\u003e and \u003ca href=\"https://colab.research.google.com/drive/1_chuTVZECpj5fklj-RAp7ZVrew8weLW_?usp=sharing\" target=\"value\"\u003e LLM: Off-table Feature Inference\u003c/a\u003e.\n\n🔧 Feb 7, 2024: We improved `sdgx.data_models.metadata` to support metadata descriptions for single tables and multiple tables, multiple data types, and automatic data type inference. See the Colab example: \u003ca href=\"https://colab.research.google.com/drive/1b4ZTpgSYjOt7ekp1Wj8CxDknbOHEwA7s?usp=sharing\" target=\"value\"\u003eSDG Single-Table Metadata\u003c/a\u003e.\n\n🔶 Dec 20, 2023: v0.1.0 released. It includes a CTGAN model that supports processing billion-scale data. See our \u003ca href=\"https://github.com/hitsz-ids/synthetic-data-generator/tree/main/benchmarks#results\" target=\"value\"\u003e benchmark against SDV\u003c/a\u003e, where SDG consumed less memory and avoided crashing during training. 
For specific usage, see the Colab example: \u003ca href=\"https://colab.research.google.com/drive/1cMB336jN3kb-m_pr1aJjshnNep_6bhsf?usp=sharing\" target=\"value\"\u003e Billion-Level-Data supported CTGAN\u003c/a\u003e.\n\n🔆 Aug 10, 2023: First line of SDG code committed.\n\n## 🎉 LLM-integrated synthetic data generation\n\nLLMs have long been used to understand and generate various types of data, and they are also capable of generating tabular data. They additionally offer some abilities that traditional approaches (GAN-based or statistical methods) cannot provide.\n\nOur `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements two new features:\n\n### Synthetic data generation without Data\n\nNo training data is required: synthetic data can be generated from metadata alone. See our \u003ca href=\"https://colab.research.google.com/drive/1VFnP59q3eoVtMJ1PvcYjmuXtx9N8C7o0?usp=sharing\" target=\"value\"\u003e colab example\u003c/a\u003e.\n\n![Synthetic data generation without Data](assets/LLM_Case_1.gif)\n\n### Off-Table feature inference\n\nNew column data is inferred from the existing data in the table and the knowledge held by the LLM. See our \u003ca href=\"https://colab.research.google.com/drive/1_chuTVZECpj5fklj-RAp7ZVrew8weLW_?usp=sharing\" target=\"value\"\u003e colab example\u003c/a\u003e.\n\n![Off-Table feature inference](assets/LLM_Case_2.gif)\n\n## 💫 Why SDG?\n\n- Technological advancements:\n  - Supports a wide range of statistical data synthesis algorithms, and also integrates an LLM-based synthetic data generation model;\n  - Optimized for big data, effectively reducing memory consumption;\n  - Continuously tracks the latest advances in academia and industry, and introduces support for excellent algorithms and models in a timely manner.\n- Privacy enhancements:\n  - SDG supports differential privacy, anonymization and other methods to enhance the security of synthetic data.\n- Easy to extend:\n  - Supports expansion 
of models, data processing, data connectors, etc. in the form of plug-in packages.\n\n## 🌀 Quick Start\n\n### Pre-built image\n\nYou can use pre-built images to quickly experience the latest features.\n\n```bash\ndocker pull idsteam/sdgx:latest\n```\n\n### Install from PyPI\n\n```bash\npip install sdgx\n```\n\n### Local Install (Recommended)\n\nUse SDG by installing it from the source code.\n\n```bash\ngit clone git@github.com:hitsz-ids/synthetic-data-generator.git\ncd synthetic-data-generator\npip install .\n# Or install from git\npip install git+https://github.com/hitsz-ids/synthetic-data-generator.git\n```\n\n### Quick Demo of Single Table Data Generation and Metric\n\n#### Demo code\n\n```python\nfrom sdgx.data_connectors.csv_connector import CsvConnector\nfrom sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel\nfrom sdgx.synthesizer import Synthesizer\nfrom sdgx.utils import download_demo_data\n\n# This will download demo data to ./dataset\ndataset_csv = download_demo_data()\n\n# Create data connector for csv file\ndata_connector = CsvConnector(path=dataset_csv)\n\n# Initialize synthesizer, use CTGAN model\nsynthesizer = Synthesizer(\n    model=CTGANSynthesizerModel(epochs=1),  # For quick demo\n    data_connector=data_connector,\n)\n\n# Fit the model\nsynthesizer.fit()\n\n# Sample\nsampled_data = synthesizer.sample(1000)\nprint(sampled_data)\n```\n\n#### Comparison\n\nReal data are as follows:\n\n```python\n\u003e\u003e\u003e data_connector.read()\n       age         workclass  fnlwgt  education  ...  capitalloss hoursperweek native-country  class\n0        2         State-gov   77516  Bachelors  ...            0            2  United-States  \u003c=50K\n1        3  Self-emp-not-inc   83311  Bachelors  ...            0            0  United-States  \u003c=50K\n2        2           Private  215646    HS-grad  ...            0            2  United-States  \u003c=50K\n3        3           Private  234721       11th  ...            
0            2  United-States  \u003c=50K\n4        1           Private  338409  Bachelors  ...            0            2           Cuba  \u003c=50K\n...    ...               ...     ...        ...  ...          ...          ...            ...    ...\n48837    2           Private  215419  Bachelors  ...            0            2  United-States  \u003c=50K\n48838    4               NaN  321403    HS-grad  ...            0            2  United-States  \u003c=50K\n48839    2           Private  374983  Bachelors  ...            0            3  United-States  \u003c=50K\n48840    2           Private   83891  Bachelors  ...            0            2  United-States  \u003c=50K\n48841    1      Self-emp-inc  182148  Bachelors  ...            0            3  United-States   \u003e50K\n\n[48842 rows x 15 columns]\n\n```\n\nSynthetic data are as follows：\n\n```python\n\u003e\u003e\u003e sampled_data\n     age workclass  fnlwgt     education  ...  capitalloss hoursperweek native-country  class\n0      1       NaN   28219  Some-college  ...            0            2    Puerto-Rico  \u003c=50K\n1      2   Private  250166       HS-grad  ...            0            2  United-States   \u003e50K\n2      2   Private   50304       HS-grad  ...            0            2  United-States  \u003c=50K\n3      4   Private   89318     Bachelors  ...            0            2    Puerto-Rico   \u003e50K\n4      1   Private  172149     Bachelors  ...            0            3  United-States  \u003c=50K\n..   ...       ...     ...           ...  ...          ...          ...            ...    ...\n995    2       NaN  208938     Bachelors  ...            0            1  United-States  \u003c=50K\n996    2   Private  166416     Bachelors  ...            2            2  United-States  \u003c=50K\n997    2       NaN  336022       HS-grad  ...            0            1  United-States  \u003c=50K\n998    3   Private  198051       Masters  ...            
0            2  United-States   \u003e50K\n999    1       NaN   41973       HS-grad  ...            0            2  United-States  \u003c=50K\n\n[1000 rows x 15 columns]\n```\n\n## 👩‍🎓 Related Work\n\n- CTGAN: [Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)\n- C3-TGAN: [C3-TGAN- Controllable Tabular Data Synthesis with Explicit Correlations and Property Constraints](https://www.researchgate.net/publication/374652636_C3-TGAN-_Controllable_Tabular_Data_Synthesis_with_Explicit_Correlations_and_Property_Constraints)\n- TVAE: [Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)\n- table-GAN: [Data Synthesis based on Generative Adversarial Networks](https://arxiv.org/pdf/1806.03384.pdf)\n- CTAB-GAN: [CTAB-GAN: Effective Table Data Synthesizing](https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf)\n- OCT-GAN: [OCT-GAN: Neural ODE-based Conditional Tabular GANs](https://arxiv.org/pdf/2105.14969.pdf)\n\n## 🤝 Join Community\n\nThe SDG project was initiated by the **Institute of Data Security, Harbin Institute of Technology**. If you are interested in our project, you are welcome to join our community. 
We welcome organizations, teams, and individuals who share our commitment to data protection and security through open source:\n\n- Read [CONTRIBUTING](./CONTRIBUTING.md) before drafting a pull request.\n- Submit an issue, browse the [Good First Issue](https://github.com/hitsz-ids/synthetic-data-generator/labels/good%20first%20issue) list, or submit a pull request.\n- Join our WeChat group through the QR code.\n\n\u003cdiv align=\"left\"\u003e\n  \u003cimg src=\"assets/live_QR_code.jpg\" width=\"200\" \u003e\n\u003c/div\u003e\n\n## 📄 License\n\nThe SDG open source project is released under the Apache-2.0 license. For details, refer to the [LICENSE](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhitsz-ids%2Fsynthetic-data-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhitsz-ids%2Fsynthetic-data-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhitsz-ids%2Fsynthetic-data-generator/lists"}