{"id":47603897,"url":"https://github.com/Diyago/Tabular-data-generation","last_synced_at":"2026-04-01T19:01:18.813Z","repository":{"id":39724574,"uuid":"248953490","full_name":"Diyago/Tabular-data-generation","owner":"Diyago","description":"GANs are well known for their success in realistic image generation, but they can also be applied to tabular data generation. This repository reviews and examines recent papers on tabular GANs in action.","archived":false,"fork":false,"pushed_at":"2026-03-28T05:43:46.000Z","size":55531,"stargazers_count":566,"open_issues_count":0,"forks_count":84,"subscribers_count":9,"default_branch":"master","last_synced_at":"2026-03-28T10:35:34.774Z","etag":null,"topics":["adversarial-filtering","deep-learning","feature-engineering","gan","gans","machine-learning","python","tabular-data","train-dataframe"],"latest_commit_sha":null,"homepage":"https://towardsdatascience.com/review-of-gans-for-tabular-data-a30a2199342","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Diyago.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.rst","dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2020-03-21T10:31:31.000Z","updated_at":"2026-03-28T06:29:52.000Z","dependencies_parsed_at":"2024-05-29T10:38:54.922Z","dependency_job_id":null,"html_url":"https://github.com/Diyago/Tabular-data-generation","commit_stats":{"total_commits":132,"total_committers":7,"mean_commits":"18.857142857142858","dds":"0.24242424242424243","last_s
ynced_commit":"400d26ffe36621d8ba7f4d2c2d023da01a94d849"},"previous_names":["diyago/tabular-data-generation"],"tags_count":55,"template":false,"template_full_name":null,"purl":"pkg:github/Diyago/Tabular-data-generation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Diyago%2FTabular-data-generation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Diyago%2FTabular-data-generation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Diyago%2FTabular-data-generation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Diyago%2FTabular-data-generation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Diyago","download_url":"https://codeload.github.com/Diyago/Tabular-data-generation/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Diyago%2FTabular-data-generation/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31291007,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T13:12:26.723Z","status":"ssl_error","status_checked_at":"2026-04-01T13:12:25.102Z","response_time":53,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adversarial-filtering","deep-learning","feature-engineering","gan","gans","machine-learning","python","tabular-data","train-dataframe"],"created_at":"2026-04-01T19:00:38.740Z","updated_at":"2026-04-01T19:01:18.805Z","avatar_url":"https://github.com/Diyago.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/tabular_gan.png\" height=\"120\" alt=\"TabGAN logo\"\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eTabGAN\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\u003cstrong\u003eHigh-quality synthetic tabular data generation\u003c/strong\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/tabgan/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/tabgan.svg\" alt=\"PyPI Version\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/tabgan/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/pyversions/tabgan?v=3.0.2\" alt=\"Python Version\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pepy.tech/project/tabgan\"\u003e\u003cimg src=\"https://pepy.tech/badge/tabgan\" alt=\"Downloads\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://opensource.org/licenses/Apache-2.0\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-Apache%202.0-blue.svg\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/psf/black\"\u003e\u003cimg src=\"https://img.shields.io/badge/code%20style-black-000000.svg\" alt=\"Code style: black\"\u003e\u003c/a\u003e\n  \u003ca 
href=\"https://www.codefactor.io/repository/github/diyago/tabular-data-generation\"\u003e\u003cimg src=\"https://www.codefactor.io/repository/github/diyago/tabular-data-generation/badge\" alt=\"CodeFactor\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/diyago/Tabular-data-generation/actions/workflows/codeql.yml\"\u003e\u003cimg src=\"https://github.com/diyago/Tabular-data-generation/workflows/CodeQL/badge.svg\" alt=\"CodeQL\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://insafq-tabgan.hf.space\"\u003e\u003cimg src=\"https://img.shields.io/badge/%F0%9F%A4%97%20Spaces-TabGAN%20Demo-blue\" alt=\"HF Space\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://colab.research.google.com/github/Diyago/Tabular-data-generation/blob/master/examples/tabgan_examples.ipynb\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n## Overview\n\nTabGAN provides a unified Python interface for generating synthetic tabular data using multiple state-of-the-art generative approaches:\n\n| Approach | Backend | Strengths |\n|----------|---------|-----------|\n| **GANs** | Conditional Tabular GAN (CTGAN) | Mixed data types, complex multivariate distributions |\n| **Diffusion Models** | ForestDiffusion (tree-based gradient boosting) | High-fidelity generation for structured data |\n| **Large Language Models** | GReaT framework | Capturing semantic dependencies, conditional text generation |\n| **Baseline** | Random sampling with replacement | Quick benchmarking and comparison |\n\nAll generators share a common pipeline: **generate \u0026rarr; post-process \u0026rarr; adversarial filter**, ensuring synthetic data stays close to the real data distribution.\n\n*Based on the paper: [Tabular GANs for uneven distribution](https://arxiv.org/abs/2010.00638) (arXiv:2010.00638)*\n\n## Key Features\n\n- **Unified API** \u0026mdash; switch between GANs, diffusion models, and LLMs with a single parameter 
change\n- **Adversarial filtering** \u0026mdash; built-in LightGBM-based validation keeps synthetic samples distribution-consistent\n- **Mixed data types** \u0026mdash; native handling of continuous, categorical, and free-text columns\n- **Conditional generation** \u0026mdash; generate text conditioned on categorical attributes via LLM prompting\n- **LLM API support** \u0026mdash; integrate with LM Studio, OpenAI, Ollama, or any OpenAI-compatible endpoint\n- **Quality validation** \u0026mdash; compare original and synthetic distributions with a single function call\n- **AutoSynth** \u0026mdash; automatically run all generators, compare quality \u0026 privacy, pick the best one\n- **HuggingFace integration** \u0026mdash; synthesize any HF dataset in one call, push results back to Hub\n- **[Live Demo](https://insafq-tabgan.hf.space)** \u0026mdash; try it in browser on HuggingFace Spaces\n\n## Installation\n\n```bash\npip install tabgan\n```\n\n## Quick Start\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom tabgan.sampler import GANGenerator\n\ntrain = pd.DataFrame(np.random.randint(-10, 150, size=(150, 4)), columns=list(\"ABCD\"))\ntarget = pd.DataFrame(np.random.randint(0, 2, size=(150, 1)), columns=list(\"Y\"))\ntest = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list(\"ABCD\"))\n\nnew_train, new_target = GANGenerator().generate_data_pipe(train, target, test)\n```\n\n## Available Generators\n\n| Generator | Description | Best For |\n|-----------|-------------|----------|\n| `GANGenerator` | CTGAN-based generation | General tabular data with mixed types |\n| `ForestDiffusionGenerator` | Diffusion models with tree-based methods | Complex tabular structures |\n| `BayesianGenerator` | Gaussian Copula with marginal preservation | Fast, correlation-preserving generation |\n| `LLMGenerator` | Large Language Model based | Semantic dependencies, text columns |\n| `OriginalGenerator` | Baseline random sampler | Benchmarking and comparison |\n\n## 
API Reference\n\n### Common Parameters\n\nAll generators accept the following parameters:\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `gen_x_times` | `float` | `1.1` | Multiplier for synthetic sample count relative to training size |\n| `cat_cols` | `list` | `None` | Column names to treat as categorical |\n| `bot_filter_quantile` | `float` | `0.001` | Lower quantile for post-processing filters |\n| `top_filter_quantile` | `float` | `0.999` | Upper quantile for post-processing filters |\n| `is_post_process` | `bool` | `True` | Enable quantile-based post-filtering |\n| `pregeneration_frac` | `float` | `2` | Oversampling factor before filtering |\n| `only_generated_data` | `bool` | `False` | Return only synthetic rows (exclude originals) |\n| `gen_params` | `dict` | See below | Generator-specific hyperparameters |\n\n### Generator-Specific Parameters (`gen_params`)\n\n**GANGenerator:**\n```python\n{\"batch_size\": 500, \"patience\": 25, \"epochs\": 500}\n```\n\n**LLMGenerator:**\n```python\n{\"batch_size\": 32, \"epochs\": 4, \"llm\": \"distilgpt2\", \"max_length\": 500}\n```\n\n### `generate_data_pipe` Method\n\n```python\nnew_train, new_target = generator.generate_data_pipe(\n    train_df,           # pd.DataFrame - training features\n    target,             # pd.DataFrame - target variable (or None)\n    test_df,            # pd.DataFrame - test features for distribution alignment\n    deep_copy=True,     # bool - copy input DataFrames\n    only_adversarial=False,  # bool - skip generation, only filter\n    use_adversarial=True,    # bool - enable adversarial filtering\n)\n```\n\n**Returns:** `Tuple[pd.DataFrame, pd.DataFrame]` \u0026mdash; `(new_train, new_target)`\n\n## Data Format\n\nTabGAN accepts `pandas.DataFrame` inputs with:\n\n- **Continuous columns** \u0026mdash; any real-valued numerical data\n- **Categorical columns** \u0026mdash; discrete columns with a finite set of values\n\n\u003e **Note:** TabGAN 
processes values as floating-point internally. Apply rounding after generation for integer-valued outputs.\n\n## Examples\n\n### Basic Usage with All Generators\n\n```python\nfrom tabgan.sampler import (\n    OriginalGenerator, GANGenerator, ForestDiffusionGenerator,\n    BayesianGenerator, LLMGenerator,\n)\nimport pandas as pd\nimport numpy as np\n\ntrain = pd.DataFrame(np.random.randint(-10, 150, size=(150, 4)), columns=list(\"ABCD\"))\ntarget = pd.DataFrame(np.random.randint(0, 2, size=(150, 1)), columns=list(\"Y\"))\ntest = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list(\"ABCD\"))\n\nnew_train1, new_target1 = OriginalGenerator().generate_data_pipe(train, target, test)\nnew_train2, new_target2 = GANGenerator(\n    gen_params={\"batch_size\": 500, \"epochs\": 10, \"patience\": 5}\n).generate_data_pipe(train, target, test)\nnew_train3, new_target3 = ForestDiffusionGenerator().generate_data_pipe(train, target, test)\nnew_train4, new_target4 = BayesianGenerator().generate_data_pipe(train, target, test)\nnew_train5, new_target5 = LLMGenerator(\n    gen_params={\"batch_size\": 32, \"epochs\": 4, \"llm\": \"distilgpt2\", \"max_length\": 500}\n).generate_data_pipe(train, target, test)\n```\n\n### Full Parameter Example\n\n```python\nnew_train, new_target = GANGenerator(\n    gen_x_times=1.1,\n    cat_cols=None,\n    bot_filter_quantile=0.001,\n    top_filter_quantile=0.999,\n    is_post_process=True,\n    adversarial_model_params={\n        \"metrics\": \"AUC\", \"max_depth\": 2, \"max_bin\": 100,\n        \"learning_rate\": 0.02, \"random_state\": 42, \"n_estimators\": 100,\n    },\n    pregeneration_frac=2,\n    only_generated_data=False,\n    gen_params={\"batch_size\": 500, \"patience\": 25, \"epochs\": 500},\n).generate_data_pipe(\n    train, target, test,\n    deep_copy=True,\n    only_adversarial=False,\n    use_adversarial=True,\n)\n```\n\n### LLM Conditional Text Generation\n\nGenerate synthetic rows with novel text values conditioned on 
categorical attributes:\n\n```python\nimport pandas as pd\nfrom tabgan.sampler import LLMGenerator\n\ntrain = pd.DataFrame({\n    \"Name\": [\"Anna\", \"Maria\", \"Ivan\", \"Sergey\", \"Olga\", \"Boris\"],\n    \"Gender\": [\"F\", \"F\", \"M\", \"M\", \"F\", \"M\"],\n    \"Age\": [25, 30, 35, 40, 28, 32],\n    \"Occupation\": [\"Engineer\", \"Doctor\", \"Artist\", \"Teacher\", \"Manager\", \"Pilot\"],\n})\n\nnew_train, _ = LLMGenerator(\n    gen_x_times=1.5,\n    text_generating_columns=[\"Name\"],      # columns to generate novel text for\n    conditional_columns=[\"Gender\"],         # columns that condition text generation\n    gen_params={\"batch_size\": 32, \"epochs\": 4, \"llm\": \"distilgpt2\", \"max_length\": 500},\n    is_post_process=False,\n).generate_data_pipe(train, target=None, test_df=None, only_generated_data=True)\n```\n\n**How it works:**\n1. Sample conditional column values from their empirical distributions\n2. Impute remaining non-text columns using the fitted GReaT model\n3. Generate novel text via prompt-based generation\n4. 
Ensure generated text values differ from the original data\n\n### LLM API-Based Text Generation\n\nUse external LLM APIs (LM Studio, OpenAI, Ollama) instead of local models:\n\n```python\nimport pandas as pd\nfrom tabgan.sampler import LLMGenerator\nfrom tabgan.llm_config import LLMAPIConfig\n\ntrain = pd.DataFrame({\n    \"Name\": [\"Anna\", \"Maria\", \"Ivan\", \"Sergey\", \"Olga\", \"Boris\"],\n    \"Gender\": [\"F\", \"F\", \"M\", \"M\", \"F\", \"M\"],\n    \"Age\": [25, 30, 35, 40, 28, 32],\n    \"Occupation\": [\"Engineer\", \"Doctor\", \"Artist\", \"Teacher\", \"Manager\", \"Pilot\"],\n})\n\n# LM Studio\napi_config = LLMAPIConfig.from_lm_studio(\n    base_url=\"http://localhost:1234\",\n    model=\"google/gemma-3-12b\",\n    timeout=90,\n)\n\n# Or OpenAI:  LLMAPIConfig.from_openai(api_key=\"...\", model=\"gpt-4\")\n# Or Ollama:  LLMAPIConfig.from_ollama(model=\"llama3\")\n\nnew_train, _ = LLMGenerator(\n    gen_x_times=1.5,\n    text_generating_columns=[\"Name\"],\n    conditional_columns=[\"Gender\"],\n    gen_params={\"batch_size\": 32, \"epochs\": 4, \"llm\": \"distilgpt2\", \"max_length\": 500},\n    llm_api_config=api_config,\n    is_post_process=False,\n).generate_data_pipe(train, target=None, test_df=None, only_generated_data=True)\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eLLM API Configuration Options\u003c/strong\u003e\u003c/summary\u003e\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `base_url` | `str` | `\"http://localhost:1234\"` | API server base URL |\n| `model` | `str` | `\"google/gemma-3-12b\"` | Model identifier |\n| `api_key` | `str` | `None` | API key for authentication |\n| `timeout` | `int` | `90` | Request timeout in seconds |\n| `max_tokens` | `int` | `256` | Maximum tokens to generate |\n| `temperature` | `float` | `0.7` | Sampling temperature |\n| `system_prompt` | `str` | `None` | System prompt for generation |\n\n**Testing the connection:**\n\n```python\nfrom 
tabgan.llm_config import LLMAPIConfig\nfrom tabgan.llm_api_client import LLMAPIClient\n\nconfig = LLMAPIConfig.from_lm_studio()\nwith LLMAPIClient(config) as client:\n    print(f\"API available: {client.check_connection()}\")\n    print(f\"Generated: {client.generate('Generate a female name: ')}\")\n```\n\n\u003c/details\u003e\n\n### Improving Model Performance\n\n```python\nimport sklearn.datasets\nimport sklearn.ensemble\nimport sklearn.metrics\nimport sklearn.model_selection\nimport pandas as pd\nfrom tabgan.sampler import GANGenerator\n\ndef evaluate(clf, X_train, y_train, X_test, y_test):\n    clf.fit(X_train, y_train)\n    return sklearn.metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])\n\ndataset = sklearn.datasets.load_breast_cancer()\nclf = sklearn.ensemble.RandomForestClassifier(n_estimators=25, max_depth=6)\nX_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(\n    pd.DataFrame(dataset.data),\n    pd.DataFrame(dataset.target, columns=[\"target\"]),\n    test_size=0.33, random_state=42,\n)\n\nprint(\"Baseline:\", evaluate(clf, X_train, y_train, X_test, y_test))\n\nnew_train, new_target = GANGenerator().generate_data_pipe(X_train, y_train, X_test)\nprint(\"With GAN:\", evaluate(clf, new_train, new_target, X_test, y_test))\n```\n\n### Time-Series Data Generation\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom tabgan.utils import get_year_mnth_dt_from_date, collect_dates\nfrom tabgan.sampler import GANGenerator\n\ntrain = pd.DataFrame(np.random.randint(-10, 150, size=(100, 4)), columns=list(\"ABCD\"))\nmin_date, max_date = pd.to_datetime(\"2019-01-01\"), pd.to_datetime(\"2021-12-31\")\nd = (max_date - min_date).days + 1\ntrain[\"Date\"] = min_date + pd.to_timedelta(np.random.randint(d, size=100), unit=\"d\")\ntrain = get_year_mnth_dt_from_date(train, \"Date\")\n\nnew_train, _ = GANGenerator(\n    gen_x_times=1.1, cat_cols=[\"year\"],\n    bot_filter_quantile=0.001, top_filter_quantile=0.999,\n    is_post_process=True, pregeneration_frac=2,\n).generate_data_pipe(train.drop(\"Date\", axis=1), None, 
train.drop(\"Date\", axis=1))\n\nnew_train = collect_dates(new_train)\n```\n\n## Quality Report\n\nGenerate a self-contained HTML report comparing original and synthetic data across multiple quality axes: column statistics, PSI, correlation heatmaps, distribution plots, and ML utility (TSTR vs TRTR).\n\n```python\nfrom tabgan import QualityReport\n\nreport = QualityReport(\n    original_df, synthetic_df,\n    cat_cols=[\"gender\"],\n    target_col=\"target\",      # enables ML utility evaluation\n).compute()\n\n# Export to a single HTML file (charts embedded as base64)\nreport.to_html(\"quality_report.html\")\n\n# Or access metrics programmatically\nsummary = report.summary()\nprint(f\"Overall score: {summary['overall_score']}\")\nprint(f\"Mean PSI: {summary['psi']['mean']}\")\nprint(f\"ML utility ratio: {summary['ml_utility']['utility_ratio']}\")\n```\n\nFor a quick comparison without the full report:\n\n```python\nfrom tabgan.utils import compare_dataframes\n\nscore = compare_dataframes(original_df, generated_df)  # 0.0 (poor) to 1.0 (excellent)\n```\n\n## Constraints\n\nEnforce business rules on generated data. 
Constraints are applied as a post-generation step — invalid rows are repaired or filtered out.\n\n```python\nfrom tabgan import GANGenerator, RangeConstraint, UniqueConstraint, FormulaConstraint, RegexConstraint\n\nnew_train, new_target = GANGenerator(gen_x_times=1.5).generate_data_pipe(\n    train, target, test,\n    constraints=[\n        RangeConstraint(\"age\", min_val=0, max_val=120),\n        UniqueConstraint(\"email\"),\n        FormulaConstraint(\"end_date \u003e start_date\"),\n        RegexConstraint(\"zip_code\", r\"\\d{5}\"),\n    ],\n)\n```\n\n**Available constraints:**\n\n| Constraint | Description | Fix strategy |\n|------------|-------------|--------------|\n| `RangeConstraint` | Numeric values within `[min, max]` | Clips values to bounds |\n| `UniqueConstraint` | No duplicate values in a column | Drops duplicate rows |\n| `FormulaConstraint` | Boolean expression via `df.eval()` | Filters violating rows |\n| `RegexConstraint` | String values match a regex pattern | Filters non-matching rows |\n\nThe `ConstraintEngine` supports two strategies: `\"fix\"` (repair then filter) and `\"filter\"` (drop violations only):\n\n```python\nfrom tabgan import ConstraintEngine, RangeConstraint\n\nengine = ConstraintEngine(\n    constraints=[RangeConstraint(\"price\", min_val=0)],\n    strategy=\"fix\",  # or \"filter\"\n)\ncleaned_df = engine.apply(generated_df)\n```\n\n## Privacy Metrics\n\nAssess re-identification risk of synthetic data before sharing. 
Includes Distance to Closest Record (DCR), Nearest Neighbor Distance Ratio (NNDR), and membership inference risk.\n\n```python\nfrom tabgan import PrivacyMetrics\n\npm = PrivacyMetrics(original_df, synthetic_df, cat_cols=[\"gender\"])\nsummary = pm.summary()\n\nprint(f\"Overall privacy score: {summary['overall_privacy_score']}\")  # 0 (risky) to 1 (private)\nprint(f\"DCR mean: {summary['dcr']['mean']}\")\nprint(f\"NNDR mean: {summary['nndr']['mean']}\")\nprint(f\"Membership inference AUC: {summary['membership_inference']['auc']}\")  # closer to 0.5 = better\n```\n\n**Metrics explained:**\n\n| Metric | What it measures | Good value |\n|--------|-----------------|------------|\n| **DCR** | Distance from each synthetic row to nearest real row | Higher = more private |\n| **NNDR** | Ratio of 1st/2nd nearest neighbor distances | Closer to 1.0 |\n| **MI AUC** | Can a classifier tell if a record was in training data? | Closer to 0.5 |\n\n## sklearn Pipeline Integration\n\nUse `TabGANTransformer` to insert synthetic data augmentation into an sklearn `Pipeline`:\n\n```python\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.ensemble import RandomForestClassifier\nfrom tabgan import TabGANTransformer\n\npipe = Pipeline([\n    (\"augment\", TabGANTransformer(gen_x_times=1.5, cat_cols=[\"gender\"])),\n    (\"model\", RandomForestClassifier()),\n])\n\n# fit() generates synthetic data and trains the model on augmented data\npipe.fit(X_train, y_train)\n```\n\nWorks with any generator and supports constraints:\n\n```python\nfrom tabgan import TabGANTransformer, GANGenerator, RangeConstraint\n\ntransformer = TabGANTransformer(\n    generator_class=GANGenerator,\n    gen_x_times=2.0,\n    gen_params={\"batch_size\": 500, \"epochs\": 10, \"patience\": 5},\n    constraints=[RangeConstraint(\"age\", min_val=0, max_val=120)],\n)\n\nX_augmented = transformer.fit_transform(X_train, y_train)\ny_augmented = transformer.get_augmented_target()\n```\n\n## AutoSynth\n\nDon't know which 
generator works best for your data? **AutoSynth** runs all of them and picks the winner based on quality and privacy scores:\n\n```python\nfrom tabgan import AutoSynth\n\nresult = AutoSynth(df, target_col=\"label\").run()\n\nprint(result.report)\n#   Generator          Status  Score  Quality  Privacy  Rows  Time (s)\n# 0 GAN (CTGAN)        OK      0.847  0.891    0.743    165   12.3\n# 1 Forest Diffusion   OK      0.812  0.834    0.761    165   45.1\n# 2 Random Baseline    OK      0.654  0.621    0.732    165   0.1\n\nbest_synthetic = result.best_data\nprint(f\"Winner: {result.best_name}\")\n```\n\nCustomize scoring weights:\n\n```python\nresult = AutoSynth(\n    df,\n    target_col=\"label\",\n    quality_weight=0.5,   # equal weight\n    privacy_weight=0.5,\n).run()\n```\n\n## HuggingFace Hub Integration\n\nSynthesize any tabular dataset from HuggingFace Hub in one call:\n\n```python\nfrom tabgan import synthesize_hf_dataset\n\n# Load → Generate → Evaluate automatically\nresult = synthesize_hf_dataset(\"scikit-learn/iris\", target_col=\"target\")\nprint(result.synthetic_df.head())\nprint(f\"Quality: {result.quality_summary['overall_score']}\")\n\n# Push synthetic dataset back to Hub\nresult = synthesize_hf_dataset(\n    \"scikit-learn/iris\",\n    target_col=\"target\",\n    push_to_hub=True,\n    hub_repo_id=\"your-username/iris-synthetic\",\n)\n```\n\n## Command-Line Interface\n\n```bash\ntabgan-generate \\\n    --input-csv train.csv \\\n    --target-col target \\\n    --generator gan \\\n    --gen-x-times 1.5 \\\n    --cat-cols year,gender \\\n    --output-csv synthetic_train.csv\n```\n\n## Pipeline Architecture\n\n![Experiment design and workflow](images/workflow.png)\n\n```\nInput (train_df, target, test_df)\n  |\n  v\n[Preprocess] --\u003e Validate DataFrames, prepare columns\n  |\n  v\n[Generate]  --\u003e CTGAN / ForestDiffusion / GReaT LLM / Random sampling\n  |\n  v\n[Post-process] --\u003e Quantile-based filtering against test distribution\n  |\n  
v\n[Adversarial Filter] --\u003e LightGBM classifier removes dissimilar samples\n  |\n  v\nOutput (synthetic_df, synthetic_target)\n```\n\n## Benchmark Results\n\nNormalized ROC AUC scores (higher is better):\n\n| Dataset | No augmentation | GAN | Sample Original |\n|---------|:-:|:-:|:-:|\n| credit | 0.997 | **0.998** | 0.997 |\n| employee | **0.986** | 0.966 | 0.972 |\n| mortgages | 0.984 | 0.964 | **0.988** |\n| poverty_A | 0.937 | **0.950** | 0.933 |\n| taxi | 0.966 | 0.938 | **0.987** |\n| adult | 0.995 | 0.967 | **0.998** |\n\n## Citation\n\n```bibtex\n@misc{ashrapov2020tabular,\n    title={Tabular GANs for uneven distribution},\n    author={Insaf Ashrapov},\n    year={2020},\n    eprint={2010.00638},\n    archivePrefix={arXiv},\n    primaryClass={cs.LG}\n}\n```\n\n## References\n\n1. Xu, L., \u0026 Veeramachaneni, K. (2018). *Synthesizing Tabular Data using Generative Adversarial Networks*. arXiv:1811.11264.\n2. Jolicoeur-Martineau, A., Fatras, K., \u0026 Kachman, T. (2023). *Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees*. SamsungSAILMontreal/ForestDiffusion.\n3. Xu, L., Skoularidou, M., Cuesta-Infante, A., \u0026 Veeramachaneni, K. (2019). *Modeling Tabular data using Conditional GAN*. NeurIPS.\n4. Borisov, V., Sessler, K., Leemann, T., Pawelczyk, M., \u0026 Kasneci, G. (2023). *Language Models are Realistic Tabular Data Generators*. ICLR.\n\n## License\n\nApache License 2.0 \u0026mdash; see [LICENSE](LICENSE) for details.\n","funding_links":[],"categories":["Table of Contents","Machine Learning","The Data Science Toolbox"],"sub_categories":["Miscellaneous Tools"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDiyago%2FTabular-data-generation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDiyago%2FTabular-data-generation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDiyago%2FTabular-data-generation/lists"}