{"id":50632985,"url":"https://github.com/jns-m/at-gan","last_synced_at":"2026-06-07T00:01:52.846Z","repository":{"id":359862914,"uuid":"1247513234","full_name":"Jns-M/at-gan","owner":"Jns-M","description":"A Tabular GAN framework for generating synthetic tabular data from arbitrary mixed-type tabular datasets.","archived":false,"fork":false,"pushed_at":"2026-05-31T14:13:22.000Z","size":101,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-31T16:12:09.240Z","etag":null,"topics":["gan","generative-adversarial-network","generative-adversarial-networks","keras","machine-learning","synthetic-data","synthetic-data-generation","synthetic-tabular-data","tabular-data","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Jns-M.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-23T12:19:05.000Z","updated_at":"2026-05-31T14:13:26.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Jns-M/at-gan","commit_stats":null,"previous_names":["jns-m/at-gan"],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/Jns-M/at-gan","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jns-M%2Fat-gan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jns-M%2Fat-gan/tags","releases_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/Jns-M%2Fat-gan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jns-M%2Fat-gan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Jns-M","download_url":"https://codeload.github.com/Jns-M/at-gan/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jns-M%2Fat-gan/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34003814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-06T02:00:07.033Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gan","generative-adversarial-network","generative-adversarial-networks","keras","machine-learning","synthetic-data","synthetic-data-generation","synthetic-tabular-data","tabular-data","tensorflow"],"created_at":"2026-06-07T00:01:08.660Z","updated_at":"2026-06-07T00:01:52.810Z","avatar_url":"https://github.com/Jns-M.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# AT-GAN\n\n### Arbitrary Tabular Generative Adversarial Network\n\n*A Tabular GAN framework for generating synthetic tabular data from arbitrary mixed-type tabular 
datasets.*\n\n[![Python](https://img.shields.io/badge/python-3.10--3.12-blue.svg)](https://www.python.org/)\n[![TensorFlow](https://img.shields.io/badge/TensorFlow-2.x-purple.svg)](https://www.tensorflow.org/)\n[![Keras](https://img.shields.io/badge/Keras-3.x-red.svg)](https://keras.io/)\n[![W\u0026B](https://img.shields.io/badge/tracking-Weights%20%26%20Biases-yellow.svg)](https://wandb.ai/)\n[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/Jns-M/at-gan/blob/main/LICENSE)\n[![PyPI](https://img.shields.io/pypi/v/at-gan)](https://pypi.org/project/at-gan/)\n\n\u003c/div\u003e\n\n---\n\n## Table of Contents\n\n1. [Overview](#overview)\n1. [Key Features](#key-features)\n1. [Installation](#installation)\n1. [CLI Usage](#cli-usage)\n1. [API Usage](#api-usage)\n1. [Configuration Reference](#configuration-reference)\n1. [In-Training Evaluation Suite](#in-training-evaluation-suite-1)\n1. [Synthetic Data Evaluation (Post-Training)](#synthetic-data-evaluation-post-training-1)\n\n---\n\n## Overview\n\n**AT-GAN** is a framework for training Generative Adversarial Networks on **arbitrary tabular data**. 
It is designed to work with *continuous*, *binary*, *discrete count*, and *categorical* features within a single pipeline.\n\nThe framework combines a **multi-branch generator** (G), a **PacGAN-style discriminator** (D), an integrated **evaluation suite**, and **[Weights \u0026 Biases](https://wandb.ai/)** (W\u0026B) sweep orchestration, experiment tracking, and training monitoring and visualization.\n\n\n\u003e **Goal:** Training a GAN that is capable of producing realistic synthetic tabular data from a given dataset with minimal manual tuning and a transparent, observable training process.\n\n---\n\n## Key Features\n\n### Dynamic, Config-Driven Architectures\n- Generator and Discriminator built **entirely from a YAML config**.\n- Configurable number of `layers` and `units`.\n- Configurable activations: `relu`, `leaky_relu`, `elu`, or any other activation supported in Keras.\n- Configurable `dropout` layers.\n- Optional `Batch Normalization` for G.\n\n### Mixed-Type Data Handling\n- The `TabularPreprocessor` handles **four types of input features**:\n  - **Continuous** → `MinMaxScaler(-1, 1)` → `tanh` output branch.\n  - **Discrete Count** → `MinMaxScaler(0, 1)` → `sigmoid` output branch.\n  - **Binary** → 0/1 and optional β-distributed noise application → `sigmoid` output branch.\n  - **Categorical** → One-hot encoding and optional label-preserving smoothing → `softmax` output branch.\n- Per-column decimal precision preservation.\n- Scalers and encoders are stored and reused for inference.\n\n### GAN Training and Stabilization Techniques\n| Technique                           | Controlled by                       | What it does                                                                                    |\n|-------------------------------------|-------------------------------------|-------------------------------------------------------------------------------------------------|\n| **PacGAN packing**                  | `discriminator.pack_size`           | 
Concatenates *k* rows into a single D input → fights mode collapse                              |\n| **One-sided label smoothing**       | `discriminator.label_smoothing_min` | Real labels sampled from `[min, 1.0]` instead of hard `1.0`                                     |\n| **Label flipping**                  | `discriminator.label_flipping`      | Random fraction of real labels flipped to `0` to prevent D overconfidence                       |\n| **TTUR**                            | `g_lr` / `d_lr`                     | Different LRs for G and D. Sweeps auto-clamp `d_lr ≤ g_lr`                                      |\n| **G:D update ratio**                | `g_updates_per_epoch`               | Multiple G steps per D step to balance the training process                                     |\n| **LR Cosine decay + warm restarts** | `lr_cosine_decay`                   | `CosineDecayRestarts` schedule with configurable `alpha` floor for the learning rate of G and D |\n| **Adam `beta_1` override**          | `adam_beta_1`                       | Typically lowered from default `0.9` for training stability                                     |\n| **Gradient clipping**               | *always-on*                         | `clipnorm=1.0` on both Adam optimizers                                                          |\n\n### In-Training Evaluation Suite\nRuns every `eval_frequency` epochs on held-out real samples, logs results to W\u0026B, and saves the **best** checkpoint by error score. 
See [In-Training Evaluation Suite](#in-training-evaluation-suite-1).\n\n### Experiment Tracking\n[Weights \u0026 Biases](https://wandb.ai/) integration:\n- Per-epoch loss/metric logging via a dedicated `WandbCallback`.\n- Training visuals: **correlation heatmaps** + **PCA overlap scatter plots**.\n- Local-only mode when `--no-wandb` is set (uses `run_id=\"offline_run\"`).\n\n### Sweeps \u0026 Neural Architecture Search\n- W\u0026B sweeps for **Neural Architecture Search** (NAS) and **Hyperparameter Optimization**.\n- Mechanism to resume existing W\u0026B sweeps (and single runs).\n\n### Synthetic Data Evaluation (Post-Training)\n- **Privacy**: Distance to Closest Record (DCR)\n- **Statistic Fidelity**: [Synthetic Data Vault](https://github.com/sdv-dev/sdv) (SDV)\n- **Utility Retention**: Train on Synthetic, Test on Real (TSTR)\n\n### Usage Modes\n- 🖥️ **CLI**: `train`, `sweep`, `generate`, `evaluate`.\n- 🐍 **Python API** (`at_gan.api`): `train`, `sweep`, `generate`, `evaluate`.\n\n---\n\n## Installation\n\n**Requirements:** Python `3.10 – 3.12` and dependencies listed in `pyproject.toml`.\n\n### Option A: Standard installation from [PyPI](https://pypi.org/project/at-gan/) (recommended)\n\n```shell script\npip install at-gan\n```\n\n### Option B: Core-only installation from [PyPI](https://pypi.org/project/at-gan-core) (recommended for Docker \u0026 GPU support)\n\n```shell script\npip install at-gan-core\n```\nNote: This installation does not include *TensorFlow* in its dependencies, which makes it suitable for training with GPU support enabled, e.g. in Docker containers with preconfigured CUDA/cuDNN environments.\n\n### Option C: Editable install from the [GitHub Repository](https://github.com/Jns-M/at-gan)\n\n1. Clone the [GitHub repository](https://github.com/Jns-M/at-gan)\n1. 
Run the following command:\n\n```shell script\npip install -e .\n```\n\n\n### Verify installation\n\n```shell script\nat-gan --help\npython -c \"import at_gan; print(at_gan.__version__)\"\n```\n\n\n### Weights \u0026 Biases Login (one-time)\n\n```shell script\nwandb login\n```\n\n\n\u003e 💡 You can use this framework without W\u0026B by passing `--no-wandb` (CLI) or `enable_wandb=False` (API).\n\n---\n\n## CLI Usage\n\n```shell script\nat-gan --help\n```\n\n\n### `train`: Run or resume a single GAN training run\n\n| Flag                     | Short      | Default    | Description                        |\n|--------------------------|------------|------------|------------------------------------|\n| `--config`               | `-c`       | *required* | Path to the YAML experiment config |\n| `--wandb / --no-wandb`   | `-w / -nw` | `--wandb`  | Toggle W\u0026B tracking                |\n| `--export / --no-export` | `-e / -ne` | `--export` | Save `.keras` generator file       |\n| `--generate-samples`     | `-g`       | `1000`     | Auto-generate *N* samples post-training |\n\n**Examples:**\n\n```shell script\nat-gan train -c configs/config.yaml -w -e -g 5000\n```\n\nNote: A run can be resumed via the `resume_run_id` config key. 
See [Configuration Reference](#configuration-reference).\n\n---\n\n### `sweep`: Run or resume a W\u0026B sweep\n\n\n| Flag | Short | Description                                      |\n|---|---|--------------------------------------------------|\n| `--base-config` | `-c` | Baseline experiment config                       |\n| `--sweep-config` | `-s` | W\u0026B sweep config (required for new sweeps)       |\n| `--count` | `-n` | Max runs this agent will execute                 |\n| `--sweep-id` | `-id` | Resume an existing sweep instead of creating one |\n\n```shell script\n# Launch a new 50-run sweep\nat-gan sweep -c configs/config.yaml -s configs/sweep_config.yaml -n 50\n\n# Resume an existing sweep\nat-gan sweep -c configs/config.yaml -id abc123 -n 20\n```\n\n---\n\n### `generate`: Generate synthetic samples from a trained generator\n\n| Flag | Short | Description                                    |\n|---|---|------------------------------------------------|\n| `--config` | `-c` | YAML used during the **original** training run |\n| `--run-id` | `-r` | W\u0026B run ID or `\"offline_run\"`                  |\n| `--samples` | `-n` | Number of samples to generate                  |\n| `--output` | `-o` | Optional override for CSV output path          |\n\n```shell script\nat-gan generate -c configs/config.yaml -r a1b2c3 -n 10000 -o synthetic_data.csv\n```\n\nNote: `generate` always loads **`best_generator.keras`**, not the latest checkpoint.\n\n---\n\n### `evaluate`: Run synthetic data evaluation (post-training)\n\n| Flag          | Short | Description                                           |\n|---------------|-------|-------------------------------------------------------|\n| `--real`      | `-r`  | Path to the real data CSV                             |\n| `--synthetic` | `-s`  | Path to the synthetic data CSV                        |\n| `--target`    | `-t`  | Discrete target column for (optional) TSTR evaluation |\n\n```shell script\nat-gan evaluate -r real_data.csv -s 
synthetic_data.csv -t target_column\n```\n\nNote: TSTR evaluation is only performed if a discrete feature (i.e. binary or categorical) is specified as the target.\n\n---\n\n## API Usage\n\nThe Python API exposes the same primary functions as the CLI, making it easy to integrate into existing projects.\n\nSee `examples/api_example.py` and `examples/api_example.ipynb` in the [GitHub Repository](https://github.com/Jns-M/at-gan) for a full API usage example.\n\n\u003e Note: The `train` entry point also accepts a `dict` instead of a path to a YAML file as input.\n\n---\n\n## Configuration Reference\n\nExperiments are driven by **two YAML files**: a base config and a sweep config.\n\nSee `configs/config.yaml` and `configs/sweep_config.yaml` in the [GitHub Repository](https://github.com/Jns-M/at-gan) for examples and recommended default values for most datasets.\n\n### Base Config Reference\n\n```yaml\n# =============================================================\n#  EXPERIMENT META\n# =============================================================\nexperiment_name: \"test_experiment\"   # also output directory name\nresume_run_id:   null                # W\u0026B run id to resume from checkpoint (optional)\nseed:            1130                # seeds Python, NumPy, TensorFlow\n\n# =============================================================\n#  DATA\n# =============================================================\ndata:\n  dataset_path: \"datasets/example.csv\"\n  output_path:  \"experiments/\"          # run artifacts found in 'output_path/experiment_name/run_id/'\n\n  # Column routing: every column the GAN should learn MUST be listed here\n  continuous_cols:     [\"age\", \"heart_rate\", \"glucose\"]\n  binary_cols:         [\"male\", \"smoker\"]\n  discrete_count_cols: [\"cigs_per_day\"]\n  categorical_cols:    [\"education\"]\n\n  # Preprocessing toggles\n  treat_bin_as_cat:    false    # route binary cols through OHE + softmax\n  beta_noise:          true     # 
Apply Beta-distributed noise on binary cols\n  smooth_categorical:  true     # Apply label-preserving noise on OHE groups\n\n# =============================================================\n#  MODEL\n# =============================================================\nmodel:\n  latent_dim: 32             \n\n  generator:\n    units:        [64, 64]     \n    dropout:      0.0\n    activation:   \"relu\"        # relu | leaky_relu | elu | ...\n    batch_norm:   true          # BatchNorm after each Dense layer\n    # negative_slope: 0.2       # used only when activation == \"leaky_relu\"\n\n  discriminator:\n    units:               [256, 256]\n    dropout:             0.2\n    activation:          \"leaky_relu\"\n    negative_slope:      0.2\n    pack_size:           3      # PacGAN packing factor (1 disables packing)\n    label_smoothing_min: 0.9    # e.g. real labels ~ [0.9, 1.0]\n    label_flipping:      0.05   # e.g. 5% of real labels flipped to 0 each step\n\n# =============================================================\n#  TRAINING\n# =============================================================\ntraining:\n  device:               \"cpu\"   # \"cpu\" or \"gpu\"\n  epochs:               2000\n  batch_size:           512\n  g_updates_per_epoch:  2       # G steps per D step\n\n  # Optimizers\n  adam_beta_1:          0.5     # GAN-stable Adam beta_1\n  g_lr:                 0.0002  # G Learning Rate\n  d_lr:                 0.0003  # D Learning Rate\n\n  # LR schedule\n  lr_cosine_decay:                 true\n  lr_cosine_decay_restart_epochs:  2000   # restart every N epochs\n  g_lr_decay_alpha:                0.1    # minimum G LR fraction (floor)\n  d_lr_decay_alpha:                0.1    # minimum D LR fraction (floor)\n\n  # Evaluation \u0026 checkpointing\n  checkpoint_frequency: 100     # save \"latest\" every N epochs\n  eval_frequency:       100     # run evaluation suite every N epochs\n  test_split_pct:       0.2     # percentage of data to hold out for 
in-training evaluation\n```\n\n\n### Sweep Config Reference\n\n```yaml\n# =============================================================\n#  SWEEP STRATEGY \u0026 METRICS\n# =============================================================\nmethod: bayes              \n\nmetric:\n  name: Eval/Total_Error     # W\u0026B log key\n  goal: minimize\n\nearly_terminate:\n  type: hyperband            # Kills unpromising runs early to save compute time\n  min_iter: 300              # Don't kill any run before e.g. epoch 300\n  eta: 3                     # The halving rate for the Hyperband brackets\n\n# =============================================================\n# PARAMETERS\n# =============================================================\nparameters:\n\n  # Sweeps choose from a fixed set of hyperparameter values\n  model.latent_dim:\n    values: [ 16, 32, 64, 128, 256 ]\n\n  # -----------------------------------------------------------\n  #  Generator Architecture\n  # -----------------------------------------------------------\n  generator.num_hidden_layers:\n    values: [ 2, 3, 4 ]\n  generator.base_units:\n    values: [ 32, 64, 128, 256, 512 ]\n  generator.max_units:\n    value: 512                         \n  generator.architecture_shape:\n    values: [ \"block\", \"ascending\", \"descending\" ] \n  \n  generator.dropout:\n    value: 0.0                                   # e.g. 
Fixed to 0.0\n  generator.activation:\n    values: [ 'relu', 'leaky_relu' ]\n  generator.batch_norm:\n    values: [ true, false ]\n\n  # -----------------------------------------------------------\n  #  Discriminator Architecture\n  # -----------------------------------------------------------\n  discriminator.num_hidden_layers:\n    values: [ 2, 3, 4 ]\n  discriminator.base_units:\n    values: [ 32, 64, 128, 256, 512 ]\n  discriminator.max_units:\n    value: 512\n  discriminator.architecture_shape:\n    values: [ \"block\", \"ascending\", \"descending\" ]\n  \n  discriminator.dropout:\n    values: [ 0.0, 0.2, 0.3, 0.5 ]              \n  discriminator.activation:\n    values: [ 'relu', 'leaky_relu' ]\n  discriminator.negative_slope:\n    values: [ 0.1, 0.2, 0.3 ]                    \n  discriminator.pack_size:\n    values: [ 1, 3 ]                             \n  discriminator.label_smoothing_min:\n    values: [ 0.85, 0.9, 0.95, 1.0 ]             \n  discriminator.label_flipping:\n    values: [ 0.0, 0.05, 0.1 ]                  \n\n  # -----------------------------------------------------------\n  #  Training Loop \u0026 Optimizers\n  # -----------------------------------------------------------\n  training.batch_size:\n    values: [ 64, 128, 256, 512 ]\n  training.g_updates_per_epoch:\n    values: [ 1, 2, 3 ]                          \n  training.adam_beta_1:\n    values: [ 0.2, 0.5, 0.7, 0.9 ] \n  \n  # Learning Rates\n  training.g_lr:\n    distribution: log_uniform_values\n    min: 0.00001                                 \n    max: 0.001                                  \n  training.d_lr:\n    distribution: log_uniform_values\n    min: 0.000005                               \n    max: 0.0005                      # at-gan ensures d_lr \u003c= g_lr\n\n  # Cosine Decay Warm Restart Parameters\n  training.lr_cosine_decay_restart_epochs:\n    distribution: int_uniform\n    min: 100\n    max: 1000\n  training.g_lr_decay_alpha:\n    distribution: log_uniform_values\n   
 min: 0.01                                    # Decay to 1% of max LR\n    max: 1                                       # No decay\n  training.d_lr_decay_alpha:\n    distribution: log_uniform_values\n    min: 0.01\n    max: 1\n```\n\n---\n\n## In-Training Evaluation Suite\n\nEvery `eval_frequency` epochs, `GANCallback` generates synthetic samples and runs an evaluation against the held-out real samples to guide the hyperparameter sweep:\n\n| Metric            | Computation                                                                                                               |\n|-------------------|---------------------------------------------------------------------------------------------------------------------------|\n| PCA Error         | First Wasserstein distance between real and synthetic data across the first five PCA components                           |\n| Adversarial Error | Absolute AUC deviation of a Random Forest classifier trained to distinguish real and synthetic data (`\\|AUC - 0.5\\| × 2`) |\n| **Total Error**   | `sqrt((pca_error² + adv_error²) / 2.0)`                                                                                   |\n\nRaw errors are passed through a squashing function (`1 - exp(-x)`) so components ∈ `[0, 1]`.\n\n### Visual artifacts (auto-logged to W\u0026B)\n- **Correlation heatmaps**: real, synthetic, and absolute difference.\n- **PCA scatter overlay**: first two principal components of real vs. synthetic.\n\n## Synthetic Data Evaluation (Post-Training)\n\nThe `evaluate` command runs a comprehensive benchmark suite that assesses the quality of the synthetic data generated by the GAN:\n\n1. **Privacy (DCR)**: Distance to Closest Record. Measures the minimum Euclidean distance (in standard deviations) between synthetic rows and real training rows. Absence of exact memorization is guaranteed if ``Min. DCR \u003e 0``.\n2. 
**Statistic Fidelity (SDV)**: Uses the [Synthetic Data Vault](https://github.com/sdv-dev/SDV) (`sdmetrics` package) to generate a Quality Report, comparing 1D marginal distributions (Column Shapes) and 2D correlations (Column Pair Trends).\n3. **Utility Retention (TSTR)**: Train on Synthetic, Test on Real.\n    * Splits real data into `real_train` (80%) and `real_test` (20%).\n    * Trains a **TRTR** baseline (`RandomForest`, `GradientBoosting`, `LogisticRegression`) on `real_train` → baseline F1 on `real_test`.\n    * Trains **TSTR** models on the entire synthetic set → F1 on the **same** `real_test`.\n    * Reports `TSTR_Mean_F1 / TRTR_Mean_F1 × 100` (F1-Score Retention in %).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjns-m%2Fat-gan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjns-m%2Fat-gan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjns-m%2Fat-gan/lists"}