{"id":48684749,"url":"https://github.com/yandex-research/tab-ddpm","last_synced_at":"2026-04-11T03:57:27.563Z","repository":{"id":60899516,"uuid":"544618040","full_name":"yandex-research/tab-ddpm","owner":"yandex-research","description":"[ICML 2023] The official implementation of the paper \"TabDDPM: Modelling Tabular Data with Diffusion Models\"","archived":false,"fork":false,"pushed_at":"2024-07-13T04:02:14.000Z","size":187,"stargazers_count":543,"open_issues_count":23,"forks_count":132,"subscribers_count":5,"default_branch":"main","last_synced_at":"2026-04-11T03:57:24.586Z","etag":null,"topics":["ai","deep-learning","diffusion-models","pytorch","synthetic-data","tabular"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2209.15421","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yandex-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-02T23:01:07.000Z","updated_at":"2026-04-10T03:49:14.000Z","dependencies_parsed_at":"2024-09-04T22:37:38.172Z","dependency_job_id":"61b41c86-674b-4bf5-b61e-d15a07517b2f","html_url":"https://github.com/yandex-research/tab-ddpm","commit_stats":null,"previous_names":["rotot0/tab-ddpm"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/yandex-research/tab-ddpm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yandex-research%2Ftab-ddpm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yandex-research%2Ftab-ddpm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/yandex-research%2Ftab-ddpm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yandex-research%2Ftab-ddpm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yandex-research","download_url":"https://codeload.github.com/yandex-research/tab-ddpm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yandex-research%2Ftab-ddpm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31668050,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-10T17:19:37.612Z","status":"online","status_checked_at":"2026-04-11T02:00:05.776Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","deep-learning","diffusion-models","pytorch","synthetic-data","tabular"],"created_at":"2026-04-11T03:57:26.958Z","updated_at":"2026-04-11T03:57:27.555Z","avatar_url":"https://github.com/yandex-research.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TabDDPM: Modelling Tabular Data with Diffusion Models\nThis is the official code for our paper \"TabDDPM: Modelling Tabular Data with Diffusion Models\" ([paper](https://arxiv.org/abs/2209.15421))\n\n\u003c!-- ## Results\nYou can view all the results and build your own tables with this [notebook](notebooks/Reports.ipynb). --\u003e\n\n## Setup the environment\n1. 
Install [conda](https://docs.conda.io/en/latest/miniconda.html) (just to manage the env).\n2. Run the following commands:\n    ```bash\n    export REPO_DIR=/path/to/the/code\n    cd $REPO_DIR\n\n    conda create -n tddpm python=3.9.7\n    conda activate tddpm\n\n    pip install torch==1.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html\n    pip install -r requirements.txt\n\n    # if the following commands do not succeed, update conda\n    conda env config vars set PYTHONPATH=${PYTHONPATH}:${REPO_DIR}\n    conda env config vars set PROJECT_DIR=${REPO_DIR}\n\n    conda deactivate\n    conda activate tddpm\n    ```\n\n## Running the experiments\n\nHere we describe the necessary information for reproducing the experimental results.  \nUse `agg_results.ipynb` to print results for all datasets and all methods.\n\n### Datasets\n\nWe upload the datasets used in the paper with our train/val/test splits (link below). We do not impose restrictions beyond the original dataset licenses; the sources of the data are listed in the paper appendix. 
\n\nYou can download the datasets with the following commands:\n\n``` bash\nconda activate tddpm\ncd $PROJECT_DIR\nwget \"https://www.dropbox.com/s/rpckvcs3vx7j605/data.tar?dl=0\" -O data.tar\ntar -xvf data.tar\n```\n\n### File structure\n`tab-ddpm/` -- implementation of the proposed method  \n`tuned_models/` -- tuned hyperparameters of the evaluation model (CatBoost or MLP)\n\nAll main scripts are in the `scripts/` folder:\n\n- `scripts/pipeline.py` -- train, sample and eval TabDDPM using a given config  \n- `scripts/tune_ddpm.py` -- tune hyperparameters of TabDDPM\n- `scripts/eval_[catboost|mlp|simple].py` -- evaluate synthetic data using a tuned evaluation model or simple models\n- `scripts/eval_seeds.py` -- eval using multiple sampling and multiple eval seeds\n- `scripts/eval_seeds_simple.py` -- eval using multiple sampling and multiple eval seeds (for simple models)\n- `scripts/tune_evaluation_model.py` -- tune hyperparameters of eval model (CatBoost or MLP)\n- `scripts/resample_privacy.py` -- privacy calculation  \n\nExperiments folder (`exp/`):\n- All results and synthetic data are stored in the `exp/[ds_name]/[exp_name]/` folder\n- `exp/[ds_name]/config.toml` is a base config for tuning TabDDPM\n- `exp/[ds_name]/eval_[catboost|mlp].json` stores results of evaluation (`scripts/eval_seeds.py`)  \n\nTo understand the structure of the `config.toml` file, read `CONFIG_DESCRIPTION.md`.\n\nBaselines:\n- `smote/`\n- `CTGAN/` -- TVAE [official repo](https://github.com/sdv-dev/CTGAN)\n- `CTAB-GAN/` -- [official repo](https://github.com/Team-TUD/CTAB-GAN)\n- `CTAB-GAN-Plus/` -- [official repo](https://github.com/Team-TUD/CTAB-GAN-Plus)\n\n### Examples\n\n\u003cins\u003eRun TabDDPM tuning.\u003c/ins\u003e   \n\nTemplate and example (`--eval_seeds` is optional): \n```bash\npython scripts/tune_ddpm.py [ds_name] [train_size] synthetic [catboost|mlp] [exp_name] --eval_seeds\npython scripts/tune_ddpm.py churn2 6500 synthetic catboost ddpm_tune 
--eval_seeds\n```\n\n\u003cins\u003eRun TabDDPM pipeline.\u003c/ins\u003e   \n\nTemplate and example  (`--train`, `--sample`, `--eval` are optional): \n```bash\npython scripts/pipeline.py --config [path_to_your_config] --train --sample --eval\npython scripts/pipeline.py --config exp/churn2/ddpm_cb_best/config.toml --train --sample\n```\nIt takes approximately 7min to run the script above (NVIDIA GeForce RTX 2080 Ti).  \n\n\u003cins\u003eRun evaluation over seeds\u003c/ins\u003e   \nBefore running evaluation, you have to train the model with the given hyperparameters (the example above).  \n\nTemplate and example: \n```bash\npython scripts/eval_seeds.py --config [path_to_your_config] [n_eval_seeds] [ddpm|smote|ctabgan|ctabgan-plus|tvae] synthetic [catboost|mlp] [n_sample_seeds]\npython scripts/eval_seeds.py --config exp/churn2/ddpm_cb_best/config.toml 10 ddpm synthetic catboost 5\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyandex-research%2Ftab-ddpm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyandex-research%2Ftab-ddpm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyandex-research%2Ftab-ddpm/lists"}