{"id":20837765,"url":"https://github.com/astrazeneca/subtab","last_synced_at":"2025-05-08T20:29:48.610Z","repository":{"id":62096774,"uuid":"414183752","full_name":"AstraZeneca/SubTab","owner":"AstraZeneca","description":"The official implementation of the paper, \"SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning\"","archived":false,"fork":false,"pushed_at":"2022-07-01T09:03:38.000Z","size":44696,"stargazers_count":143,"open_issues_count":1,"forks_count":19,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-01T20:37:14.623Z","etag":null,"topics":["contrastive-learning","multi-view-learning","representation-learning","self-supervised-learning","tabular-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AstraZeneca.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-10-06T11:26:17.000Z","updated_at":"2025-04-16T09:33:54.000Z","dependencies_parsed_at":"2022-10-26T10:45:37.600Z","dependency_job_id":null,"html_url":"https://github.com/AstraZeneca/SubTab","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FSubTab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FSubTab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FSubTab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FSubTab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/AstraZeneca","download_url":"https://codeload.github.com/AstraZeneca/SubTab/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253144311,"owners_count":21861035,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["contrastive-learning","multi-view-learning","representation-learning","self-supervised-learning","tabular-data"],"created_at":"2024-11-18T01:08:31.968Z","updated_at":"2025-05-08T20:29:48.574Z","avatar_url":"https://github.com/AstraZeneca.png","language":"Python","readme":"# SubTab: \n##### Author: Talip Ucar (ucabtuc@gmail.com)\n\nThe official implementation of the paper, \n\n[SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning](https://arxiv.org/abs/2110.04361)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/subtab-subsetting-features-of-tabular-data/unsupervised-mnist-on-mnist)](https://paperswithcode.com/sota/unsupervised-mnist-on-mnist?p=subtab-subsetting-features-of-tabular-data)\n\n:large_orange_diamond: **Note:** The extended version of SubTab with codes and pre-processed data for Adult Income and BlogFeedback datasets can be found at: https://github.com/talipucar/SubTab_extended\n\n# Table of Contents:\n\n1. [Model](#model)\n2. [Environment](#environment)\n3. [Data](#data)\n4. [Configuration](#configuration)\n5. [Training and Evaluation](#training-and-evaluation)\n6. [Adding New Datasets](#adding-new-datasets)\n7. [Results](#results)\n8. [Experiment tracking](#experiment-tracking)\n9. 
[Citing the paper](#citing-the-paper)\n10. [Citing this repo](#citing-this-repo)\n\n\nNeurIPS 2021 slides        |  NeurIPS 2021 poster\n:-------------------------:|:-------------------------:\n[![NeurIPS 2021 slides](./assets/presentation_cover.png)](./assets/NeurIPS_2021_slides.pdf)  |  [![NeurIPS 2021 poster](./assets/poster_cover.png)](./assets/NeurIPS_2021_poster.pdf)\n\n\n# Model\n\n![SubTab](./assets/SubTab.gif)\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick for a slower version of the animation\u003c/summary\u003e\n\n![SubTab](./assets/SubTab_slow.gif)\n\n\u003c/details\u003e\n\n\n# Environment\nWe used Python 3.7 for our experiments. The environment can be set up by following these three steps:\n\n```\npip install pipenv             # To install pipenv if you don't have it already\npipenv install --skip-lock     # To install required packages. \npipenv shell                   # To activate virtual env\n```\n\nIf the second step fails, you can install the packages listed in the Pipfile individually with pip, i.e. \"pip install package_name\". \n\n# Data\nThe MNIST dataset is already provided to demo the framework. For your own dataset, follow the instructions in [Adding New Datasets](#adding-new-datasets).\n\n# Configuration\nThere are two types of configuration files:\n```\n1. runtime.yaml\n2. mnist.yaml\n```\n\n1. ```runtime.yaml``` is a high-level configuration file used by all datasets to:\n\n   - define the random seed\n   - turn on/off mlflow (Default: False)\n   - turn on/off python profiler (Default: False)\n   - set data directory\n   - set results directory\n\n2. The second configuration file is dataset-specific and is used to configure the architecture of the model, loss functions, and so on. \n\n   - For example, we set up a configuration file for the MNIST dataset with the same name. \n   Please note that the name of the configuration file should be the same as the name of the dataset with all letters in lowercase. 
\n   - We can have configuration files for other datasets such as **tcga.yaml** and **income.yaml** for the tcga and income datasets, respectively.\n\n\n\n# Training and Evaluation\nYou can train and evaluate the model by using:\n\n```\npython train.py # For training. \npython eval.py  # For evaluation\n```\n\n   - ```train.py``` will also run evaluation at the end of the training. \n   - You can also run evaluation separately by using ```eval.py```.\n   - For a list of arguments, please see ```./utils/arguments.py```\n     - Use the ```-h``` argument to get help when running scripts.\n     - Use ```-d dataset_name``` to run scripts on new datasets \n\n# Adding New Datasets\n\nFor each new dataset, you can use the following steps:\n\n1. Provide a ```_load_dataset_name()``` function, similar to the [MNIST load function](https://github.com/AstraZeneca/SubTab/blob/070b2ef73fceb0531d2b1d1fc32f7eda4fe5c966/utils/load_data.py#L174-L190)\n\n   - For example, you can add ```_load_tcga()``` for the tcga dataset, or ```_load_income()``` for the income dataset. \n   - The function should return (x_train, y_train, x_test, y_test)\n\n2. Add a separate ```elif``` condition in [this section](https://github.com/AstraZeneca/SubTab/blob/070b2ef73fceb0531d2b1d1fc32f7eda4fe5c966/utils/load_data.py#L110-L112) within the ```_load_data()``` method of the ```TabularDataset()``` class in ```utils/load_data.py```\n\n3. Create a new config file with the same name as the dataset name.\n   - For example, ```tcga.yaml``` for the tcga dataset, or ```income.yaml``` for the income dataset.\n   - You can also duplicate one of the existing configuration files (e.g. mnist.yaml), and rename it.\n\n   - Make sure that the new config file is under the ```config/``` directory.\n\n4. Provide a data folder with pre-processed training and test sets, and place it under the ```./data/``` directory. \nYou can also do the train-test split and pre-processing within your custom ```_load_dataset_name()``` function.\n\n5. 
(Optional) If you want to place the new dataset under a different directory than the local \"./data/\", then:\n   - Place the dataset folder anywhere, and set its root directory in [this line](https://github.com/AstraZeneca/SubTab/blob/070b2ef73fceb0531d2b1d1fc32f7eda4fe5c966/config/runtime.yaml#L5)\nof ```./config/runtime.yaml```. \n\n   - For example, if the path to the tcga dataset is ```/home/.../data/tcga/```, \n   you only need to include ```/home/.../data/``` in ```runtime.yaml```. \n   The code will fill in the ```tcga``` folder name from the name given in the command line argument\n   (e.g. ```-d dataset_name```; in this case, dataset_name would be tcga).\n\n# Structure of the repo\n\u003cpre\u003e\n- train.py\n- eval.py\n\n- src\n    |-model.py\n    \n- config\n    |-runtime.yaml\n    |-mnist.yaml\n    \n- utils\n    |-load_data.py\n    |-arguments.py\n    |-model_utils.py\n    |-loss_functions.py\n    ...\n    \n- data\n    |-mnist\n    ...\n    \n- results\n    |\n    ...\n\u003c/pre\u003e\n\n# Results\n\nResults at the end of training are saved under the ```./results``` directory. The results directory structure is as follows:\n\n\u003cpre\u003e\n- results\n    |-dataset name\n            |-evaluation\n                |-clusters (for plotting t-SNE and PCA plots of embeddings)\n                |-reconstructions (not used)\n            |-training\n                |-model_mode (e.g. ae for autoencoder)   \n                     |-model\n                     |-plots\n                     |-loss\n\u003c/pre\u003e\n\nYou can save the results of evaluations under the \"evaluation\" folder. \n\n\n# Experiment tracking\nMLflow is used to track experiments. 
It is turned off by default, but can be turned on by changing the option [on this line](https://github.com/AstraZeneca/SubTab/blob/070b2ef73fceb0531d2b1d1fc32f7eda4fe5c966/config/runtime.yaml#L2) in \nthe runtime config file, ```./config/runtime.yaml```.\n\n\n# Citing the paper\n\n```\n@article{ucar2021subtab,\n  title={SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning},\n  author={Ucar, Talip and Hajiramezanali, Ehsan and Edwards, Lindsay},\n  journal={Advances in Neural Information Processing Systems},\n  volume={34},\n  year={2021}\n}\n```\n\n# Citing this repo\nIf you use the SubTab framework in your own work, please cite it using the following:\n\n```\n@Misc{talip_ucar_2021_SubTab,\n  author =   {Talip Ucar},\n  title =    {{SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning}},\n  howpublished = {\\url{https://github.com/AstraZeneca/SubTab}},\n  month        = June,\n  year = {since 2021}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrazeneca%2Fsubtab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastrazeneca%2Fsubtab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrazeneca%2Fsubtab/lists"}