{"id":25323918,"url":"https://github.com/cosbidev/naim","last_synced_at":"2025-10-29T02:31:01.633Z","repository":{"id":248680704,"uuid":"829315683","full_name":"cosbidev/NAIM","owner":"cosbidev","description":"Official implementation for the paper ``Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets´´","archived":false,"fork":false,"pushed_at":"2025-01-18T09:14:57.000Z","size":222,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-18T10:24:15.409Z","etag":null,"topics":["attention-mechanism","missing-data","tabular-data","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cosbidev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-16T07:29:07.000Z","updated_at":"2025-01-18T09:14:59.000Z","dependencies_parsed_at":"2024-07-16T13:25:45.721Z","dependency_job_id":"f0c9a586-2231-4fac-93c9-04546a16566c","html_url":"https://github.com/cosbidev/NAIM","commit_stats":null,"previous_names":["cosbidev/naim"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosbidev%2FNAIM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosbidev%2FNAIM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosbidev%2FNAIM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosbidev%2FNAIM/manifest
s","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cosbidev","download_url":"https://codeload.github.com/cosbidev/NAIM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238759674,"owners_count":19525873,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention-mechanism","missing-data","tabular-data","transformers"],"created_at":"2025-02-14T00:54:55.056Z","updated_at":"2025-10-29T02:31:01.594Z","avatar_url":"https://github.com/cosbidev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NAIM \n[![arXiv](https://img.shields.io/badge/arXiv-2407.11540-b31b1b.svg)](https://arxiv.org/abs/2407.11540)\n---\n\n1. [Installation](#installation)\n2. [Usage](#usage)\n   1. [Reproducing the experiments](#repr_exp)\n   2. [Train \u0026 Test on your dataset](#new_exp)\n      1. [Experiment declaration](#exp_decl)\n      2. [Dataset preparation](#data_prep)\n      3. [Experiment configuration](#exp_conf)\n3. [Citation](#citation)\n\n---\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"./images/NAIM.svg\" width=\"800\" style=\"display: block; margin: 0 auto\"\u003e\n\u003c/p\u003e\n\nThis document describes the implementation of *``Not Another Imputation Method´´* ([NAIM](https://arxiv.org/abs/2407.11540)) in Pytorch. 
\nNAIM is an architecture specifically designed for the analysis of tabular data, with a focus on handling missing values \nwithout the need for any imputation strategy.\n\nBy leveraging lookup tables, tailored for each type of feature, NAIM assigns a non-trainable vector to missing values, \nthus obtaining an embedded representation for every missing feature scenario.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"./images/embedding.svg\" width=\"500\" style=\"display: block; margin: 0 auto\"\u003e\n\u003c/p\u003e\n\nFollowing this, our modified self-attention mechanism ignores all contributions from missing values in the \nattention matrix.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"./images/mask.svg\" width=\"600\" style=\"display: block; margin: 0 auto\"\u003e\n\u003c/p\u003e\n\nFinally, this approach to handling missing values paves the way for a novel method of data augmentation, inspired by \nmethods used in classical image data augmentation. \nAt every epoch, samples are randomly masked (where possible) to prevent co-adaptations among features and to enhance the \nmodel's generalization capability.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"./images/augmentation.svg\" width=\"400\" style=\"display: block; margin: 0 auto\"\u003e\n\u003c/p\u003e\n\n--- \n\n# Installation \u003cdiv id='installation'/\u003e\nThe code was developed with Python 3.9.\nTo install the required packages, it is sufficient to run the following command:\n```bash\npip install -r requirements.txt\n```\nThen install a version of PyTorch compatible with your device. We used `torch==1.13.0`.\n\n---\n\n# Usage \u003cdiv id='usage'/\u003e\nThe execution of the code relies heavily on Facebook's [Hydra](https://hydra.cc/) library. 
\nSpecifically, through a set of configuration files that define every aspect of the experiment, it is possible to \nconduct the desired experiment without modifying the code. \nThese configuration files have a hierarchical structure through which they are composed into a single configuration \nfile that serves as input to the program. \nMore specifically, the [`main.py`](./main.py) file loads the [`config.yaml`](./confs/config.yaml) file, from which \nthe configuration file tree begins.\n\n## Reproducing the experiments \u003cdiv id='repr_exp'/\u003e\n\nAll the datasets used in the paper are available in the [UCI Datasets](https://archive.ics.uci.edu/) repository and are listed below:\n\n\u003cdiv style=\"display: block; margin: 0 auto; text-align: center\"\u003e\n\n| Dataset        | UCI Link   | Preprocessing |\n|----------------|------------|---------------|\n| ADULT          | [Link](https://archive.ics.uci.edu/ml/datasets/adult)| Sets joined   |\n| BankMarketing  | [Link](https://archive.ics.uci.edu/ml/datasets/bank+marketing)| -             |\n| OnlineShoppers | [Link](https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset)| -             |\n| SeismicBumps   | [Link](https://archive.ics.uci.edu/ml/datasets/seismic-bumps)| -             |\n| Spambase       | [Link](https://archive.ics.uci.edu/ml/datasets/spambase)| -             |\n\n\u003c/div\u003e\n\nTo simplify reproducing the experiments, the [`datasets_download.py`](./datasets_download.py) script can be used to \ndownload the files. 
\nThen, thanks to the [multirun](https://hydra.cc/docs/tutorials/basic/running_your_app/multi-run/) functionality of Hydra, the NAIM experiments carried out in \nthe paper can be reproduced by running the following commands:\n\n```bash\npython datasets_download.py\n\npython main.py -m experiment=classification_with_missing_generation experiment/databases@db=adult,bankmarketing,onlineshoppers,seismicbumps,spambase\n```\n\nThese commands, assuming the initial configuration files have not been modified, reproduce the \nexperiments presented in the paper using the NAIM model. \nThese experiments generate different percentages of missing values in the training and testing sets, also considering \nall possible combinations of missing percentages in the different sets. \nSpecifically, the percentages 0%, 5%, 10%, 25%, 50%, and 75% were used, as indicated by `missing_percentages=[0.0, 0.05, 0.1, 0.25, 0.5, 0.75]` in the [`classification_with_missing_generation.yaml`](./confs/experiment/classification_with_missing_generation.yaml) configuration file.\n\nFor each experiment, this code produces a folder named `\u003cdataset-name\u003e_\u003cmodel-name\u003e_\u003cimputation-strategy\u003e_with_missing_generation` which contains everything generated by the code. \nIn particular, the following folders and files are present:\n\n1. `cross_validation`: this folder contains a folder for each training fold, indicated as a composition of test and validation folds `\u003ctest-fold\u003e_\u003cval-fold\u003e`, reporting the information on the train, validation and test sets in 3 separate csv files.\n2. `preprocessing`: this folder contains all the preprocessing information divided into 3 main folders:\n   1. `numerical_preprocessing`: in this folder, for each percentage of missing values considered, there is a csv file for each fold reporting the information on the preprocessing params of numerical features.\n   2. 
`categorical_preprocessing`: in this folder, for each percentage of missing values considered, there is a csv file for each fold reporting the information on the preprocessing params of categorical features.\n   3. `imputer`: in this folder, for each percentage of missing values considered, there are csv files for each fold with information on the imputation strategy applied to handle missing values and a pkl file containing the imputer fitted on the training data of the fold.\n3. `saved_models`: this folder contains, for each percentage of missing values considered, a folder with the model's name that includes, for each fold, a csv file with the model's parameters and a pkl or pth file containing the trained model.\n4. `predictions`: this folder contains, for each percentage of missing values considered, a folder that reports the predictions obtained from the training and validation sets and separately those of the test set. More specifically, there are two files for each fold reporting the predictions for the train and validation sets, called `\u003ctest-fold\u003e_\u003cval-fold\u003e_train` and `\u003ctest-fold\u003e_\u003cval-fold\u003e_val` respectively. Moreover, there are additional folders, one for each percentage of missing values considered, that report for each fold the predictions made on the test set (`\u003ctest-fold\u003e_\u003cval-fold\u003e_test`).\n5. `results`: this folder reports, for each percentage of missing values considered, the performance on the train, validation, and test sets separately. Specifically, for the training and validation sets and then, for each percentage of missing values considered, also for the test set, two folders named `balanced` and `unbalanced` containing the performance of the various sets are reported. These are presented in 3 separate files with increasing levels of averaging:\n   1. `all_test_performance.csv`: this file presents the set's performance evaluated for each fold and each class.\n   2. 
`classes_average_performance.csv`: this file, computing the average performance of the folds, contains the performance for each class.\n   3. `set_average_performance.csv`: this file, calculating the average performance of the folds and the classes, contains the average performance of the set.\n6. `config.yaml`: this file contains the configuration file used as input for the experiment.\n7. `\u003cexperiment-name\u003e.log`: this is the log file of the experiment.\n\n## Train \u0026 Test on your dataset \u003cdiv id='new_exp'/\u003e\n\n### Experiment declaration  \u003cdiv id='exp_decl'/\u003e\n\nAs mentioned above, the experiment configuration file is created at the time of code execution starting from the \n[`config.yaml`](./confs/config.yaml) file, in which the configuration file for the experiment to be performed is declared, \nalong with the paths from which to load data (`data_path`) and where to save the outputs (`output_path`).\n\n```yaml\ndata_path: ./datasets # Path where the datasets are stored\noutput_path: ./outputs # Path where the outputs will be saved\n\ndefaults: # DO NOT CHANGE\n  - _self_ # DO NOT CHANGE\n  - experiment: classification # Experiment to perform, classification or classification_with_missing_generation\n```\n\nThe possible options for the `experiment` parameter are [`classification`](./confs/experiment/classification.yaml) and [`classification_with_missing_generation`](./confs/experiment/classification_with_missing_generation.yaml).\n\n### Dataset preparation \u003cdiv id='data_prep'/\u003e\n\nTo prepare a dataset for the analysis with this code, it is sufficient to prepare a configuration file, specific for the \ndataset, similar to those already provided in the folder [`./confs/experiment/databases`](./confs/experiment/databases).\nThe path to the data must be specified in the `path` parameter in the dataset's configuration file.\nThanks to the [interpolation](https://hydra.cc/docs/patterns/specializing_config/) functionality of 
Hydra, the path can be composed using the `${data_path}` interpolation key, which refers to the `data_path` parameter of the [`config.yaml`](./confs/config.yaml) file.\nOnce the dataset configuration file is prepared, place it in the same folder [`./confs/experiment/databases`](./confs/experiment/databases) and, in the file [`classification.yaml`](./confs/experiment/classification.yaml), \nset the `databases@db` key to the name of the created configuration file. \nIn particular, the dataset configuration file must be structured as follows:\n\n```yaml \n_target_: CMC_utils.datasets.ClassificationDataset # DO NOT CHANGE\n_convert_: all # DO NOT CHANGE\n\nname: \u003cdataset-name\u003e # Name of the dataset\ndb_type: tabular # DO NOT CHANGE\n\nclasses: [\"\u003cclass-1-name\u003e\", ..., \"\u003cclass-n-name\u003e\"] # List of the classes\nlabel_type: multiclass # multiclass or binary (TODO: explain binary)\n\ntask: classification # DO NOT CHANGE\n\npath: ${data_path}/\u003crelative-path-to-file\u003e # Relative path to the file, KEEP ${data_path}/... 
to compose the path considering the data_path parameter defined in the 'config.yaml' file.\n\ncolumns: # Dictionary containing feature names as keys and their types as values # DO NOT REMOVE\n  \u003cID-name\u003e:        id             # Name of the ID column if present, DO NOT CHANGE THE VALUE, NAME CORRECTLY THE TARGET VARIABLE\n  \u003cfeature-1-name\u003e: \u003cfeature-type\u003e # int, float or category \n  \u003cfeature-2-name\u003e: \u003cfeature-type\u003e # int, float or category\n  \u003cfeature-3-name\u003e: \u003cfeature-type\u003e # int, float or category\n  # Other features to be inserted\n  \u003clabel-name\u003e:     target         # Name of the target column containing the classes, DO NOT CHANGE THE VALUE, NAME CORRECTLY THE TARGET VARIABLE\n    \n# here any pd.read_csv or pd.read_excel input parameter to correctly load the data can be added (e.g., na_values, header, index_col)\npandas_load_kwargs:\n  na_values: [ \"?\" ]\n  header: 0\n  index_col: 0\n\ndataset_class: # DO NOT CHANGE\n  _target_: CMC_utils.datasets.SupervisedTabularDatasetTorch # DO NOT CHANGE\n  _convert_: all # DO NOT CHANGE\n```\n\nIn the `columns` definition, the `id` and `target` feature types can be used to define the ID and class columns respectively.\n\n### Experiment configuration \u003cdiv id='exp_conf'/\u003e\n\nHere we present the [`classification.yaml`](./confs/experiment/classification.yaml) configuration file, which defines the specifics for conducting a \nclassification pipeline using all the available data.\nThe [`classification_with_missing_generation.yaml`](./confs/experiment/classification_with_missing_generation.yaml) configuration file is also available; it defines the specifics for the experiment of the paper,\nwhere predefined percentages of missing values are randomly generated in the train and test sets.\nTo execute the experiment of the paper, you need to set the `experiment` parameter in the [`config.yaml`](./confs/config.yaml) file to 
`classification_with_missing_generation`.\nYou can also customize the missing percentages values to be tested by modifying the `missing_percentages` parameter in the [`classification_with_missing_generation.yaml`](./confs/experiment/classification_with_missing_generation.yaml) file.\n\nThe [`classification.yaml`](./confs/experiment/classification.yaml) file is the primary file where all the parameters of the experiment are defined. \nIt begins with some general information, such as the name of the experiment, the pipeline to be executed, the seed for \nrandomness control, training verbosity, and the percentages of missing values to be tested. \n\n```yaml\nexperiment_name: ${db.name}_${model.name}_${preprocessing.imputer.method}_with_missing_generation # DO NOT CHANGE\npipeline: missing # DO NOT CHANGE\n\nseed: 42 # Seed for randomness control\nverbose: 1 # 0 or 1, verbosity of the training\n\ncontinue_experiment: False # True or False, if the experiment should be continued from where it was interrupted\n```\n\n\u003e NOTE: In case an experiment should be interrupted, voluntarily or not, it is possible to resume it from where it was interrupted by setting the `continue_experiment` parameter to `True`.\n\nThen, all other necessary configuration files for the different parts of the experiment are declared.\nIt is possible to define:\n- the dataset to analyse (`databases@db`); \n- the cross-validation strategies to use for the test (`cross_validation@test_cv`) and \nvalidation (`cross_validation@val_cv`) sets separately;\n- the preprocessing steps to be performed for the numerical (`preprocessing/numerical`), categorical (`preprocessing/categorical`) and missing features (`preprocessing/imputer`) ;\n- the model to be used (`model`).\n- the metric to be used in the early stopping process (`metric@train.set_metrics.\u003cmetric-name\u003e`) and in the performance evaluation (`metric@performance_metrics.\u003cmetric-name\u003e`).\n\n```yaml\ndefaults: # DO NOT CHANGE\n  - 
_self_ # DO NOT CHANGE\n  - paths@: default # DO NOT CHANGE\n  - paths: experiment_paths # DO NOT CHANGE\n    \n  - databases@db: \u003cdataset-configuration-file-name\u003e # Name of the configuration file of the dataset\n\n  - cross_validation@test_cv: stratifiedkfold # Cross-validation strategy for the test set\n  - cross_validation@val_cv: holdout          # Cross-validation strategy for the validation set\n\n  - preprocessing/numerical: normalize # normalize or standardize\n  - preprocessing/categorical: categorical_encode # categorical_encode or one_hot_encode\n  - preprocessing/imputer: no_imputation # simple or knn or iterative or no_imputation\n  \n  - model_type_params@dl_params: dl_params # DO NOT CHANGE\n  - model_type_params@ml_params: ml_params # DO NOT CHANGE\n    \n  - model: naim # Name of the model to use\n\n  - model_type_params@train.dl_params: dl_params # DO NOT CHANGE\n    \n  - initializer@train.initializer: xavier_uniform # DO NOT CHANGE\n  - loss@train.loss.CE: cross_entropy             # DO NOT CHANGE\n  - regularizer@train.regularizer.l1: l1          # DO NOT CHANGE\n  - regularizer@train.regularizer.l2: l2          # DO NOT CHANGE\n  - optimizer@train.optimizer: adam               # DO NOT CHANGE\n  - train_utils@train.manager: train_manager      # DO NOT CHANGE\n    \n  - metric@train.set_metrics.auc: auc # Metric to use for the early stopping\n    \n  - metric@performance_metrics.auc: auc # Metric to use for the performance evaluation\n  - metric@performance_metrics.accuracy: accuracy # Metric to use for the performance evaluation\n  - metric@performance_metrics.recall: recall # Metric to use for the performance evaluation\n  - metric@performance_metrics.precision: precision # Metric to use for the performance evaluation\n  - metric@performance_metrics.f1_score: f1_score # Metric to use for the performance evaluation\n```\n\nThe possible options for these parts are the files contained in the folders listed in the table below.  
\n\n| Params                   | Keys                                                                                | Options                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |\n|--------------------------|-------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| Dataset                  | `databases@db`                                                                      | [adult](./confs/experiment/databases/adult.yaml), [bankmarketing](./confs/experiment/databases/bankmarketing.yaml), [onlineshoppers](./confs/experiment/databases/onlineshoppers.yaml), [seismicbumps](./confs/experiment/databases/seismicbumps.yaml), [spambase](./confs/experiment/databases/spambase.yaml)                                                                                                                                                                                 
                                                                                                        |\n| Cross Validation         | `cross_validation@test_cv` `cross_validation@val_cv`                                | [bootstrap](./confs/experiment/cross_validation/bootstrap.yaml), [holdout](./confs/experiment/cross_validation/holdout.yaml), [kfold](./confs/experiment/cross_validation/kfold.yaml), [leave_one_out](./confs/experiment/cross_validation/leave_one_out.yaml), [predefined](./confs/experiment/cross_validation/predefined.yaml), [stratifiedkfold](./confs/experiment/cross_validation/stratifiedkfold.yaml)                                                                                                                                                                                         |\n| Numerical Preprocessing  | `preprocessing/numerical`                                                           | [normalize](./confs/experiment/preprocessing/numerical/normalize.yaml), [standardize](./confs/experiment/preprocessing/numerical/standardize.yaml)                                                                                                                                                                                                                                                                                                                                                                                                                                                     |\n| Categorical Preprocessing | `preprocessing/categorical`                                                         | [categorical_encode](./confs/experiment/preprocessing/categorical/categorical_encode.yaml), [one_hot_encode](./confs/experiment/preprocessing/categorical/one_hot_encode.yaml)                                                                                                                                                                                                         
                                                                                                                                                                                                                |\n| Imputation Strategy      | `preprocessing/imputer`                                                             | [simple](./confs/experiment/preprocessing/imputer/simple.yaml), [knn](./confs/experiment/preprocessing/imputer/knn.yaml), [iterative](./confs/experiment/preprocessing/imputer/iterative.yaml), [no_imputation](./confs/experiment/preprocessing/imputer/no_imputation.yaml)                                                                                                                                                                                                                                                                                                                           |\n| Model                    | `model`                                                                             | [naim](./confs/experiment/model/naim.yaml), [adaboost](./confs/experiment/model/adaboost.yaml), [dt](./confs/experiment/model/dt.yaml), [fttransformer](./confs/experiment/model/fttransformer.yaml), [histgradientboostingtree](./confs/experiment/model/histgradientboostingtree.yaml), [mlp_sklearn](./confs/experiment/model/mlp_sklearn.yaml), [rf](./confs/experiment/model/rf.yaml), [svm](./confs/experiment/model/svm.yaml), [tabnet](./confs/experiment/model/tabnet.yaml), [tabtransformer](./confs/experiment/model/tabtransformer.yaml), [xgboost](./confs/experiment/model/xgboost.yaml) |\n| Metrics                  | `metric@train.set_metrics.\u003cmetric-name\u003e` `metric@performance_metrics.\u003cmetric-name\u003e` | [auc](./confs/experiment/metric/auc.yaml), [accuracy](./confs/experiment/metric/accuracy.yaml), [recall](./confs/experiment/metric/recall.yaml), [precision](./confs/experiment/metric/precision.yaml), [f1_score](./confs/experiment/metric/f1_score.yaml) 
                                                                                                                                                                                                                                                                                                                                           |\n\nTo modify some of the hyperparameters of the models, it is possible to modify the [`ml_params`](./confs/experiment/model_type_params/ml_params.yaml) and [`dl_params`](./confs/experiment/model_type_params/dl_params.yaml) files.\nFor the ML models it is possible do define the number of estimators (`n_estimators`), whereas for the DL models it is possible to define the number of epochs (`max_epochs`), the warm-up number of epochs (`min_epochs`),\nthe batch_size (`batch_size`), the early stopping's (`early_stopping_patience`) and the scheduler's (`scheduler_patience`) patience and their tolerance for performance improvement (`performance_tolerance`), the device to use for training (`device`).\nIt is also possible to define the learning rates to be tested (`learning_rates`), but to be compatible with some of the competitor available in the models list, it is necessary to define also the initial learning rate (`init_learning_rate`), and the final learning rate (`end_learning_rate`). 
\n\n#### [./confs/experiment/model_type_params/ml_params.yaml](./confs/experiment/model_type_params/ml_params.yaml)\n```yaml\nn_estimators: 100 # Number of estimators for the ML models\n```\n\n#### [./confs/experiment/model_type_params/dl_params.yaml](./confs/experiment/model_type_params/dl_params.yaml)\n```yaml\nmax_epochs: 1500 # Maximum number of epochs\n\nmin_epochs: 50 # Warm-up number of epochs \n\nbatch_size: 32 # Batch size\n\ninit_learning_rate: 1e-3 # Initial learning rate\n\nend_learning_rate: 1e-8 # Final learning rate\n\nlearning_rates: [1e-3, 1e-4, 1e-5, 1e-6, 1e-7] # Learning rates for the scheduler\n\nearly_stopping_patience: 50 # Patience for the early stopping\n\nscheduler_patience: 25 # Patience for the scheduler \n\nperformance_tolerance: 1e-3 # Tolerance for the performance improvement\n\nverbose: ${verbose} # DO NOT CHANGE\n\nverbose_batch: 0 # 0 or 1 or ${verbose}, verbosity of the training for the batch\n\ndevice: cuda # cpu or cuda, device to use for training\n```\n\n\nFor a single experiment, this code produces a folder named `\u003cdataset-name\u003e_\u003cmodel-name\u003e_\u003cimputation-strategy\u003e` which contains everything generated by the code. \nIn particular, the following folders and files are present:\n\n1. `cross_validation`: this folder contains a folder for each training fold, indicated as a composition of test and validation folds `\u003ctest-fold\u003e_\u003cval-fold\u003e`, reporting the information on the train, validation and test sets in 3 separate csv files.\n2. `preprocessing`: this folder contains all the preprocessing information divided into 3 main folders:\n   1. `numerical_preprocessing`: in this folder there is a csv file for each fold reporting the information about the preprocessing params of numerical features.\n   2. `categorical_preprocessing`: in this folder there is a csv file for each fold reporting the information about the preprocessing params of categorical features.\n   3. 
`imputer`: in this folder there is a csv file for each fold reporting the information on the imputation strategy applied to handle missing values and a pkl file containing the imputer fitted on the training data.\n3. `saved_models`: this folder contains a folder, named after the model, that for each fold includes a csv file with the model's parameters and a pkl or pth file containing the trained model.\n4. `predictions`: this folder contains a folder that reports the predictions obtained from the train, validation and test sets separately. \nMore specifically, for each fold there are 3 files reporting the predictions for the train, validation and test sets, called `\u003ctest-fold\u003e_\u003cval-fold\u003e_train`, `\u003ctest-fold\u003e_\u003cval-fold\u003e_val` and `\u003ctest-fold\u003e_\u003cval-fold\u003e_test` respectively.\n5. `results`: this folder reports the performance on the train, validation, and test sets separately. Specifically, for each set there are two folders, named `balanced` and `unbalanced`, containing the performance of that set. The performance is presented in 3 separate files with increasing levels of averaging:\n   1. `all_test_performance.csv`: this file presents the set's performance evaluated for each fold and each class.\n   2. `classes_average_performance.csv`: this file, computing the average performance of the folds, contains the performance for each class.\n   3. `set_average_performance.csv`: this file, calculating the average performance of the folds and the classes, contains the average performance of the respective set.\n6. `config.yaml`: this file contains the configuration file used as input for the experiment.\n7. `\u003cexperiment-name\u003e.log`: this is the log file of the experiment. 
\n\n---\n# Contact \u003cdiv id='contact'/\u003e\n\nFor any questions, please contact [camillomaria.caruso@unicampus.it](mailto:camillomaria.caruso@unicampus.it) and [valerio.guarrasi@unicampus.it](mailto:valerio.guarrasi@unicampus.it).\n\n---\n\n# Citation \u003cdiv id='citation'/\u003e\n\n```bibtex\n@misc{caruso2024imputationmethodtransformerbasedmodel,\n      title={Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets}, \n      author={Camillo Maria Caruso and Paolo Soda and Valerio Guarrasi},\n      year={2024},\n      eprint={2407.11540},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https://arxiv.org/abs/2407.11540}, \n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcosbidev%2Fnaim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcosbidev%2Fnaim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcosbidev%2Fnaim/lists"}