{"id":22794562,"url":"https://github.com/colintr/classificationforregression","last_synced_at":"2026-03-02T13:11:41.113Z","repository":{"id":42073695,"uuid":"361680519","full_name":"ColinTr/ClassificationForRegression","owner":"ColinTr","description":"[EGC 2022] Constructing Variables Using Classifiers as an Aid to Regression: An Empirical Assessment","archived":false,"fork":false,"pushed_at":"2024-08-20T09:27:35.000Z","size":12776,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-23T21:25:39.913Z","etag":null,"topics":["classification","deep-learning","machine-learning","random-forest","regression"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2112.03703","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ColinTr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-26T08:49:32.000Z","updated_at":"2024-08-20T09:27:40.000Z","dependencies_parsed_at":"2022-08-12T04:10:15.571Z","dependency_job_id":null,"html_url":"https://github.com/ColinTr/ClassificationForRegression","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ColinTr/ClassificationForRegression","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ColinTr%2FClassificationForRegression","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ColinTr%2FClassificationForRegression/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ColinTr%2FClassificationForRegression/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/ColinTr%2FClassificationForRegression/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ColinTr","download_url":"https://codeload.github.com/ColinTr/ClassificationForRegression/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ColinTr%2FClassificationForRegression/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30003724,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-02T12:19:43.414Z","status":"ssl_error","status_checked_at":"2026-03-02T12:19:02.215Z","response_time":60,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","deep-learning","machine-learning","random-forest","regression"],"created_at":"2024-12-12T04:09:16.660Z","updated_at":"2026-03-02T13:11:41.088Z","avatar_url":"https://github.com/ColinTr.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n  Classification For Regression\n\u003c/h1\u003e\n  \n\u003cp align=\"center\"\u003e\n  Code used to generate the results of the EGC 2022 conference paper \u003ca href=\"https://arxiv.org/abs/2112.03703\"\u003eConstruction de variables a l'aide de classifieurs comme aide a la regression\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cdiv 
align=\"center\"\u003e\n \n  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\u003c/div\u003e\n\n\n## 📂 Directory Structure\n    .\n    ├── .gitignore\n    ├── README.md                                   \u003c- This file\n    ├── EGC_2022_TROISEMAINE_Colin_LEMAIRE_Vincent  \u003c- The article with appendices\n    ├── requirements.txt                            \u003c- The required packages\n    ├── data                                        \u003c- The datasets and the data generated by the scripts\n    │   ├── cleaned                                 \u003c- The cleaned datasets ready for use\n    │   ├── extracted_features                      \u003c- The datasets with the extracted features generated by feature_extraction.py\n    │   ├── figures                                 \u003c- The figures generated by visualisation.py\n    │   ├── logs                                    \u003c- The logs generated by the scripts\n    │   ├── metrics                                 \u003c- The metrics generated by compute_metrics.py\n    │   ├── predictions                             \u003c- The predictions generated by generate_predictions.py\n    │   ├── processed                               \u003c- The processed data generated by data_processing.py\n    │   └── raw                                     \u003c- The original datasets\n    │       └── Combined_Cycle_Power_Plant_Dataset  \u003c- A sample dataset\n    ├── notebooks                                   \u003c- The jupyter notebooks\n    │   ├── Datasets_First_Study.ipynb              \u003c- Notebook to check the datasets\n    │   ├── Hyperparameters study.ipynb             \u003c- Notebook used to explore the best hyperparameters\n    │   └── New dataset choice.ipynb                \u003c- Notebook to select difficult datasets to add to our study\n    ├── scripts                                     \u003c- The scripts\n    │   ├── compute_metrics.py 
                     \u003c- Used to compute regression performance on a results folder\n    │   ├── data_processing.py                      \u003c- Used to pre-process a dataset\n    │   ├── feature_extraction.py                   \u003c- Used to extract the features of a dataset folder\n    │   ├── generate_predictions.py                 \u003c- Generate the predictions of a regressor\n    │   ├── runner.py                               \u003c- Run all the scripts necessary to generate the final figures for a dataset\n    │   └── visualisation.py                        \u003c- Create the figures based on generated metrics\n    └── src                                         \u003c- The source code\n        ├── class_generation                        \u003c- The discretization methods\n        │   ├── BelowThresholdClassGenerator.py     \u003c- Data under the thresholds are given a 1, others a 0\n        │   ├── CustomClassGenerator.py             \u003c- The abstract class to inherit from\n        │   └── InsideBinClassGenerator.py          \u003c- Thresholds define bins with classes numbers\n        ├── models                                  \u003c- The models used for classification or regression\n        │   ├── BaseModel.py                        \u003c- The abstract classification model to inherit from\n        │   ├── DecisionTreeC.py                    \u003c- The Decision Tree classifier\n        │   ├── GaussianNBC.py                      \u003c- The Gaussian Naive Bayes classifier\n        │   ├── LogisticRegressionC.py              \u003c- The Logistic Regression classifier\n        │   ├── PyKhiopsC.py                        \u003c- The PyKhiops classifier\n        │   ├── RandomForestC.py                    \u003c- The Random Forest classifier\n        │   └── XGBoostC.py                         \u003c- The XGBoost classifier\n        ├── steps_encoding                          \u003c- The thresholds generation methods\n        │   ├── 
EqualFreqStepsEncoder.py            \u003c- Generates thresholds with equal frequency\n        │   ├── EqualWidthStepsEncoder.py           \u003c- Generates thresholds with equal width\n        │   └── StepsEncoder.py                     \u003c- The abstract class to inherit from\n        └── utils                                   \u003c- Various utility methods\n            ├── DataProcessingUtils.py              \u003c- Methods used to pre-process the datasets\n            ├── logging_util.py                     \u003c- Message logging utility methods\n            └── Metrics.py                          \u003c- Methods to compute all the metrics needed\n\n\n## 🐍 Setting up the Python environment\n\nThis project was written using Python 3.7.10 and the libraries listed in requirements.txt.\n\nIt is recommended to create a virtual environment with virtualenv to install the exact versions of the packages used in this project. You will first need to install *virtualenv* with pip :\n\u003e pip install virtualenv\n\nThen create the virtual environment :\n\u003e virtualenv my_python_environment\n\nFinally, activate it using :\n\u003e source my_python_environment/bin/activate\n\nAt this point, you should see the name of your virtual environment in parentheses on your terminal line.\n\nYou can now install the required libraries inside your virtual environment with :\n\u003e pip install -r requirements.txt\n\n\n## 💻 Scripts usage example\n\nHere are usage examples for the scripts :\n\n**Note :** The following examples are meant to be run from inside the 'scripts' directory.\n\n1) We start with the pre-processing of a dataset :\n\u003e python data_processing.py --dataset_path=\"../data/cleaned/Combined_Cycle_Power_Plant_Dataset/data.csv\"\n\n2) We then extract the features of a pre-processed dataset using a classification algorithm :\n\u003e python feature_extraction.py 
--dataset_folder=\"../data/processed/Combined_Cycle_Power_Plant_Dataset/10_bins_equal_freq_below_threshold/\" --classifier=\"RandomForest\"\n\n3) We can now generate the predictions using a regression model :\n\u003e python generate_predictions.py --dataset_folder=\"../data/extracted_features/Combined_Cycle_Power_Plant_Dataset/10_bins_equal_freq_below_threshold/RandomForest_classifier/\" --regressor=\"RandomForest\"\n\n4) Then, we compute the metrics on the predictions:\n\u003e python compute_metrics.py --predictions_folder=\"../data/predictions/Combined_Cycle_Power_Plant_Dataset/10_bins_equal_freq_below_threshold/RandomForest_classifier/RandomForest_regressor/\"\n\n5) Finally, we can generate figures (note that the figures are meant to represent the evolution of a metric when the number of thresholds varies, so the figures we will generate here will only have a single point)\n\u003e python visualisation.py --parent_folder=\"../data/metrics/Combined_Cycle_Power_Plant_Dataset/\" --metric=\"RMSE\"\n\nFor easier understanding of the flow of the dataset through the scripts, refer to the following diagram :\n\n\u003cdiv style=\"text-align:center\"\u003e\n   \u003cimg src=\"./scripts_diagram.png\" alt=\"scripts_diagram\" width=\"100%\"/\u003e\n\u003c/div\u003e\n\n \n## 📚 Scripts documentation\n\nHere are the scripts and the details about every usable parameter :\n\n1) **data_processing.py :**\n    \u003e python data_processing.py [dataset_path] [options]\n   \n    The mandatory parameters are :\n    * dataset_path : The dataset to process\n   \n    The optional parameters are :\n    * goal_var_index : The index of the column to use as the goal variable (will try to find a .index file beside the dataset's file if not defined)\n    * output_path : The folder where the results will be saved (will be generated if not defined)\n    * split_method : The splitting method to use (Choices : equal_width, equal_freq, kmeans)\n    * output_classes : The method of class generation 
(Choices : below_threshold, inside_bin)\n    * delimiter : Delimiter to use when reading the dataset\n    * header : Infer the column names or use None if the first line isn't a CSV header line (Choices : infer, None)\n    * decimal : Character to recognize as decimal point\n    * na_values : Additional string to recognize as NA/NaN\n    * usecols : The indexes of the columns to keep\n    * n_bins : The number of bins to create\n    * k_folds : The number of folds in the k-folds\n    * log_lvl : Change the log display level (Choices : debug, info, warning)\n\n\n2) **feature_extraction.py :**\n    \u003e python feature_extraction.py [dataset_folder] [options]\n\n    The mandatory parameters are :\n    * dataset_folder : The folder where the k-fold datasets are stored\n    * classifier : The classifier model to use (Choices : RandomForest, LogisticRegression, XGBoost, GaussianNB, Khiops) **Any sklearn classifier that provides model.predict_proba() is also supported.** In that case, use the full module name (e.g. sklearn.naive_bayes.GaussianNB). 
For version 0.0 of sklearn, the full list of supported classifiers seems to be : [AdaBoostClassifier BaggingClassifier BernoulliNB CalibratedClassifierCV DecisionTree ExtraTreeClassifier ExtraTreesClassifier GaussianNB GradientBoostingClassifier Khiops KNeighborsClassifier LabelPropagation LinearDiscriminantAnalysis LogisticRegression RadiusNeighborsClassifier RandomForest XGBoost]\n\n    The optional parameters are :\n    * output_path : The folder where the results will be saved (will be generated if not defined)\n    * class_cols : The indexes of the class columns\n    * n_jobs : The number of cores to use\n    * log_lvl : Change the log display level (Choices : debug, info, warning)\n   \n\n3) **generate_predictions.py :**\n    \u003e python generate_predictions.py [dataset_folder] [regressor] [options]\n\n    The mandatory parameters are :\n    * dataset_folder : The folder where the test and train k-fold datasets are stored\n    * regressor : The regression model to use (Choices : RandomForest, LinearRegression, XGBoost, GaussianNB, Khiops)\n\n    The options are :\n    * grid_search : Automatically optimize the hyperparameters for the given dataset using a grid search (Choices : True, False)\n    * tuning_size : The percentage of the training set to reserve for hyperparameter tuning in the grid search. 
Only relevant if --grid_search is set to True\n    * extracted_only : Use only the extracted features to train the regressor\n    * use_hyperparam_file : Use the hyperparameters in the hyperparameters.json file that is in the same folder as the dataset (Choices : True, False)\n    * output_path : The folder where the results will be saved (will be generated if not defined)\n    * n_estimators : The number of trees in the forest of RandomForest or the number of gradient boosted trees for XGBoost\n    * max_depth : The maximum depth of the trees in RandomForest, XGBoost or DecisionTree\n    * max_features : The number of features to consider when looking for the best split in RandomForest or DecisionTree\n    * learning_rate : Boosting learning rate of XGBoost\n    * n_jobs : The number of cores to use\n    * log_lvl : Change the log display level (Choices : debug, info, warning)\n   \n\n4) **compute_metrics.py :**\n   \u003e python compute_metrics.py [results_folder] [options]\n\n    The mandatory parameters are :\n    * results_folder : The folder where the results of the script *generate_predictions.py* are stored\n\n    The options are :\n    * output_path : The folder where the results will be saved (will be generated if not defined)\n    * log_lvl : Change the log display level (Choices : debug, info, warning)\n   \n\n5) **visualisation.py and visualisation_fused.py:**\n   \u003e python visualisation.py [parent_folder] [options]\n   \n    The fused version plots the train and test curves on the same figure, while the non-fused version creates two separate graphs.\n\n    The mandatory parameters are :\n    * parent_folder : The folder where the results of the script *compute_metrics.py* are stored\n\n    The options are :\n    * output_path : The folder where the results will be saved (will be generated if not defined)\n    * show_variance : Whether the variance should be shown on the graph or not (Choices : True, False)\n    * metric : The metric to display (Choices : 
r_squared, adjusted_r_squared, MSE, RMSE, MAE)\n    * log_lvl : Change the log display level (Choices : debug, info, warning)\n   \n\n6) **runner.py :** Sequentially launches any number of scripts to generate results.\n   \u003e python runner.py [dataset_name] [goal_index] [classifiers]+\n\n    The mandatory parameters are :\n    * dataset_name : The dataset to use\n    * classifiers : The classifiers to compare (Choices : RandomForest, LogisticRegression, XGBoost, GaussianNB, Khiops) **Any sklearn classifier that provides model.predict_proba() is also supported.** In that case, use the full module name (e.g. sklearn.naive_bayes.GaussianNB)\n    * regressors : The regression models to use (Choices : RandomForest, LinearRegression, XGBoost, GaussianNB, Khiops)\n\n    The options are :\n    * extract : Run the feature_extraction step or not (Choices : True, False)\n    * grid_search : Automatically optimize the hyperparameters for the given dataset using a grid search (Choices : True, False)\n    * baseline : Compute the baseline or not (Choices : True, False)\n    * extracted_only : Use only the extracted features to train the regressor (Choices : True, False)\n    * n_jobs : The number of cores to use\n    * output_classes : The method of class generation (Choices : below_threshold, inside_bin)\n    * split_method : The splitting method to use (Choices : equal_width, equal_freq, kmeans)\n    * n_estimators : The number of trees in the forest of RandomForest or the number of gradient boosted trees for XGBoost\n    * max_depth : The maximum depth of the trees in RandomForest, XGBoost or DecisionTree\n    * max_features : The number of features to consider when looking for the best split in RandomForest or DecisionTree\n    * learning_rate : Boosting learning rate of XGBoost\n    * preprocess : Do the pre-processing step or not\n    * log_lvl : Change the log display level (Choices : debug, info, warning)\n\n\n## 📜 Citation\nIf you found this work useful, please use the 
following citation:\n```\n@article{tr2022construction,\n   title = {Construction de variables à l'aide de classifieurs comme aide à la régression : une évaluation empirique},\n   author = {Colin Troisemaine and Vincent Lemaire},\n   journal = {Revue des Nouvelles Technologies de l'Information},\n   volume = {Extraction et Gestion des Connaissances, RNTI-E-38},\n   year = {2022},\n   pages = {217--224}\n}\n```\n\n## ⚖️ License\n\nCopyright (c) 2021 Orange.\n\nThis code is released under the MIT license. See the LICENSE file for more information.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcolintr%2Fclassificationforregression","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcolintr%2Fclassificationforregression","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcolintr%2Fclassificationforregression/lists"}