https://github.com/colintr/classificationforregression
[EGC 2022] Constructing Variables Using Classifiers as an Aid to Regression: An Empirical Assessment
- Host: GitHub
- URL: https://github.com/colintr/classificationforregression
- Owner: ColinTr
- License: mit
- Created: 2021-04-26T08:49:32.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2024-08-20T09:27:35.000Z (almost 2 years ago)
- Last Synced: 2025-02-23T21:25:39.913Z (over 1 year ago)
- Topics: classification, deep-learning, machine-learning, random-forest, regression
- Language: Jupyter Notebook
- Homepage: https://arxiv.org/abs/2112.03703
- Size: 12.2 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Classification For Regression
Code used to generate the results of the EGC 2022 conference paper *Construction de variables à l'aide de classifieurs comme aide à la régression* (Constructing Variables Using Classifiers as an Aid to Regression).
[License: MIT](https://opensource.org/licenses/MIT)
## Directory Structure
.
├── .gitignore
├── README.md <- This file
├── EGC_2022_TROISEMAINE_Colin_LEMAIRE_Vincent <- The article with appendices
├── requirements.txt <- The required packages
├── data <- The datasets and the data generated by the scripts
│   ├── cleaned <- The cleaned datasets ready for use
│   ├── extracted_features <- The datasets with the extracted features generated by feature_extraction.py
│   ├── figures <- The figures generated by visualisation.py
│   ├── logs <- The logs generated by the scripts
│   ├── metrics <- The metrics generated by compute_metrics.py
│   ├── predictions <- The predictions generated by generate_predictions.py
│   ├── processed <- The processed data generated by data_processing.py
│   └── raw <- The original datasets
│       └── Combined_Cycle_Power_Plant_Dataset <- A sample dataset
├── notebooks <- The Jupyter notebooks
│   ├── Datasets_First_Study.ipynb <- Notebook to check the datasets
│   ├── Hyperparameters study.ipynb <- Notebook used to explore the best hyperparameters
│   └── New dataset choice.ipynb <- Notebook to select difficult datasets to add to our study
├── scripts <- The scripts
│   ├── compute_metrics.py <- Used to compute regression performance on a results folder
│   ├── data_processing.py <- Used to pre-process a dataset
│   ├── feature_extraction.py <- Used to extract the features of a dataset folder
│   ├── generate_predictions.py <- Generate the predictions of a regressor
│   ├── runner.py <- Run all the scripts necessary to generate the final figures for a dataset
│   └── visualisation.py <- Create the figures based on generated metrics
└── src <- The source code
    ├── class_generation <- The discretization methods
    │   ├── BelowThresholdClassGenerator.py <- Data under the thresholds are given a 1, others a 0
    │   ├── CustomClassGenerator.py <- The abstract class to inherit from
    │   └── InsideBinClassGenerator.py <- Thresholds define bins with class numbers
    ├── models <- The models used for classification or regression
    │   ├── BaseModel.py <- The abstract classification model to inherit from
    │   ├── DecisionTreeC.py <- The Decision Tree classifier
    │   ├── GaussianNBC.py <- The Gaussian Naive Bayes classifier
    │   ├── LogisticRegressionC.py <- The Logistic Regression classifier
    │   ├── PyKhiopsC.py <- The PyKhiops classifier
    │   ├── RandomForestC.py <- The Random Forest classifier
    │   └── XGBoostC.py <- The XGBoost classifier
    ├── steps_encoding <- The threshold generation methods
    │   ├── EqualFreqStepsEncoder.py <- Generates thresholds with equal frequency
    │   ├── EqualWidthStepsEncoder.py <- Generates thresholds with equal width
    │   └── StepsEncoder.py <- The abstract class to inherit from
    └── utils <- Various utility methods
        ├── DataProcessingUtils.py <- Methods used to pre-process the datasets
        ├── logging_util.py <- Message logging utility methods
        └── Metrics.py <- Methods to compute all the metrics needed
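To make the two threshold generation methods listed under steps_encoding more concrete, here is a minimal sketch of equal-width and equal-frequency thresholds on a toy target variable. It is an illustration in plain Python, not the code of EqualWidthStepsEncoder.py or EqualFreqStepsEncoder.py. The function names are hypothetical, and `statistics.quantiles` needs Python 3.8 or later.

```python
from statistics import quantiles


def equal_width_thresholds(values, n_bins):
    """Return n_bins - 1 thresholds evenly spaced between min and max."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]


def equal_freq_thresholds(values, n_bins):
    """Return n_bins - 1 thresholds placed at the quantiles of the values."""
    return quantiles(values, n=n_bins, method="inclusive")


# Toy target variable with one large outlier (made-up numbers).
targets = [1.0, 2.0, 3.0, 4.0, 100.0]

print(equal_width_thresholds(targets, 4))  # [25.75, 50.5, 75.25]
print(equal_freq_thresholds(targets, 4))   # [2.0, 3.0, 4.0]
```

With an outlier in the target, equal-width thresholds leave most bins nearly empty, while equal-frequency thresholds follow the bulk of the data.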
## Setting up the Python environment
This project was written using Python 3.7.10 and the libraries listed in requirements.txt.
It is recommended to create a virtual environment with virtualenv so that the exact package versions used in this project are installed. First, install *virtualenv* with pip :
> pip install virtualenv
Then create the virtual environment :
> virtualenv my_python_environment
Finally, activate it using :
> source my_python_environment/bin/activate
At this point, you should see the name of your virtual environment in parentheses on your terminal line.
You can now install the required libraries inside your virtual environment with :
> pip install -r requirements.txt
## Scripts usage example
The following examples show how to use the scripts :
**Note :** These examples are meant to be run from inside the 'scripts' directory.
1) We start by pre-processing a dataset :
> python data_processing.py --dataset_path="../data/cleaned/Combined_Cycle_Power_Plant_Dataset/data.csv"
2) We then extract the features of a pre-processed dataset using a classification algorithm :
> python feature_extraction.py --dataset_folder="../data/processed/Combined_Cycle_Power_Plant_Dataset/10_bins_equal_freq_below_threshold/" --classifier="RandomForest"
3) We can now generate the predictions using a regression model :
> python generate_predictions.py --dataset_folder="../data/extracted_features/Combined_Cycle_Power_Plant_Dataset/10_bins_equal_freq_below_threshold/RandomForest_classifier/" --regressor="RandomForest"
4) Then, we compute the metrics on the predictions:
> python compute_metrics.py --predictions_folder="../data/predictions/Combined_Cycle_Power_Plant_Dataset/10_bins_equal_freq_below_threshold/RandomForest_classifier/RandomForest_regressor/"
5) Finally, we can generate figures. The figures are meant to show how a metric evolves as the number of thresholds varies, so the figures generated here will contain only a single point :
> python visualisation.py --parent_folder="../data/metrics/Combined_Cycle_Power_Plant_Dataset/" --metric="RMSE"
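In summary, a dataset flows through the scripts in this order : data_processing.py (from data/cleaned to data/processed), feature_extraction.py (to data/extracted_features), generate_predictions.py (to data/predictions), compute_metrics.py (to data/metrics) and visualisation.py (to data/figures).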
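The five steps above can also be sketched as a small Python helper that builds the command lines in order. This is an illustration only: the function `pipeline_commands` is hypothetical, the folder naming pattern (`10_bins_equal_freq_below_threshold`, `RandomForest_classifier`, `RandomForest_regressor`) is inferred from the example paths above, and the default values are taken from those paths rather than from the scripts themselves. It also assumes that the optional parameters are passed in the same `--name=value` form as the mandatory ones.

```python
def pipeline_commands(dataset, classifier, regressor, n_bins=10,
                      split_method="equal_freq",
                      output_classes="below_threshold", metric="RMSE"):
    """Build the five command lines, in the order the scripts are run."""
    # Folder name pattern inferred from the example paths (an assumption).
    config = f"{n_bins}_bins_{split_method}_{output_classes}"
    processed = f"../data/processed/{dataset}/{config}/"
    extracted = (f"../data/extracted_features/{dataset}/{config}/"
                 f"{classifier}_classifier/")
    predictions = (f"../data/predictions/{dataset}/{config}/"
                   f"{classifier}_classifier/{regressor}_regressor/")
    return [
        ["python", "data_processing.py",
         f"--dataset_path=../data/cleaned/{dataset}/data.csv",
         f"--n_bins={n_bins}",
         f"--split_method={split_method}",
         f"--output_classes={output_classes}"],
        ["python", "feature_extraction.py",
         f"--dataset_folder={processed}", f"--classifier={classifier}"],
        ["python", "generate_predictions.py",
         f"--dataset_folder={extracted}", f"--regressor={regressor}"],
        ["python", "compute_metrics.py",
         f"--predictions_folder={predictions}"],
        ["python", "visualisation.py",
         f"--parent_folder=../data/metrics/{dataset}/", f"--metric={metric}"],
    ]


commands = pipeline_commands("Combined_Cycle_Power_Plant_Dataset",
                             "RandomForest", "RandomForest")
for argv in commands:
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` from inside the 'scripts' directory to execute the step.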
## Scripts documentation
The scripts and all of their usable parameters are described below :
1) **data_processing.py :**
> python data_processing.py [dataset_path] [options]
The mandatory parameters are :
* dataset_path : The dataset to process
The optional parameters are :
* goal_var_index : The index of the column to use as the goal variable (will try to find a .index file beside the dataset's file if not defined)
* output_path : The folder where the results will be saved (will be generated if not defined)
* split_method : The splitting method to use (Choices : equal_width, equal_freq, kmeans)
* output_classes : The method of class generation (Choices : below_threshold, inside_bin)
* delimiter : Delimiter to use when reading the dataset
* header : Infer the column names or use None if the first line isn't a csv header line (Choices : infer, None)
* decimal : Character to recognize as decimal point
* na_values : Additional string to recognize as NA/NaN
* usecols : The indexes of the columns to keep
* n_bins : The number of bins to create
* k_folds : The number of folds in the k-folds
* log_lvl : Change the log display level (Choices : debug, info, warning)
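As an illustration of the two `output_classes` choices, the sketch below turns one target value into classes given a list of thresholds. It is a minimal reading of the descriptions "data under the thresholds are given a 1, others a 0" and "thresholds define bins with class numbers", not the code of the class generators in src. The function names are hypothetical, and treating a value equal to a threshold as not below it is an assumption.

```python
from bisect import bisect_right


def below_threshold_classes(y, thresholds):
    """One binary class per threshold: 1 if y is under it, else 0."""
    # Assumption: a value exactly equal to a threshold is NOT below it.
    return [1 if y < t else 0 for t in thresholds]


def inside_bin_class(y, thresholds):
    """A single class: the index of the bin (between thresholds) holding y."""
    return bisect_right(sorted(thresholds), y)


thresholds = [2.0, 3.0, 4.0]  # toy thresholds defining 4 bins

print(below_threshold_classes(3.5, thresholds))  # [0, 0, 1]
print(inside_bin_class(3.5, thresholds))         # 2
```

With below_threshold, each threshold yields its own binary classification problem; with inside_bin, there is a single multi-class problem with one class per bin.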
2) **feature_extraction.py :**
> python feature_extraction.py [dataset_folder] [options]
The mandatory parameters are :
* dataset_folder : The folder where the k-fold datasets are stored
* classifier : The classifier model to use (Choices : RandomForest, LogisticRegression, XGBoost, GaussianNB, Khiops). **Any sklearn classifier that implements model.predict_proba() is also supported.** In that case, use the full module name (ex : sklearn.naive_bayes.GaussianNB). For version 0.0 of sklearn, the full list of supported classifiers seems to be : [AdaBoostClassifier, BaggingClassifier, BernoulliNB, CalibratedClassifierCV, DecisionTree, ExtraTreeClassifier, ExtraTreesClassifier, GaussianNB, GradientBoostingClassifier, Khiops, KNeighborsClassifier, LabelPropagation, LinearDiscriminantAnalysis, LogisticRegression, RadiusNeighborsClassifier, RandomForest, XGBoost]
The optional parameters are :
* output_path : The folder where the results will be saved (will be generated if not defined)
* class_cols : The indexes of the classes columns
* n_jobs : The number of cores to use
* log_lvl : Change the log display level (Choices : debug, info, warning)
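Resolving a classifier from its full module name can be sketched as follows. This is one possible way to do it with the standard `importlib` module and is not necessarily how feature_extraction.py handles the `classifier` parameter; the helper name `load_classifier_class` is hypothetical.

```python
import importlib


def load_classifier_class(full_name):
    """Import a class from a dotted path such as 'package.module.ClassName'."""
    module_name, _, class_name = full_name.rpartition(".")
    if not module_name:
        raise ValueError(
            f"expected a dotted path like 'package.module.Class', got {full_name!r}"
        )
    cls = getattr(importlib.import_module(module_name), class_name)
    # The extracted features rely on class probabilities, so reject
    # anything that does not expose predict_proba().
    if not callable(getattr(cls, "predict_proba", None)):
        raise TypeError(f"{full_name} does not provide predict_proba()")
    return cls


# Example usage (requires scikit-learn to be installed):
# classifier_class = load_classifier_class("sklearn.naive_bayes.GaussianNB")
# model = classifier_class()
```

Checking for `predict_proba` up front gives a clear error before any training starts.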
3) **generate_predictions.py :**
> python generate_predictions.py [dataset_folder] [regressor] [options]
The mandatory parameters are :
* dataset_folder : The folder where the test and train k-fold datasets are stored
* regressor : The regression model to use (Choices : RandomForest, LinearRegression, XGBoost, GaussianNB, Khiops)
The options are :
* grid_search : Automatically optimize the hyperparameters for the given dataset using a grid search (Choices : True, False)
* tuning_size : The percentage of the training set to reserve for hyper-parameters tuning in the grid search. Only relevant if --grid_search is set to True
* extracted_only : Use only the extracted features to train the regressor
* use_hyperparam_file : Use the hyperparameters in the hyperparameters.json file that is in the same folder of the dataset (Choices : True, False)
* output_path : The folder where the results will be saved (will be generated if not defined)
* n_estimators : The number of trees in the forest of RandomForest or the number of gradient boosted trees for XGBoost
* max_depth : The maximum depth of the trees in RandomForest, XGBoost or DecisionTree
* max_features : number of features to consider when looking for the best split in RandomForest or DecisionTree
* learning_rate : Boosting learning rate of XGBoost
* n_jobs : The number of cores to use
* log_lvl : Change the log display level (Choices : debug, info, warning)
4) **compute_metrics.py :**
> python compute_metrics.py [results_folder] [options]
The mandatory parameters are :
* results_folder : The folder where the results of the script *generate_predictions.py* are stored
The options are :
* output_path : The folder where the results will be saved (will be generated if not defined)
* log_lvl : Change the log display level (Choices : debug, info, warning)
5) **visualisation.py and visualisation_fused.py:**
> python visualisation.py [parent_folder] [options]
The fused version plots the train and test curves on the same figure, while the non-fused version creates two separate graphs.
The mandatory parameters are :
* parent_folder : The folder where the results of the script *compute_metrics.py* are stored
The options are :
* output_path : The folder where the results will be saved (will be generated if not defined)
* show_variance : Whether the variance should be shown on the graph or not (Choices : True, False)
* metric : The metric to display (Choices : r_squared, adjusted_r_squared, MSE, RMSE, MAE)
* log_lvl : Change the log display level (Choices : debug, info, warning)
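For reference, the five metric choices above can be computed from true and predicted values as in the sketch below. It uses the standard textbook definitions and is not necessarily identical to what src/utils/Metrics.py does; the function name and the `n_features` argument (needed for the adjusted R²) are assumptions.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Compute MSE, RMSE, MAE, R² and adjusted R² with textbook formulas."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    r_squared = 1 - ss_res / ss_tot
    # The adjusted R² penalises the number of explanatory variables.
    adjusted = 1 - (1 - r_squared) * (n - 1) / (n - n_features - 1)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "r_squared": r_squared,
        "adjusted_r_squared": adjusted,
    }


# Toy values (made up for the example).
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0],
                             n_features=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the adjusted R² is only defined when the number of samples exceeds `n_features + 1`.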
6) **runner.py :** Sequentially launches any number of scripts to generate results.
> python runner.py [dataset_name] [goal_index] [classifiers]+
The mandatory parameters are :
* dataset_name : The dataset to use
* classifiers : The classifiers to compare (Choices : RandomForest, LogisticRegression, XGBoost, GaussianNB, Khiops). **Any sklearn classifier that implements model.predict_proba() is also supported.** In that case, use the full module name (ex : sklearn.naive_bayes.GaussianNB)
* regressors : The regression models to use (Choices : RandomForest, LinearRegression, XGBoost, GaussianNB, Khiops)
The options are :
* extract : Run the feature_extraction step or not (Choices : True, False)
* grid_search : Automatically optimize the hyperparameters for the given dataset using a grid search (Choices : True, False)
* baseline : Compute the baseline or not (Choices : True, False)
* extracted_only : Use only the extracted features to train the regressor (Choices : True, False)
* n_jobs : The number of cores to use
* output_classes : The method of class generation (Choices : below_threshold, inside_bin)
* split_method : The splitting method to use (Choices : equal_width, equal_freq, kmeans)
* n_estimators : The number of trees in the forest of RandomForest or the number of gradient boosted trees for XGBoost
* max_depth : The maximum depth of the trees in RandomForest, XGBoost or DecisionTree
* max_features : number of features to consider when looking for the best split in RandomForest or DecisionTree
* learning_rate : Boosting learning rate of XGBoost
* preprocess : Do the pre-processing step or not
* log_lvl : Change the log display level (Choices : debug, info, warning)
## Citation
If you found this work useful, please use the following citation:
```
@article{tr2022construction,
title = {Construction de variables à l'aide de classifieurs comme aide à la régression : une évaluation empirique},
author = {Colin Troisemaine and Vincent Lemaire},
journal = {Revue des Nouvelles Technologies de l'Information},
volume = {Extraction et Gestion des Connaissances, RNTI-E-38},
year = {2022},
pages = {217--224}
}
```
## License
Copyright (c) 2021 Orange.
This code is released under the MIT license. See the LICENSE file for more information.