https://github.com/hase3b/class-imbalance-classification-performance-analysis
This repository contains the code, documentation, and datasets for a comprehensive exploration of machine learning techniques to address class imbalance. The project investigates the impact of various methods, including ADASYN, KMeansSMOTE, and a Deep Learning Generator, on classification performance, while also demonstrating the benefits of pipelining.
- Host: GitHub
- URL: https://github.com/hase3b/class-imbalance-classification-performance-analysis
- Owner: hase3b
- Created: 2024-05-15T16:58:06.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-07-15T16:25:38.000Z (10 months ago)
- Last Synced: 2025-02-09T18:39:27.039Z (3 months ago)
- Topics: adasyn, class-imbalance, classification, cross-validation, data-cleaning, data-pipeline, data-preprocessing, deep-learning, eda, feature-selection, hyperparameter-tuning, kmeanssmote, oversampling, pipeline, synthetic-data
- Language: Jupyter Notebook
- Homepage:
- Size: 6.88 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Class-Imbalance-Classification-Performance-Analysis
This repository contains the code, project report, and datasets for a comprehensive exploration of machine learning techniques to address class imbalance. The project investigates the impact of various methods, including ADASYN, KMeansSMOTE, and a Deep Learning Generator, on classification performance across three real-world datasets. Implemented in Python, with Jupyter Notebooks used for experimentation and analysis, the repository offers detailed documentation and insights, along with the datasets needed to reproduce the results.

### Project Objectives:
• Explore and compare the performance of ADASYN, KMeansSMOTE, and a Deep Learning Generator in addressing class imbalance.
• Evaluate the impact of these techniques on the selected classification algorithms.
• Provide comprehensive and detailed understanding of the results in a compiled project report.
### Class Imbalance Solution Techniques Selected:
• ADASYN (Adaptive Synthetic Sampling): Generates synthetic samples for the minority class, focusing on minority samples that are harder to learn (those with many majority-class neighbors).
• KMeansSMOTE: Combines K-Means clustering with SMOTE to generate synthetic samples within clusters.
• Deep Learning Generator: Utilizes advanced deep learning models (GANs, Transformers, Variational Autoencoders, Autoregressive models) from MostlyAI to create high-quality synthetic data.
### The Following Five Classification Algorithms Are Selected For This Project:
• KNN, Logistic Regression, Gaussian Naive Bayes, Linear SVM, & Decision Trees

### The Following Evaluation Metrics Are Selected For This Project:
• Precision (Both Classes), Recall (Both Classes), F1-Score (Both Classes), Accuracy, & AUC

### Pipeline Implementation:
To efficiently test multiple class imbalance techniques on multiple datasets, a master function and supporting functions streamline the workflow. This pipeline automates fetching data, data cleaning, EDA, data transformation, feature selection, data splitting and cross-validation, model learning, hyperparameter tuning, and model evaluation. The pipeline is first run to establish a baseline for each dataset; it is then re-run after applying each class imbalance technique to each dataset, and the results of each approach are compiled alongside the baseline for comparison. The modular structure allows easy extension and modification, facilitating comprehensive experimentation and analysis. For future work, the pipeline can be extended with more advanced data imputation techniques, a more robust CV approach, and GridSearch-based hyperparameter tuning.

## The Following Three Datasets Are Selected For This Project:
### Dataset 1
Title: Lending Club Loan Data Analysis
Domain: Finance
Number of Rows: 9578
Number of Columns: 14
Feature Type: Mixed
Class Balance: 84:16 (Percentage)
Source: https://www.kaggle.com/datasets/urstrulyvikas/lending-club-loan-data-analysis/data
### Dataset 2
Title: Hotel Reservation Classification Dataset
Domain: Hospitality
Number of Rows: 36275
Number of Columns: 19
Feature Type: Mixed
Class Balance: 67:33 (Percentage)
Source: https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
### Dataset 3
Title: Churn Modelling
Domain: Finance
Number of Rows: 10000
Number of Columns: 14
Feature Type: Mixed
Class Balance: 79:21 (Percentage)
Source: https://www.kaggle.com/datasets/shrutimechlearn/churn-modelling