https://github.com/pngo1997/multiple-regression-and-feature-selection-analysis
Explores multiple linear regression, feature selection, Ridge & Lasso regression, and Stochastic Gradient Descent (SGD) regression.
feature-selection lasso-regression multiple-linear-regression python ridge-regression stochastic-gradient-descent
- Host: GitHub
- URL: https://github.com/pngo1997/multiple-regression-and-feature-selection-analysis
- Owner: pngo1997
- Created: 2025-01-30T19:11:50.000Z
- Default Branch: main
- Last Pushed: 2025-01-30T19:15:46.000Z
- Last Synced: 2025-01-30T20:25:34.482Z
- Topics: feature-selection, lasso-regression, multiple-linear-regression, python, ridge-regression, stochastic-gradient-descent
- Language: Jupyter Notebook
- Homepage:
- Size: 1.26 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# 🏗️ Multiple Regression and Feature Selection Analysis
## 📜 Overview
This project explores **multiple linear regression, feature selection, Ridge & Lasso regression, and Stochastic Gradient Descent (SGD) regression**. The dataset is split into **80% training and 20% testing**, and different regression techniques are applied to predict a target variable. The analysis includes **cross-validation, feature selection, and model optimization** to improve regression performance.
## 🎯 Problem Explanation
The project aims to:
1. **Perform standard multiple linear regression** and evaluate its effectiveness.
2. **Select the most informative features** using `SelectPercentile`.
3. **Optimize Ridge & Lasso Regression** by tuning the **alpha parameter**.
4. **Train a Stochastic Gradient Descent (SGD) Regressor** with grid search for hyperparameter selection.
5. **Compare models using RMSE, MAE, and cross-validation performance**.
## 🛠️ Implementation Details
### **1. Data Preprocessing**
- **Missing values handled** using mean imputation.
- **Basic statistics computed** (mean, std dev, min, max).
- **Target variable extracted**, and dataset is split (80% train, 20% test).
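The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy `DataFrame`, its column names (`x1`, `x2`, `x3`, `target`), and the random seed are all assumptions. The sketch also fits the imputer on the training split only, which is one common way to avoid leaking test statistics; the notebook may instead impute before splitting.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the project's dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "target"])
df.iloc[::10, 0] = np.nan  # inject missing values into x1 for illustration

# Basic statistics per column (mean, std dev, min, max).
summary = df.describe().loc[["mean", "std", "min", "max"]]

# Extract the target, then split 80% train / 20% test.
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Mean imputation: learn column means on the training split, apply to both.
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)
```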
### **2. Multiple Linear Regression**
- Standard **multiple linear regression** applied.
- **RMSE computed** on training data.
- **Regression coefficients plotted** to visualize feature importance.
- **10-fold cross-validation RMSE** compared to training RMSE.
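As a rough sketch of this step, the snippet below fits a linear model, computes the training RMSE, and compares it with a 10-fold cross-validated RMSE. The synthetic data from `make_regression` is an assumption standing in for the project's dataset, and the cross-validated figure here is the mean of the per-fold RMSE values, which may differ from how the notebook aggregates folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption, not the project's dataset).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit ordinary least squares and compute RMSE on the training data.
model = LinearRegression().fit(X_train, y_train)
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))

# 10-fold CV: scikit-learn returns negated MSE, so flip the sign first.
scores = cross_val_score(
    model, X_train, y_train, cv=10, scoring="neg_mean_squared_error"
)
cv_rmse = np.sqrt(-scores).mean()

print(f"train RMSE: {train_rmse:.2f}, 10-fold CV RMSE: {cv_rmse:.2f}")
# model.coef_ holds one coefficient per feature and can be passed to a bar plot.
```

A cross-validated RMSE noticeably above the training RMSE would suggest the model is fitting noise in the training split.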
### **3. Feature Selection with Regression**
- `SelectPercentile` with `f_regression` used to identify top features.
- **K-fold cross-validation (k=5)** determines the optimal percentage of features.
- **Mean Absolute Error (MAE) plotted** vs. feature selection percentage.
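One way to implement this search is sketched below, under a few assumptions: synthetic data, a percentile grid of 10% to 100% in steps of 10, and a linear regression as the downstream model. Wrapping `SelectPercentile` in a pipeline means the feature scores are recomputed inside each fold, so the selection step never sees the held-out data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 20 features, only 5 of them informative (assumption).
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

percentiles = list(range(10, 101, 10))  # hypothetical grid: 10%, 20%, ..., 100%
mae_by_percentile = []
for p in percentiles:
    pipe = make_pipeline(
        SelectPercentile(f_regression, percentile=p), LinearRegression()
    )
    # 5-fold CV; the scorer returns negated MAE, so flip the sign.
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_absolute_error")
    mae_by_percentile.append(-scores.mean())

best_percentile = percentiles[int(np.argmin(mae_by_percentile))]
print(f"best percentile: {best_percentile}%")
```

Plotting `mae_by_percentile` against `percentiles` gives the MAE curve described above.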
### **4. Ridge & Lasso Regression with Alpha Optimization**
- A function is implemented to:
  - Accept **data, target variable, parameter range (alpha), and model type**.
  - Perform **K-fold cross-validation (k=5)**.
  - **Plot error values** vs. alpha for Ridge & Lasso regression.
  - Train on the best alpha and **evaluate on test data**.
- **Bias-variance trade-off analyzed**.
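A function along these lines might look like the sketch below. The name `tune_alpha`, its signature, the alpha grid, and the use of RMSE as the error metric are all assumptions made for illustration; the notebook's own function may differ. Plotting is left out so the example runs without a display, but the returned error array is what would be plotted against alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split


def tune_alpha(X, y, alphas, model_cls, n_splits=5):
    """Hypothetical helper: return (best_alpha, cv_rmse per alpha)."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    cv_rmse = []
    for alpha in alphas:
        scores = cross_val_score(
            model_cls(alpha=alpha), X, y, cv=cv, scoring="neg_mean_squared_error"
        )
        cv_rmse.append(np.sqrt(-scores.mean()))
    cv_rmse = np.array(cv_rmse)
    return alphas[int(np.argmin(cv_rmse))], cv_rmse


# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

alphas = np.logspace(-3, 2, 12)  # hypothetical alpha range
results = {}
for name, model_cls in [("ridge", Ridge), ("lasso", Lasso)]:
    best_alpha, cv_rmse = tune_alpha(X_train, y_train, alphas, model_cls)
    # Refit on the full training split with the best alpha, then score on test.
    final = model_cls(alpha=best_alpha).fit(X_train, y_train)
    test_rmse = np.sqrt(mean_squared_error(y_test, final.predict(X_test)))
    results[name] = {"alpha": best_alpha, "cv_rmse": cv_rmse, "test_rmse": test_rmse}
```

Small alphas leave the model close to ordinary least squares (low bias, higher variance), while large alphas shrink the coefficients more strongly (higher bias, lower variance), which is the trade-off the error curve makes visible.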
### **5. Stochastic Gradient Descent Regression**
- **Features standardized** using `StandardScaler`.
- **GridSearchCV applied** to compare penalty types (`l2`, `l1`) across alpha values from 0.0001 to 10.
- **Elastic Net model selection** performed to find the best `l1_ratio`.
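The two searches could be set up roughly as shown here. Several details are assumptions rather than facts from the notebook: the synthetic data, the specific alpha values inside the 0.0001 to 10 range, `cv=5`, the `l1_ratio` grid, and the choice to reuse the best alpha from the first search in the Elastic Net search. Placing `StandardScaler` inside the pipeline means scaling is refit on each training fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDRegressor(max_iter=2000, tol=1e-3, random_state=0)),
])

# Search 1: l2 vs. l1 penalty over a hypothetical alpha grid.
penalty_grid = {
    "sgd__penalty": ["l2", "l1"],
    "sgd__alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10],
}
penalty_search = GridSearchCV(
    pipe, penalty_grid, cv=5, scoring="neg_mean_squared_error"
)
penalty_search.fit(X_train, y_train)

# Search 2: Elastic Net penalty, varying l1_ratio at the alpha found above.
enet_grid = {
    "sgd__penalty": ["elasticnet"],
    "sgd__alpha": [penalty_search.best_params_["sgd__alpha"]],
    "sgd__l1_ratio": np.linspace(0.0, 1.0, 6).tolist(),
}
enet_search = GridSearchCV(pipe, enet_grid, cv=5, scoring="neg_mean_squared_error")
enet_search.fit(X_train, y_train)

# GridSearchCV refits the best pipeline on the full training split by default.
test_rmse = np.sqrt(mean_squared_error(y_test, enet_search.predict(X_test)))
print(penalty_search.best_params_, enet_search.best_params_)
```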
## 🚀 Technologies Used
- **Python** (for regression modeling and evaluation).
- **Pandas & NumPy** (for data preprocessing and statistical computations).
- **Scikit-learn** (for regression models, feature selection, and cross-validation).
- **Matplotlib & Seaborn** (for data visualization).