https://github.com/harshnevse/rmse-based-assessment-of-model-complexity-and-overfitting-in-polynomial-regression-frameworks
RMSE-Based Assessment of Model Complexity and Overfitting in Polynomial Regression Frameworks
jupyter-notebook machine-learning polynomial-regression python regression-analysis regression-models
Last synced: 4 days ago
RMSE-Based Assessment of Model Complexity and Overfitting in Polynomial Regression Frameworks
- Host: GitHub
- URL: https://github.com/harshnevse/rmse-based-assessment-of-model-complexity-and-overfitting-in-polynomial-regression-frameworks
- Owner: HarshNevse
- Created: 2024-12-15T09:17:46.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-12-15T09:29:23.000Z (6 months ago)
- Last Synced: 2025-03-17T20:39:45.418Z (3 months ago)
- Topics: jupyter-notebook, machine-learning, polynomial-regression, python, regression-analysis, regression-models
- Language: Jupyter Notebook
- Homepage:
- Size: 103 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# RMSE-Based Assessment of Model Complexity and Overfitting in Polynomial Regression Frameworks
## Overview
This project demonstrates overfitting in machine learning using polynomial regression on the **Advertising dataset**. By progressively increasing the degree of polynomial features, we analyze the bias-variance trade-off and how model complexity impacts training and testing performance.
## Key Features
- Implementation of **Polynomial Regression** with varying degrees of complexity.
- Quantitative analysis using **Root Mean Squared Error (RMSE)** as the evaluation metric.
- Visualization of training and testing RMSE trends to highlight overfitting behavior.
## Dataset
The dataset used is `Advertising.csv`, which contains the following features:
- **TV**: Advertising spend on TV (in thousands of dollars).
- **Radio**: Advertising spend on radio (in thousands of dollars).
- **Newspaper**: Advertising spend on newspapers (in thousands of dollars).
Label:
- **Sales**: Product sales (in thousands of units).
## Requirements
This project uses Python and the following libraries:
- `numpy`
- `pandas`
- `matplotlib`
- `seaborn`
- `scikit-learn`
## Code Description
### Data Preprocessing
- The dataset is loaded and separated into predictors (`X`) and target variable (`y`).
- Polynomial features are generated for increasing degrees (1 to 9).
### Training and Testing
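Before the split, each degree expands the three predictors into a wider feature matrix. The following is a minimal sketch of that growth, assuming random placeholder values in place of `Advertising.csv`; the names `X_demo`, `expander`, and `feature_counts` are illustrative and do not come from the notebook.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Placeholder predictors: 5 rows, 3 columns standing in for TV, Radio, Newspaper.
# These are random values, not rows from Advertising.csv.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(5, 3))

# Record how many columns the expansion produces for each degree.
feature_counts = {}
for degree in range(1, 10):
    expander = PolynomialFeatures(degree=degree, include_bias=False)
    feature_counts[degree] = expander.fit_transform(X_demo).shape[1]

for degree, n_cols in feature_counts.items():
    print(f"degree {degree}: {n_cols} columns")
```

With three predictors, degree 2 already yields 9 columns and degree 9 yields 219, so the higher-degree models have many more coefficients to fit on the same training rows.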
- Data is split into training (70%) and testing (30%) sets.
- A `LinearRegression` model is fit on the training data for each polynomial degree.
### Error Metrics
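RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as Sales. As a small illustration, assuming toy numbers chosen for easy arithmetic rather than values from the dataset, the manual formula and the `mean_squared_error` route give the same result:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values chosen for easy arithmetic; not taken from the Advertising data.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's MSE, as done in the notebook loop.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Every prediction here is off by exactly one unit, so both values come out to 1.0.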
- The RMSE for both training and testing sets is calculated for each degree of polynomial features.
- Results are stored and visualized to show the impact of increasing model complexity.
### Visualization
1. **Scatter Plots**: Relationship between each predictor (TV, Newspaper, Radio) and the target variable (Sales).
2. **Regression Plot**: Comparison between predicted and actual sales for testing data.
3. **RMSE Trends**: Line plot showing train and test RMSE as a function of polynomial degree.
## Results
The RMSE trends demonstrate:
- **Low-degree models (e.g., degree 1-3)**: Both train and test RMSE decrease as the degree increases, indicating that the lowest-degree models underfit.
- **Optimal degree (e.g., degree 4)**: Balanced performance on training and testing data.
- **High-degree models (e.g., degree 5-9)**: Training RMSE decreases further, but testing RMSE rises sharply, showing overfitting.
## Key Code Snippets
### RMSE Calculation Loop
```python
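import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# X (predictors) and y (Sales) are assumed to be loaded from Advertising.csv already.
test_RMSE = []
train_RMSE = []
for i in range(1, 10):
    # Expand the predictors into polynomial features of degree i
    ob2 = PolynomialFeatures(degree=i, include_bias=False)
    polymod2 = ob2.fit_transform(X)
    # Same 70/30 split for every degree (fixed random_state)
    X_train, X_test, y_train, y_test = train_test_split(polymod2, y, test_size=0.3, random_state=101)
    ob3 = LinearRegression()
    mod3 = ob3.fit(X_train, y_train)
    ypredtest = mod3.predict(X_test)
    ypredtrain = mod3.predict(X_train)
    # RMSE on both splits for this degree
    train_rmse = np.sqrt(mean_squared_error(y_train, ypredtrain))
    test_rmse = np.sqrt(mean_squared_error(y_test, ypredtest))
    train_RMSE.append(train_rmse)
    test_RMSE.append(test_rmse)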
```
### RMSE Trend Visualization
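The plotting snippet in this section reads from a DataFrame named `err`, whose construction is not shown in this README. One plausible way to build it from the two RMSE lists is sketched here. The column names are taken from the plotting code, while the helper name `build_error_table` and the sample numbers are placeholders for illustration only.

```python
import pandas as pd

def build_error_table(train_rmse, test_rmse):
    """Collect per-degree RMSE lists into one table (hypothetical helper)."""
    if len(train_rmse) != len(test_rmse):
        raise ValueError("train and test RMSE lists must have equal length")
    return pd.DataFrame({
        "degree": list(range(1, len(train_rmse) + 1)),  # degrees start at 1
        "train_RMSE": train_rmse,
        "test_RMSE": test_rmse,
    })

# Placeholder numbers, not results from the notebook.
err = build_error_table([3.0, 2.0, 1.0], [3.5, 2.5, 3.0])
print(err)
```

In the notebook itself, the `train_RMSE` and `test_RMSE` lists from the calculation loop would be passed in place of the placeholder values.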
```python
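import matplotlib.pyplot as plt

# `err` is assumed to be a DataFrame with 'degree', 'train_RMSE' and 'test_RMSE' columns.
# Only the first six degrees (1 to 6) are plotted.
plt.figure(figsize=(10, 6))
plt.plot(err['degree'][0:6], err['train_RMSE'][0:6])
plt.plot(err['degree'][0:6], err['test_RMSE'][0:6])
plt.legend(labels=['train_RMSE', 'test_RMSE'])
plt.xlabel('Degree')
plt.ylabel('RMSE')
plt.ylim(0, 4.6)
plt.show()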
```
## Conclusion
This project demonstrates how increasing the polynomial degree can lead to overfitting. The RMSE trends illustrate the trade-off between bias and variance and show how model complexity affects generalization.
## Future Work
- Introduce regularization techniques (e.g., Ridge or Lasso Regression) to mitigate overfitting.
- Use cross-validation to evaluate model performance more robustly.
- Explore feature selection and scaling to improve regression performance.
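As a rough starting point for the first two items, the sketch below combines feature scaling, Ridge regularization, and 5-fold cross-validation in a single scikit-learn pipeline. It is not part of the notebook: the data is synthetic, and `alpha=1.0`, the degrees tried, and all variable names are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: 3 predictors and a noisy target. Not Advertising.csv.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 0.1, 200)

cv_rmse = {}
for degree in (1, 2, 5):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),  # scale expanded features so the penalty treats them evenly
        Ridge(alpha=1.0),  # alpha=1.0 is an arbitrary starting value
    )
    # scikit-learn maximizes scores, so RMSE is reported as a negative number.
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[degree] = -scores.mean()

for degree, rmse in cv_rmse.items():
    print(f"degree {degree}: mean CV RMSE = {rmse:.3f}")
```

Keeping the scaler inside the pipeline means it is refit on the training portion of each fold, so the held-out fold does not influence the scaling.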