https://github.com/harshnevse/rmse-based-assessment-of-model-complexity-and-overfitting-in-polynomial-regression-frameworks

RMSE-Based Assessment of Model Complexity and Overfitting in Polynomial Regression Frameworks
https://github.com/harshnevse/rmse-based-assessment-of-model-complexity-and-overfitting-in-polynomial-regression-frameworks

jupyter-notebook machine-learning polynomial-regression python regression-analysis regression-models

Last synced: 4 days ago
JSON representation

RMSE-Based Assessment of Model Complexity and Overfitting in Polynomial Regression Frameworks

Host: GitHub
URL: https://github.com/harshnevse/rmse-based-assessment-of-model-complexity-and-overfitting-in-polynomial-regression-frameworks
Owner: HarshNevse
Created: 2024-12-15T09:17:46.000Z (6 months ago)
Default Branch: main
Last Pushed: 2024-12-15T09:29:23.000Z (6 months ago)
Last Synced: 2025-03-17T20:39:45.418Z (3 months ago)
Topics: jupyter-notebook, machine-learning, polynomial-regression, python, regression-analysis, regression-models
Language: Jupyter Notebook
Homepage:
Size: 103 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        # RMSE-Based Assessment of Model Complexity and Overfitting in Polynomial Regression Frameworks

## Overview

This project demonstrates overfitting in machine learning using polynomial regression on the **Advertising dataset**. By progressively increasing the degree of polynomial features, we analyze the bias-variance trade-off and how model complexity impacts training and testing performance.

## Key Features

- Implementation of **Polynomial Regression** with varying degrees of complexity.

- Quantitative analysis using **Root Mean Squared Error (RMSE)** as the evaluation metric.

- Visualization of training and testing RMSE trends to highlight overfitting behavior.

## Dataset

 ![{27E7E6FB-13E4-4259-8C5F-709683564C08}](https://github.com/user-attachments/assets/7aec8517-63f6-4aeb-9915-a7ede2efeced)

The dataset used is `Advertising.csv`, which contains the following features:

- **TV**: Advertising spend on TV (in thousands of dollars).

- **Radio**: Advertising spend on radio (in thousands of dollars).

- **Newspaper**: Advertising spend on newspapers (in thousands of dollars).

Label:

- **Sales**: Product sales (in thousands of units).

## Requirements

This project uses Python and the following libraries:

- `numpy`

- `pandas`

- `matplotlib`

- `seaborn`

- `scikit-learn`

## Code Description

### Data Preprocessing

- The dataset is loaded and separated into predictors (`X`) and target variable (`y`).

- Polynomial features are generated for increasing degrees (1 to 9).

### Training and Testing

- Data is split into training (70%) and testing (30%) sets.

- A `LinearRegression` model is fit on the training data for each polynomial degree.

### Error Metrics

- The RMSE for both training and testing sets is calculated for each degree of polynomial features.

- Results are stored and visualized to show the impact of increasing model complexity.

### Visualization

1. **Scatter Plots**: Relationship between each predictor (TV, Newspaper, Radio) and the target variable (Sales).

   ![Untitled](https://github.com/user-attachments/assets/3172463e-2ce5-4a82-b6ef-8171204e4612)

3. **Regression Plot**: Comparison between predicted and actual sales for testing data.

   ![Untitled](https://github.com/user-attachments/assets/a27731f8-dbd5-4c8d-8b18-ce5f060dd26c)

5. **RMSE Trends**: Line plot showing train and test RMSE as a function of polynomial degree.

   ![Untitled](https://github.com/user-attachments/assets/8f295f1a-5059-4c18-9efb-92ea89e2dc2b)

## Results

The RMSE trends demonstrate:

- **Low-degree models (e.g., degree 1-3)**: Both train and test RMSE decrease, indicating underfitting.

- **Optimal degree (e.g., degree 4)**: Balanced performance on training and testing data.

- **High-degree models (e.g., degree 5-9)**: Training RMSE decreases further, but testing RMSE rises sharply, showing overfitting.

## Key Code Snippets

### RMSE Calculation Loop

```python

test_RMSE = []

train_RMSE = []

for i in range(1,10):

    

    ob2 = PolynomialFeatures(degree=i, include_bias=False)

    

    polymod2 = ob2.fit_transform(X)

    

    X_train, X_test, y_train, y_test = train_test_split(polymod2, y, test_size=0.3, random_state=101)

    

    ob3 = LinearRegression()

    mod3 = ob3.fit(X_train, y_train)

    

    ypredtest = mod3.predict(X_test)

    ypredtrain = mod3.predict(X_train)

    

    train_rmse = np.sqrt(mean_squared_error(y_train,ypredtrain))

    test_rmse = np.sqrt(mean_squared_error(y_test,ypredtest))

    

    train_RMSE.append(train_rmse)

    test_RMSE.append(test_rmse)

```

### RMSE Trend Visualization

```python

plt.figure(figsize=(10,6))

plt.plot(err['degree'][0:6],err['train_RMSE'][0:6])

plt.plot(err['degree'][0:6],err['test_RMSE'][0:6])

plt.legend(labels=['train_RMSE','test_RMSE'])

plt.xlabel('Dergree')

plt.ylabel('RMSE')

plt.ylim(0,4.6)

```

## Conclusion

This project successfully demonstrates how increasing polynomial degrees can lead to overfitting. By analyzing RMSE trends, we highlight the trade-off between bias and variance, providing valuable insights into model complexity and its impact on generalization.

## Future Work

- Introduce regularization techniques (e.g., Ridge or Lasso Regression) to mitigate overfitting.

- Use cross-validation to evaluate model performance more robustly.

- Explore feature selection and scaling to improve regression performance.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/harshnevse/rmse-based-assessment-of-model-complexity-and-overfitting-in-polynomial-regression-frameworks

Awesome Lists containing this project

README