https://github.com/sjain2580/simple-linear-regression-model

This project demonstrates a simple, yet robust, multiple linear regression model built with Python and scikit-learn to predict median house values in California.

joblib linear-regression matplotlib matplotlib-pyplot numpy python scikit-learn


Host: GitHub
URL: https://github.com/sjain2580/simple-linear-regression-model
Owner: sjain2580
Created: 2025-09-12T10:17:05.000Z (2 months ago)
Default Branch: main
Last Pushed: 2025-09-12T11:02:22.000Z (2 months ago)
Last Synced: 2025-09-12T12:26:13.345Z (2 months ago)
Topics: joblib, linear-regression, matplotlib, matplotlib-pyplot, numpy, python, scikit-learn
Language: Jupyter Notebook
Homepage:
Size: 414 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Simple Linear Regression Model - California Housing Price Prediction with Linear Regression
## Overview

This project demonstrates a simple, yet robust, multiple linear regression model built with Python and scikit-learn to predict median house values in California.

## Features

- Multiple Features: The model uses three input features (Median Income, House Age, and Average Rooms), which typically gives more accurate predictions than a single-feature model.

- Data Preprocessing: It includes a machine learning pipeline to handle data scaling, a crucial step for many models.

- Model Persistence: The trained model is automatically saved to disk (linear_regression_model.joblib), allowing for easy reuse without retraining.

- Comprehensive Evaluation: The script calculates and prints key metrics (Mean Squared Error and R-squared) to evaluate the model's performance.

- Data Visualization: It generates and saves multiple plots (housing_prices_plot.png and housing_prices_residual_plot.png) for visual analysis.

- Prediction Functionality: The script includes a practical example of how to use the trained model to make a prediction on new, unseen data.

## Technologies used

- Python: The core programming language for the project.

- scikit-learn: A powerful machine learning library used for building the model, data splitting, and evaluation.

- NumPy: A fundamental library for numerical operations and handling the dataset arrays.

- Matplotlib: Used for creating the data visualizations, including the scatter and residual plots.

- joblib: A library for saving and loading the trained machine learning model.

## Model used (Architecture)

The core of this project is a LinearRegression model, which is a fundamental algorithm in supervised machine learning. The model is implemented within a scikit-learn pipeline. This pipeline's architecture consists of two main stages:

1. Data Preprocessing: The StandardScaler scales each feature to have a mean of 0 and a standard deviation of 1. For ordinary least squares, scaling does not change the predictions, but it puts the coefficients on a comparable scale and keeps the pipeline ready for regularized or gradient-based models, which are sensitive to feature scale.

2. Regression Model: The LinearRegression estimator fits a linear model to the preprocessed data, finding the best-fit line (or hyperplane in this case) that minimizes the sum of squared errors between the predicted and actual values.
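The two stages above can be sketched as follows. This is a minimal illustration, not the project's actual script: the data is synthetic (a stand-in for the three housing features), and the step names `"scaler"` and `"regressor"` as well as the sample values are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for median income, house age, and average rooms
# (assumption: the real project uses the California housing data instead).
rng = np.random.default_rng(0)
X = rng.normal(loc=[4.0, 30.0, 5.0], scale=[2.0, 12.0, 1.5], size=(200, 3))
y = 0.4 * X[:, 0] + 0.01 * X[:, 1] - 0.05 * X[:, 2] + rng.normal(0, 0.1, 200)

# Stage 1 scales the features, stage 2 fits the linear model.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipeline.fit(X, y)

# Predict for one new, unseen sample (hypothetical values).
new_sample = np.array([[8.0, 25.0, 6.0]])
prediction = pipeline.predict(new_sample)
print("Coefficients on scaled features:", pipeline.named_steps["regressor"].coef_)
print("Prediction for new sample:", prediction[0])
```

Because the scaler lives inside the pipeline, the same scaling is applied automatically whenever `predict()` is called on new data.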

## Data Processing

The project performs the following data processing steps:

- Data Splitting: The dataset is divided into a training set (80%) and a testing set (20%) to ensure the model's performance is evaluated on unseen data.

- Feature Scaling: A StandardScaler is applied to the input features so that each has zero mean and unit variance. Putting the features on a common scale makes the fitted coefficients directly comparable.
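A minimal sketch of these two steps, assuming synthetic data and a fixed `random_state` of 42 (both are assumptions for illustration, not taken from the project's script). The scaler is fitted on the training split only and then reused on the test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three features (assumption).
rng = np.random.default_rng(42)
X = rng.normal(loc=[4.0, 30.0, 5.0], scale=[2.0, 12.0, 1.5], size=(1000, 3))
y = rng.normal(size=1000)

# 80% training data, 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training split only,
# then apply the same transformation to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training split alone keeps information from the test set out of the preprocessing step.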

## Data Analysis

This project performs data analysis through both quantitative metrics and visual inspection:

- Quantitative Metrics: The model's performance is evaluated using two standard metrics:

  - Mean Squared Error (MSE): Measures the average squared difference between the predicted and actual values. A lower MSE indicates a better fit.

  - R-squared (R²): Represents the proportion of the variance in the dependent variable that is explained by the independent variables. A score closer to 1.0 indicates a better fit.
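To make the two definitions concrete, here is a small NumPy sketch that computes both metrics by hand. The helper names `mse_score` and `r2` and the sample values are hypothetical. In scikit-learn, the corresponding functions are `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score`.

```python
import numpy as np

def mse_score(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical actual and predicted values.
y_true = [3.0, 1.5, 2.0, 4.5]
y_pred = [2.5, 1.5, 2.5, 4.0]

mse = mse_score(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print(f"MSE: {mse:.4f}")       # MSE: 0.1875
print(f"R^2: {r2_value:.4f}")  # R^2: 0.8571
```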

## Model Training

The training process is designed to be efficient and reproducible:

- Training: The fit() method is called on the machine learning pipeline, which first scales the training data and then trains the LinearRegression model.

- Persistence: Once trained, the entire pipeline is saved to a .joblib file. Persisting the model this way allows it to be loaded directly for predictions without retraining. The script checks whether this file exists and either loads the saved model or trains a new one.
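The load-or-train logic described above might look roughly like this. It is a sketch under assumptions: the helper `load_or_train` is hypothetical, the data is synthetic, and the model is written to a temporary directory so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def load_or_train(model_path, X, y):
    """Return (pipeline, was_loaded): load a saved pipeline or train and save one."""
    if os.path.exists(model_path):
        return joblib.load(model_path), True
    pipeline = make_pipeline(StandardScaler(), LinearRegression())
    pipeline.fit(X, y)
    joblib.dump(pipeline, model_path)
    return pipeline, False

# Synthetic training data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "linear_regression_model.joblib")
    first, loaded_first = load_or_train(path, X, y)    # trains and saves
    second, loaded_second = load_or_train(path, X, y)  # loads from disk
    predictions_match = np.allclose(first.predict(X), second.predict(X))

print(loaded_first, loaded_second, predictions_match)
```

Saving the whole pipeline, rather than only the regressor, keeps the fitted scaler and the model together, so a loaded model applies the same preprocessing as during training.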

## Prerequisites

- Python 3.11+
- Required packages (install via `pip`): the libraries listed under Technologies used (scikit-learn, NumPy, Matplotlib, and joblib)

## How to Run the Project

1. Clone this repository to your local machine:

```bash
git clone https://github.com/sjain2580/simple-linear-regression-model.git
cd simple-linear-regression-model
```

2. Create and activate a virtual environment (optional but recommended): `python -m venv venv`

- On Windows:

```bash
.\venv\Scripts\activate
```

- On macOS/Linux:

```bash
source venv/bin/activate
```

3. Install the required libraries:

```bash
pip install -r requirements.txt
```

4. Run the script: execute the main Python script from your terminal.

```bash
python simple_linear_regression.py
```

## Visualization

- Prediction Plot: Compares the model's predicted house values against the actual values to show how well the linear relationship is captured.
![Housing Prices Plot](./housing_prices_plot.png)

- Residual Plot: Plots the difference between the actual and predicted values. A good residual plot shows a random scatter of points around the zero line, which is consistent with the model's assumptions and suggests it is not systematically under- or over-predicting.
![Residual Plot](./housing_prices_residual_plot.png)
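A minimal sketch of how two such plots could be produced with Matplotlib. The actual and predicted values here are synthetic, the styling is an assumption, and the files are written to a temporary directory rather than the repository root.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic actual and predicted values (assumption).
rng = np.random.default_rng(1)
y_test = rng.uniform(0.5, 5.0, size=200)
y_pred = y_test + rng.normal(0, 0.5, size=200)
residuals = y_test - y_pred

out_dir = tempfile.mkdtemp()
pred_path = os.path.join(out_dir, "housing_prices_plot.png")
resid_path = os.path.join(out_dir, "housing_prices_residual_plot.png")

# Prediction plot: points near the dashed diagonal are well predicted.
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, alpha=0.5)
limits = [y_test.min(), y_test.max()]
ax.plot(limits, limits, color="red", linestyle="--")
ax.set_xlabel("Actual median house value")
ax.set_ylabel("Predicted median house value")
fig.savefig(pred_path)
plt.close(fig)

# Residual plot: a random scatter around zero suggests no systematic bias.
fig, ax = plt.subplots()
ax.scatter(y_pred, residuals, alpha=0.5)
ax.axhline(0, color="red", linestyle="--")
ax.set_xlabel("Predicted median house value")
ax.set_ylabel("Residual (actual - predicted)")
fig.savefig(resid_path)
plt.close(fig)
```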

## Contributors

Feel free to fork this repository, open issues, or submit pull requests to improve the project. Suggestions for model enhancements or additional visualizations are welcome!

## Connect with Me

Feel free to reach out if you have any questions or just want to connect!
**[![LinkedIn](https://img.shields.io/badge/-LinkedIn-0A66C2?style=flat-square&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/sjain04/)**
**[![GitHub](https://img.shields.io/badge/-GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/sjain2580)**
**[![Email](https://img.shields.io/badge/-Email-D14836?style=flat-square&logo=gmail&logoColor=white)](mailto:sjain040395@gmail.com)**

---
