https://github.com/sjain2580/simple-linear-regression-model
This project demonstrates a simple, yet robust, multiple linear regression model built with Python and scikit-learn to predict median house values in California.
- Host: GitHub
- URL: https://github.com/sjain2580/simple-linear-regression-model
- Owner: sjain2580
- Created: 2025-09-12T10:17:05.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-09-12T11:02:22.000Z (2 months ago)
- Last Synced: 2025-09-12T12:26:13.345Z (2 months ago)
- Topics: joblib, linear-regression, matplotlib, matplotlib-pyplot, numpy, python, scikit-learn
- Language: Jupyter Notebook
- Homepage:
- Size: 414 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Simple Linear Regression Model - California Housing Price Prediction with Linear Regression
## Overview
This project demonstrates a simple, yet robust, multiple linear regression model built with Python and scikit-learn to predict median house values in California.
## Features
- Multiple Features: The model uses multiple features (Median Income, House Age, and Average Rooms) for more accurate predictions.
- Data Preprocessing: It includes a machine learning pipeline to handle data scaling, a crucial step for many models.
- Model Persistence: The trained model is automatically saved to disk (linear_regression_model.joblib), allowing for easy reuse without retraining.
- Comprehensive Evaluation: The script calculates and prints key metrics (Mean Squared Error and R-squared) to evaluate the model's performance.
- Data Visualization: It generates and saves multiple plots (housing_prices_plot.png and housing_prices_residual_plot.png) for visual analysis.
- Prediction Functionality: The script includes a practical example of how to use the trained model to make a prediction on new, unseen data.
## Technologies used
- Python: The core programming language for the project.
- scikit-learn: A powerful machine learning library used for building the model, data splitting, and evaluation.
- NumPy: A fundamental library for numerical operations and handling the dataset arrays.
- Matplotlib: Used for creating the data visualizations, including the scatter and residual plots.
- joblib: A library for saving and loading the trained machine learning model.
## Model used (Architecture)
The core of this project is a LinearRegression model, which is a fundamental algorithm in supervised machine learning. The model is implemented within a scikit-learn pipeline. This pipeline's architecture consists of two main stages:
1. Data Preprocessing: The StandardScaler scales each feature to a mean of 0 and a standard deviation of 1. For ordinary least squares this does not change the predictions, but it puts the coefficients on a comparable scale and keeps the pipeline ready for scale-sensitive estimators such as regularized regression.
2. Regression Model: The LinearRegression estimator fits a linear model to the preprocessed data, finding the best-fit line (or hyperplane in this case) that minimizes the sum of squared errors between the predicted and actual values.
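
The two stages above can be sketched as a scikit-learn `Pipeline`. This is a minimal illustration, not the repository's script: the synthetic data stands in for the three housing features, and the step names `scaler` and `regressor` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three features (income, house age, average rooms).
# The real project uses the California housing data instead.
rng = np.random.default_rng(0)
X = rng.normal(loc=[4.0, 30.0, 5.0], scale=[2.0, 12.0, 1.5], size=(200, 3))
y = 0.4 * X[:, 0] + 0.01 * X[:, 1] - 0.05 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Stage 1 standardizes the features; stage 2 fits ordinary least squares.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipeline.fit(X, y)

# Predict for one new, hypothetical district (same feature order as X).
new_district = np.array([[5.0, 25.0, 6.0]])
prediction = pipeline.predict(new_district)
print(prediction)
```

Because the scaler lives inside the pipeline, the same transformation learned during `fit()` is applied automatically at prediction time.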
## Data Processing
The project performs the following data processing steps:
- Data Splitting: The dataset is divided into a training set (80%) and a testing set (20%) to ensure the model's performance is evaluated on unseen data.
- Feature Scaling: A StandardScaler is applied to the input features, transforming each one to zero mean and unit variance. This puts features of different magnitudes on a common scale, which makes the fitted coefficients directly comparable.
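
A minimal sketch of these two steps, assuming synthetic data and a fixed `random_state` of 42 (both are choices made for this example, not taken from the project's script):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with very different magnitudes.
rng = np.random.default_rng(42)
X = rng.normal(loc=[4.0, 30.0, 5.0], scale=[2.0, 12.0, 1.5], size=(100, 3))
y = rng.normal(size=100)

# 80% of the rows go to training, 20% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1 up to floating-point error. The test columns are only approximately standardized, since they reuse the training statistics.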
## Data Analysis
This project performs data analysis through both quantitative metrics and visual inspection:
- Quantitative Metrics: The model's performance is evaluated using two standard metrics:
  - Mean Squared Error (MSE): Measures the average squared difference between the predicted and actual values. A lower MSE indicates a better fit.
  - R-squared (R²): Represents the proportion of the variance in the dependent variable that can be predicted from the independent variables. A score closer to 1.0 indicates a stronger fit.
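
To make the two metrics concrete, the sketch below computes them by hand on a small made-up set of values and compares the results with scikit-learn's `mean_squared_error` and `r2_score`. The numbers are illustrative only and are not results from the project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_pred = np.array([2.5, 1.0, 2.5, 4.0])

# MSE: the mean of the squared prediction errors.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Every error here is ±0.5, so the MSE is 0.25. The total sum of squares is 5.25, giving an R² of about 0.81.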
## Model Training
Training is set up to be reproducible and to avoid unnecessary retraining:
- Training: The fit() method is called on the machine learning pipeline, which first scales the training data and then trains the LinearRegression model.
- Persistence: Once trained, the entire pipeline is saved to a .joblib file, so it can be loaded later to make predictions without retraining. When the script runs, it checks whether this file exists and either loads the saved model or trains a new one.
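
The load-or-train behavior described above might look roughly like this. The helper name `load_or_train` is hypothetical, the data is synthetic, and the file is written to a temporary directory so the sketch does not touch the project's real `linear_regression_model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def load_or_train(model_path, X, y):
    """Return a fitted pipeline, loading it from disk if a saved copy exists."""
    if os.path.exists(model_path):
        return joblib.load(model_path)
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X, y)
    joblib.dump(model, model_path)  # persist the whole pipeline, scaler included
    return model


# Synthetic training data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear_regression_model.joblib")
    first = load_or_train(path, X, y)   # no file yet: trains and saves
    file_saved = os.path.exists(path)
    second = load_or_train(path, X, y)  # file exists: loads the saved pipeline
    same_predictions = np.allclose(first.predict(X), second.predict(X))

print(file_saved, same_predictions)
```

Saving the full pipeline rather than only the regressor means the fitted scaler travels with the model, so new inputs can be passed in their original units.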
## Prerequisites
- Python 3.11+
- Required packages (install via `pip`): scikit-learn, NumPy, Matplotlib, joblib
## How to Run the Project
1. Clone this repository to your local machine:
```bash
git clone https://github.com/sjain2580/simple-linear-regression-model.git
cd simple-linear-regression-model
```
2. Create and activate a virtual environment (optional but recommended): `python -m venv venv`
- On Windows:
```bash
.\venv\Scripts\activate
```
- On macOS/Linux:
```bash
source venv/bin/activate
```
3. Install the required libraries:
```bash
pip install -r requirements.txt
```
4. Run the script: Execute the main Python script from your terminal.
```bash
python simple_linear_regression.py
```
## Visualization
- Prediction Plot: Compares the model's predicted house values against the actual values to show how well the linear relationship is captured.

- Residual Plot: Plots the difference between the actual and predicted values. A good residual plot shows a random scatter of points around the zero line, indicating that the model's assumptions are met and it is not systematically under- or over-predicting.
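
A rough sketch of how both plots could be produced with Matplotlib. The values are made up, both plots are drawn side by side in one figure for brevity, and the output file name `residual_sketch.png` is an assumption. The project itself saves two separate images.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up test targets and predictions, for illustration only.
y_test = np.array([1.2, 2.5, 3.1, 0.9, 4.0])
y_pred = np.array([1.0, 2.8, 3.0, 1.3, 3.6])
residuals = y_test - y_pred  # actual minus predicted

fig, (ax_pred, ax_res) = plt.subplots(1, 2, figsize=(10, 4))

# Prediction plot: points on the dashed diagonal are perfect predictions.
ax_pred.scatter(y_test, y_pred)
limits = [y_test.min(), y_test.max()]
ax_pred.plot(limits, limits, linestyle="--")
ax_pred.set_xlabel("Actual value")
ax_pred.set_ylabel("Predicted value")

# Residual plot: a random scatter around zero suggests no systematic bias.
ax_res.scatter(y_pred, residuals)
ax_res.axhline(0.0, linestyle="--")
ax_res.set_xlabel("Predicted value")
ax_res.set_ylabel("Residual")

out_path = os.path.join(tempfile.gettempdir(), "residual_sketch.png")
fig.tight_layout()
fig.savefig(out_path)
plt.close(fig)
```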

## Contributors
Feel free to fork this repository, open issues, or submit pull requests to improve the project. Suggestions for model enhancements or additional visualizations are welcome!
## Connect with Me
Feel free to reach out if you have any questions or just want to connect!
**[LinkedIn](https://www.linkedin.com/in/sjain04/)**
**[GitHub](https://github.com/sjain2580)**
**[Email](mailto:sjain040395@gmail.com)**
---