Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/samruddhi3012/insurance-price-forecasting
This is a Machine Learning project where I performed EDA and forecasted the insurance pricing using Linear Regression and XGBoost Regressor.
- Host: GitHub
- URL: https://github.com/samruddhi3012/insurance-price-forecasting
- Owner: samruddhi3012
- Created: 2024-07-29T15:30:39.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-08-17T18:04:27.000Z (5 months ago)
- Last Synced: 2024-08-18T17:06:14.886Z (5 months ago)
- Topics: bayesian-optimization, data-visualization, exploratory-data-analysis, linear-regression, machine-learning, pipeline, statistical-analysis, xgboost-regression
- Language: Jupyter Notebook
- Homepage:
- Size: 1.22 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Insurance Price Forecast
## Objectives
The objective of this project is to analyse the insurance dataset and build an optimized machine learning model to make accurate predictions of insurance costs.

## Data
The dataset contains historical records of 1338 insured customers with the following columns:
* Age: Age of the primary beneficiary.
* Sex: Gender of the primary beneficiary.
* BMI: Body mass index of the primary beneficiary.
* Children: Number of children the primary beneficiary has.
* Smoker: Whether the primary beneficiary smokes.
* Region: The primary beneficiary's residential area in the US.
* Charges: Individual medical costs billed by health insurance.
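The schema above can be sketched with a small pandas sample; the values below are illustrative, not taken from the real data, and the `insurance.csv` file name in the comment is an assumption:

```python
import pandas as pd

# A tiny sample mirroring the dataset's schema (values are illustrative).
df = pd.DataFrame({
    "age": [19, 45, 62],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 30.1, 26.2],
    "children": [0, 2, 1],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southeast"],
    "charges": [16884.92, 8010.50, 13204.33],
})

# The project works on the full 1338-row file, loaded along the lines of:
# df = pd.read_csv("insurance.csv")  # file name is an assumption

print(df.dtypes)
print(df.describe(include="all"))
```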
## Tools Used
* Tools: Python, Jupyter Notebook
* Keywords: Linear Regression, XGBoost, Box-Cox, Recursive Feature Elimination (RFE), BayesSearchCV, Sklearn Pipeline, Statistical Analysis, Chi-Squared test, ANOVA test, Exploratory data analysis, Univariate and Bivariate analysis, Quantile Quantile Plot, Trend Analysis
* Libraries: Pandas, Numpy, Scipy, Scikit-learn, Xgboost, Skopt
Install the required packages with pip:

```bash
pip install scipy
pip install -U scikit-learn
pip install scikit-optimize
```
## Results
* Baseline Linear Model
1. Transformation: Used Box-Cox for data transformation.
2. Training RMSE: **5830.10**
3. R-Squared: **0.80**
* XGBoost model
1. Optimization: Used Sklearn's Pipeline and Skopt's BayesSearchCV.
2. Training RMSE: **4627.11**
3. R-Squared: **0.87**
* Comparison of models
1. From the baseline to the XGBoost model, the test-set RMSE improved by **20.63%**.
2. The test-set R-Squared improved by **8.75%** from the baseline to the XGBoost model.
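The quoted improvement percentages follow directly from the reported metrics:

```python
# Quick check of the quoted improvements, using the reported metrics.
rmse_baseline, rmse_xgb = 5830.10, 4627.11
r2_baseline, r2_xgb = 0.80, 0.87

rmse_improvement = (rmse_baseline - rmse_xgb) / rmse_baseline * 100
r2_improvement = (r2_xgb - r2_baseline) / r2_baseline * 100

print(f"RMSE improvement: {rmse_improvement:.2f}%")       # 20.63%
print(f"R-squared improvement: {r2_improvement:.2f}%")    # 8.75%
```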
## Description
* _**Data Preparation**_:
1. Imported essential libraries for data manipulation and visualization.
2. Cleaned the data to handle missing values, correct data types, and remove inconsistencies.
* _**Exploratory Data Analysis (EDA)**_:
EDA was conducted to understand the dataset's underlying patterns and relationships. This section covered:
* Distribution Analysis
* Univariate Data Analysis (with respect to the target)
* Bivariate Data Analysis (with respect to the target)
* Collinearity between Features
* Correlation between Features
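The EDA steps above can be sketched in a few lines of pandas; the sample values are illustrative stand-ins for the real records:

```python
import pandas as pd

# Illustrative sample; the project runs these steps on the full dataset.
df = pd.DataFrame({
    "age": [19, 45, 62, 33],
    "bmi": [27.9, 30.1, 26.2, 22.7],
    "children": [0, 2, 1, 0],
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [16884.92, 8010.50, 13204.33, 21984.47],
})

# Distribution analysis of the target
print(df["charges"].describe())

# Bivariate analysis with respect to the target
print(df.groupby("smoker")["charges"].mean())

# Correlation / collinearity between numeric features
print(df[["age", "bmi", "children", "charges"]].corr())
```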
* _**Build and evaluate a baseline linear model**_:
A baseline model was built using Linear Regression. This section covered:
* Data Transformation
* Understanding Linear Regression Assumptions
* Building Linear Regression
* Validating Linear Regression Assumptions
* Model Training
* Model Evaluation:
Assessing the model's performance using metrics like R-squared and Mean Squared Error (MSE).
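A minimal sketch of the baseline approach, assuming a Box-Cox transform of the target followed by ordinary least squares; the data here is synthetic and the exact transforms in the notebook may differ:

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the insurance data (illustrative only).
X = rng.uniform(18, 64, size=(200, 1))            # e.g. age
y = 250 * X[:, 0] + rng.gamma(2.0, 2000.0, 200)   # right-skewed positive "charges"

# Box-Cox needs a strictly positive target; it returns the transformed
# values and the fitted lambda used to invert the transform later.
y_bc, lam = boxcox(y)

model = LinearRegression().fit(X, y_bc)

# Invert the transform so RMSE is reported in the original units.
pred = inv_boxcox(model.predict(X), lam)

rmse = np.sqrt(mean_squared_error(y, pred))
print(f"RMSE: {rmse:.2f}, R-squared: {r2_score(y, pred):.3f}")
```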
* _**Model Building using XGBoost Regressor**_:
An advanced model was built using the XGBoost Regressor. This section included:
* Preparing data specifically for the XGBoost model.
* Building Pipelines with Sklearn's Pipeline Operator
* Implementing BayesSearchCV for XGBoost Hyperparameter Optimization:
Using Bayesian optimization to fine-tune the hyperparameters of the XGBoost model.
* Model Evaluation
* _**Comparison of Models**_:
The performance of the Linear Regression model and the XGBoost Regressor was compared. The primary evaluation metric used for comparison was the Root Mean Square Error (RMSE).
* _**Performance of the Models**_:
The final step involved a detailed analysis of the models' performance. The XGBoost Regressor, with optimized hyperparameters, was expected to outperform the baseline Linear Regression model in terms of predictive accuracy and RMSE.
### _Thank you for visiting my repository!_