https://github.com/srinibas-masanta/insurance-premium-prediction
This project predicts medical insurance charges using machine learning models after performing data preprocessing, EDA, and feature engineering. It highlights key cost drivers like age and smoking status and uses trained models for accurate predictions. The entire workflow demonstrates an end-to-end approach to regression-based predictive modeling.
bivariate-analysis data-preprocessing exploratory-data-analysis insurance-prediction machine-learning regression-models univariate-analysis
- Host: GitHub
- URL: https://github.com/srinibas-masanta/insurance-premium-prediction
- Owner: srinibas-masanta
- License: mit
- Created: 2025-07-30T14:12:16.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-07-30T14:34:04.000Z (4 months ago)
- Last Synced: 2025-07-30T17:01:13.818Z (4 months ago)
- Topics: bivariate-analysis, data-preprocessing, exploratory-data-analysis, insurance-prediction, machine-learning, regression-models, univariate-analysis
- Language: Jupyter Notebook
- Homepage:
- Size: 622 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## Insurance Premium Prediction
This project aims to predict insurance premium charges using machine learning models based on an individual's demographic and lifestyle attributes. The dataset includes information such as age, BMI, number of children, region, sex, and smoking status. The goal is to build interpretable models that help identify key drivers of insurance costs and make accurate predictions.
### Dataset
* The dataset used is from Kaggle's [US Health Insurance Dataset](https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset/data).
* It contains 1,300+ records with the following features:
  * `age`: Age of the primary beneficiary
  * `sex`: Gender
  * `bmi`: Body mass index
  * `children`: Number of dependents
  * `smoker`: Whether the person smokes
  * `region`: Residential region in the US
  * `charges`: Medical insurance charges (target variable)
### Data Preprocessing & Feature Engineering
- Removed outliers using the interquartile range (IQR) method.
- Created an `age_group` feature by binning ages into categories (Young, Adult, Middle-aged, Senior).
- Encoded categorical variables:
  - One-hot encoding for `region`
  - Label encoding for `smoker` and `sex`
- Selected relevant numeric features based on correlation analysis.
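The preprocessing steps above can be sketched as follows. This is a minimal illustration on a small made-up sample, not the notebook's actual code: the choice of `charges` as the outlier column, the 1.5 × IQR fences, the age bin edges, and the `sex_encoded` column name are all assumptions. Label encoding is written as an explicit `map`, which gives the same alphabetical 0/1 codes that `sklearn.preprocessing.LabelEncoder` would assign here.

```python
import pandas as pd

# Made-up sample using the dataset's column names (values are illustrative only).
df = pd.DataFrame({
    "age": [19, 28, 33, 45, 52, 61, 37, 24],
    "sex": ["female", "male", "male", "female", "male", "female", "male", "female"],
    "bmi": [27.9, 33.0, 22.7, 30.1, 28.5, 26.2, 29.8, 24.3],
    "children": [0, 3, 0, 2, 1, 0, 2, 1],
    "smoker": ["yes", "no", "no", "no", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast",
               "southwest", "southeast", "northwest", "northeast"],
    "charges": [16884.92, 4449.46, 21984.47, 8240.59,
                9800.00, 13500.00, 6400.00, 250000.00],
})


def remove_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask].copy()


# Assumption: outliers are filtered on the target column.
clean = remove_iqr_outliers(df, "charges")

# Assumption: these bin edges are hypothetical; the README only names the labels.
clean["age_group"] = pd.cut(
    clean["age"],
    bins=[0, 25, 40, 55, 120],
    labels=["Young", "Adult", "Middle-aged", "Senior"],
)

# Label encoding for the two binary columns.
clean["smoker_encoded"] = clean["smoker"].map({"no": 0, "yes": 1})
clean["sex_encoded"] = clean["sex"].map({"female": 0, "male": 1})

# One-hot encoding for region (adds region_northeast, region_northwest, ...).
clean = pd.get_dummies(clean, columns=["region"])

print(clean[["age", "age_group", "smoker_encoded", "sex_encoded", "charges"]])
```

Filtering on the target is only one possible reading of "removed outliers"; the same helper can be applied to `bmi` or any other numeric column.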
### Exploratory Data Analysis (EDA)
Conducted univariate and bivariate analysis to understand feature distributions and relationships:
- **Univariate Plots**: Histograms of `age` and `bmi`, pie chart of smoker distribution, count plot of gender.
- **Bivariate Plots**:
  - Scatter plots of `age` vs. `charges` and `bmi` vs. `charges` (colored by smoking status)
  - Violin plot for `charges` by number of children
  - Box plot comparing charges for smokers vs non-smokers
  - Bar plots showing total charges and smoker counts by region
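Two of the bivariate plots listed above might be produced roughly like this. The sketch uses plain `matplotlib` on made-up values; the colors, figure size, and labels are assumptions, and the notebook itself may rely on `seaborn` instead.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; only the column names follow the dataset.
df = pd.DataFrame({
    "age": [19, 28, 33, 45, 52, 61, 37, 24, 48, 30],
    "smoker": ["yes", "no", "no", "no", "yes", "no", "no", "yes", "no", "no"],
    "charges": [16884.9, 4449.5, 5200.0, 8240.6, 39000.0,
                13500.0, 6400.0, 18500.0, 9100.0, 4700.0],
})

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot of age vs. charges, one color per smoking status.
for status, color in [("no", "tab:blue"), ("yes", "tab:red")]:
    subset = df[df["smoker"] == status]
    ax_scatter.scatter(subset["age"], subset["charges"],
                       color=color, label=f"smoker = {status}")
ax_scatter.set_xlabel("age")
ax_scatter.set_ylabel("charges")
ax_scatter.legend()

# Box plot comparing charges for non-smokers and smokers.
groups = [df.loc[df["smoker"] == s, "charges"].to_numpy() for s in ("no", "yes")]
ax_box.boxplot(groups)
ax_box.set_xticks([1, 2])
ax_box.set_xticklabels(["non-smoker", "smoker"])
ax_box.set_ylabel("charges")

fig.tight_layout()
# In a notebook, plt.show() would display the figure.
```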
### Correlation Analysis
* Pearson correlation was used to analyze relationships between numerical features.
* Only `age` and `smoker_encoded` showed a strong correlation with `charges`.
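A correlation screen of this kind can be sketched with `DataFrame.corr`. The values below are made up, so the printed coefficients say nothing about the real dataset; the point is only how each numeric feature is ranked against `charges`.

```python
import pandas as pd

# Made-up numeric sample; smoker_encoded is assumed to be 0 (no) / 1 (yes).
df = pd.DataFrame({
    "age": [19, 28, 33, 45, 52, 61, 37, 24],
    "bmi": [27.9, 33.0, 22.7, 30.1, 28.5, 26.2, 29.8, 24.3],
    "children": [0, 3, 0, 2, 1, 0, 2, 1],
    "smoker_encoded": [1, 0, 0, 0, 1, 0, 0, 1],
    "charges": [16884.9, 4449.5, 5200.0, 8240.6, 39000.0, 13500.0, 6400.0, 18500.0],
})

# Pearson correlation matrix, then the column for the target variable.
corr_matrix = df.corr(method="pearson")
corr_with_target = (
    corr_matrix["charges"]
    .drop("charges")              # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Features whose absolute coefficient falls below some chosen threshold would then be dropped; the README does not state which threshold was used.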
### Modeling and Evaluation
Three regression models were trained using the selected features (`age` and `smoker_encoded`) to predict insurance charges:
| Model | Mean Squared Error (MSE) | Mean Absolute Error (MAE) |
| ----------------- | ------------------------ | ------------------------- |
| Decision Tree | 19,994,952.54 | 2,757.66 |
| Random Forest | 19,946,316.36 | 2,749.04 |
| Gradient Boosting | 19,405,842.52 | 2,694.16 |
**Gradient Boosting** achieved the lowest MSE and MAE of the three models on the test data, although the margin is small: all three models are within about 3% of each other on both metrics.
Each model's predictions were also plotted against the actual values in **Actual vs Predicted** scatter plots to assess how well the model captured the underlying pattern in the data.
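The training and evaluation loop can be sketched as below. It runs on synthetic data generated inside the block, so its error values are unrelated to the table above; the 80/20 split, the `random_state` values, and the default hyperparameters are assumptions rather than the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the two selected features and the target.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker_encoded": rng.integers(0, 2, size=n),
})
y = 250 * X["age"] + 20000 * X["smoker_encoded"] + rng.normal(0, 2000, size=n)

# Assumption: an 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its test-set errors.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, predictions),
        "mae": mean_absolute_error(y_test, predictions),
    }

print(pd.DataFrame(results).T)
```

Because MSE is in squared units, taking its square root (RMSE) gives a figure on the same scale as MAE and the charges themselves, which makes the two columns easier to compare.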
### Feature Importance
Feature importance was extracted from all three models.
**Observation**: Smoking status consistently had a higher importance than age in predicting insurance charges.
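Tree-based scikit-learn regressors expose impurity-based importances through the `feature_importances_` attribute after fitting. The sketch below collects them for all three model types on synthetic data that was deliberately generated with a large smoking effect, so its ranking illustrates the mechanics only and is not the repository's result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the smoking term is made much larger than the age term.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker_encoded": rng.integers(0, 2, size=n),
})
y = 250 * X["age"] + 20000 * X["smoker_encoded"] + rng.normal(0, 2000, size=n)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# One column per model, one row per feature; each column sums to 1.
importances = pd.DataFrame(
    {name: model.fit(X, y).feature_importances_ for name, model in models.items()},
    index=X.columns,
)
print(importances.round(3))
```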
### Inference on New Data
The trained models were used to predict insurance charges for new input data.
Example:
```python
import pandas as pd
new_data = pd.DataFrame({'age': [31], 'smoker_encoded': [1]})
```
**Predicted Charges for a 31-year-old smoker:**
* Decision Tree: ₹19,275.16
* Random Forest: ₹19,190.37
* Gradient Boosting: ₹19,103.24
### Libraries Used
* `pandas`, `numpy`
* `matplotlib`, `seaborn`
* `sklearn`: preprocessing, model\_selection, tree, ensemble, metrics
### License
This project is licensed under the MIT License.