https://github.com/arjunan-k/medical_insurance
Project to analyze and forecast medical insurance costs of patients using data science framework.
medical-insurance scikit-learn tableau
- Host: GitHub
- URL: https://github.com/arjunan-k/medical_insurance
- Owner: arjunan-k
- Created: 2022-10-21T14:12:49.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-12-22T19:46:55.000Z (about 2 years ago)
- Last Synced: 2024-11-11T11:12:13.860Z (3 months ago)
- Topics: medical-insurance, scikit-learn, tableau
- Language: HTML
- Homepage: https://arjunan-k.github.io/Medical_Insurance/
- Size: 2.74 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# MEDICAL INSURANCE COST PREDICTION
# PROJECT OVERVIEW
Health insurance costs have risen dramatically over the past decade in response to the rising cost of health care services and are determined by a multitude of factors. This project looks at the cost of healthcare for a sample of the population given age, sex, BMI, number of children, smoking habits, and region. The purpose is to determine the contributing factors and predict health insurance cost by performing exploratory data analysis in Tableau and predictive modeling on the Health Insurance dataset. The project uses NumPy, pandas, scikit-learn, and data visualization libraries.
**Overview:**
* Sought insight from the dataset through Exploratory Data Analysis using Tableau
* Performed Data Processing, Data Engineering and Feature Transformation to prepare the data for modeling
* Built models to predict Insurance Cost based on the features
* Evaluated the models using Performance Metrics: MSE, MAE, RMSE, RMSLE and R2

# DATA DESCRIPTION
1. age: age of primary beneficiary
2. sex: gender of the insurance contractor (female, male)
3. bmi: body mass index, an objective index of body weight relative to height (kg / m ^ 2) that shows whether weight is relatively high or low for a given height; ideally 18.5 to 24.9
4. children: number of children covered by health insurance / number of dependents
5. smoker: whether the beneficiary smokes
6. region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
7. charges: individual medical costs billed by health insurance

Data source: https://www.kaggle.com/mirichoi0218/insurance
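As a rough illustration of the `bmi` column described above, the index is weight in kilograms divided by the square of height in metres. The function name `bmi` and the example person are hypothetical and are not taken from the dataset or the project code.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
example = bmi(70.0, 1.75)
print(round(example, 2))
```

A value of this size falls inside the 18.5 to 24.9 range that the description calls ideal.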
# INSIGHTS
The insights drawn by performing `Data Analysis Using Tableau` are:

* Features like Sex and Region have an almost balanced distribution.
* The majority of policyholders fall in the Overweight and Obese categories.
* Most of the policyholders are Non-Smokers.
* The largest group of policyholders is 18 to 22 years old.
* A person who smokes and has a BMI above 30 (Obese) tends to have a higher medical cost.
* Older people who smoke have more expensive charges.
* Most of the policyholders have 1 to 2 children as dependents.

# DATA PROCESSING
1. Check missing values - there are none
2. Check duplicate values - there is 1 duplicate, which is removed
3. Feature engineering - make a new column `weight_status` based on BMI score
4. Feature transformation:
   A) Encoding `sex`, `region`, & `weight_status` attributes
   B) Ordinal encoding `smoker` attribute
5. Modeling:
   A) Separating target & features
   B) Splitting train & test data
   C) Modeling using Linear Regression, Decision Tree, Random Forest, Ridge Regression and Lasso Regression
   D) Feature Importance Ranking
   E) Find the best algorithm
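The processing steps above could be sketched in pandas roughly as follows. This is not the project's actual code: the sample rows are made up (only the column names follow the data description), the BMI cut-offs for `weight_status` are assumed WHO-style bands, one-hot encoding is an assumed choice for the categorical columns, and the 80/20 split with `random_state=42` is likewise an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample rows using the dataset's column names (not real records).
df = pd.DataFrame({
    "age": [19, 45, 33, 19, 52, 28],
    "sex": ["female", "male", "male", "female", "female", "male"],
    "bmi": [27.9, 31.2, 22.7, 27.9, 17.5, 36.0],
    "children": [0, 2, 1, 0, 3, 1],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "region": ["southwest", "southeast", "northwest",
               "southwest", "northeast", "southeast"],
    "charges": [16884.92, 8240.59, 4449.46, 16884.92, 11090.72, 38711.0],
})

# Steps 1-2: count missing values, then drop exact duplicate rows.
missing_total = int(df.isna().sum().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Step 3: derive weight_status from BMI (assumed cut-offs, left-closed bins).
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]
df["weight_status"] = pd.cut(
    df["bmi"], bins=bins, labels=labels, right=False
).astype(str)

# Step 4B: map smoker to an ordinal 0/1 code.
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})

# Step 4A: one-hot encode the remaining categorical columns (assumed encoding).
df = pd.get_dummies(
    df, columns=["sex", "region", "weight_status"], drop_first=True
)

# Steps 5A-5B: separate target and features, then split train and test data.
X = df.drop(columns="charges")
y = df["charges"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Using `drop_first=True` leaves one reference level out of each encoded column, which avoids perfectly collinear dummy columns in the linear models; tree-based models would work with or without it.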
# MODEL EVALUATION
| Metric | LinearRegression | DecisionTree | RandomForest | Ridge | Lasso |
| ----------- | ----------- | ----------- | ----------- | ----------- | ------------ |
| R2 (%) | 88.13 | 62.54 | 87.70 | 80.13 | 80.16 |
| MSE | 18895160 | 59610327 | 19571921 | 31620411 | 31569154 |
| MAE | 2824 | 3964 | 2647 | 3883 | 3872 |
| RMSE | 4347 | 7721 | 4424 | 5623 | 5619 |
| RMSLE | 8.377 | 8.952 | 8.395 | 8.635 | 8.634 |

# CONCLUSION
Based on the predictive modeling, Linear Regression (fitted as a Polynomial Regression of Degree 2) has the best scores on R2, MSE, RMSE and RMSLE; Random Forest is close behind and has the lowest MAE.
Therefore, Linear Regression is selected as the best fitted model.
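A minimal sketch of a degree-2 polynomial regression and the evaluation metrics might look like the following. It runs on synthetic stand-in data, not the insurance dataset, so its scores will not match the table; the feature set, coefficients, noise level, and split are all assumptions. The RMSLE values reported in the table appear to equal the natural log of RMSE (for example, ln 4347 ≈ 8.377), so the sketch reports that quantity rather than scikit-learn's `mean_squared_log_error`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumed shape: age, bmi, smoker flag).
rng = np.random.default_rng(0)
n = 400
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
X = np.column_stack([age, bmi, smoker])
# Assumed cost model with a smoker-by-BMI interaction plus noise.
y = 250 * age + 300 * bmi + 800 * smoker * bmi + rng.normal(0, 1500, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Degree-2 polynomial features followed by ordinary least squares.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
rmse = float(np.sqrt(mse))
scores = {
    "R2": r2_score(y_test, pred),
    "MSE": mse,
    "MAE": mean_absolute_error(y_test, pred),
    "RMSE": rmse,
    "log(RMSE)": float(np.log(rmse)),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The degree-2 expansion adds pairwise products such as smoker × bmi, which is one plausible reason a polynomial fit can capture the interaction noted in the insights better than a plain linear model.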