https://github.com/tolumie/aviva-insurance-statistics-hypothesis-abtesting-modelling

This project explores the impact of demographic and lifestyle factors on insurance charges, using statistical hypothesis testing (ANOVA, Chi-Square, T-tests) and predictive modeling (Elastic Net, Random Forest, Gradient Boosting). The analysis is deployed using Streamlit.

Host: GitHub
URL: https://github.com/tolumie/aviva-insurance-statistics-hypothesis-abtesting-modelling
Owner: Tolumie
Created: 2025-03-06T18:27:31.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-03-07T10:57:17.000Z (8 months ago)
Last Synced: 2025-03-07T11:35:01.093Z (8 months ago)
Topics: anova, chi-square-test, data-visualization, eda, gradient-boosting, hypothesis-testing, insurance-dataset, machine-learning, predictive-modeling, python, random-forest, statistical-analysis, streamlit
Homepage:
Size: 1000 Bytes
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

### **Project: Aviva Insurance Data Analysis ~ Hypothesis Testing and Predictive Modeling**

## **📌 Table of Contents**
1. [Introduction](#introduction)
2. [Data Overview & Objective](#data-overview--objective)
3. [Data Preparation & Descriptive Statistics](#data-preparation--descriptive-statistics)
   - 3a. Data Preparation
     - 3a.1 Import Required Libraries
     - 3a.2 Data Loading and Overview
     - 3a.3 Data Cleaning and Encoding
   - 3b. Descriptive Statistics
     - 3b.1 Descriptive Statistics
     - 3b.2 Correlation Matrix
4. [Hypothesis Testing](#hypothesis-testing)
   - 4.1 Steps in Hypothesis Testing
   - 4.2 Hypothesis 1: BMI and Genders
   - 4.3 Hypothesis 2: ANOVA ~ Age vs Charges
   - 4.4 Hypothesis 3: BMI Impact on Charges
   - 4.5 Hypothesis 4: Medical Claims of Smokers vs Non-Smokers
   - 4.6 Hypothesis 5: ANOVA Analysis ~ BMI and Number of Children
   - 4.7 Chi-Square Test for Smoking Proportions Across Regions
   - 4.8 Advanced Predictive Modeling
     - 4.8.1 Elastic Net for Linear Regression with GridSearchCV
     - 4.8.2 Random Forest Regressor with GridSearchCV
     - 4.8.3 Gradient Boosting Regressor with GridSearchCV for Hyperparameter Tuning
5. [Conclusion](#conclusion)
6. [Deployment with Streamlit](#deployment-with-streamlit)
7. [Next Steps](#next-steps)

---

## **📌 Introduction**
This project provides Aviva with an in-depth analysis of the factors influencing insurance charges. It focuses on key demographic and lifestyle attributes such as **age, number of children, smoking status, and BMI**.

We employ **Exploratory Data Analysis (EDA)** to identify trends and distributions, **Hypothesis Testing** to uncover statistical relationships, and **Predictive Modeling** to forecast insurance charges. The findings aim to **enhance risk assessment and optimize underwriting strategies**.

---

## **📌 Data Overview & Objective**
The dataset consists of **1,338 records**, capturing key attributes such as:

- **Age**: Customer's age in years
- **Gender**: Male or Female
- **BMI**: Body Mass Index, a health risk indicator based on weight and height
- **Smoker**: Whether the customer is a smoker or non-smoker
- **Region**: Geographic location (Northeast, Northwest, Southeast, Southwest)
- **Charges**: The annual insurance premium charged

### **🔹 Project Objectives**
✅ **Exploratory Data Analysis (EDA)** – Identify trends, distributions, and relationships within the dataset
✅ **Hypothesis Testing** – Evaluate how demographic factors impact insurance charges
✅ **Predictive Analysis** – Develop models to forecast charges, helping Aviva make data-driven underwriting decisions
✅ **Customer Insights** – Identify risk patterns and tailor insurance premiums accordingly

---

## **📌 Data Preparation & Descriptive Statistics**
### **🔹 Data Preparation**
- **Import Required Libraries** – Load necessary Python libraries for analysis
- **Data Loading and Overview** – Read the dataset and inspect missing values, data types, and general structure
- **Data Cleaning and Encoding** – Handle missing values, outliers, and categorical encoding
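
The preparation steps above might look roughly like the following. This is a minimal sketch, not the notebook's actual code: the column names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`), the made-up sample rows, and the `prepare` helper are all assumptions for illustration, and median imputation is only one possible way to handle missing values.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the column names are assumptions.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,northeast,3000.0
40,male,30.1,2,yes,southeast,25000.0
33,male,,1,no,northwest,5200.0
58,female,27.8,3,no,southwest,12000.0
40,male,30.1,2,yes,southeast,25000.0
"""


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, impute missing BMI, and encode categoricals."""
    df = df.drop_duplicates().copy()
    # Fill missing BMI values with the column median (one simple choice).
    df["bmi"] = df["bmi"].fillna(df["bmi"].median())
    # Two-level columns become 0/1 integers.
    df["smoker"] = (df["smoker"] == "yes").astype(int)
    df["sex"] = (df["sex"] == "male").astype(int)
    # Region has four levels: one-hot encode it, dropping one as the baseline.
    df = pd.get_dummies(df, columns=["region"], drop_first=True, dtype=int)
    return df


raw = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(raw.isna().sum())  # missing values per column
print(raw.dtypes)        # data types per column

clean = prepare(raw)
print(clean.head())
```

With the real data, `pd.read_csv` would point at the dataset file instead of the inline sample.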

### **🔹 Descriptive Statistics**
- **General Summary Statistics** – Mean, median, standard deviation, and distribution of key variables
- **Correlation Matrix** – Understanding relationships between numeric features
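
As a rough illustration of these two steps, the sketch below computes summary statistics and a Pearson correlation matrix with pandas. The data is synthetic and generated only so the example runs on its own; the column names are assumptions and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption); replace with the cleaned dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "children": rng.integers(0, 6, size=n),
})
df["charges"] = 250 * df["age"] + rng.normal(0, 3000, size=n)

# describe() gives count, mean, std, min, quartiles, and max per column.
summary = df.describe().T
summary["median"] = df.median()  # same as the 50% quantile, added for clarity

# Pairwise Pearson correlation between the numeric features.
corr = df.corr(method="pearson")

print(summary.round(2))
print(corr.round(2))
```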

---

## **📌 Hypothesis Testing**
### **🔹 Steps in Hypothesis Testing**
1. Define **Null (H₀)** and **Alternative (H₁)** hypotheses
2. Choose an appropriate **statistical test**
3. Set a **significance level (α = 0.05)**
4. Compute the **test statistic and p-value**
5. Interpret the results and **reject or fail to reject H₀**
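
The five steps can be sketched for a two-group comparison as follows. This is an illustrative example under stated assumptions: it uses Welch's t-test from SciPy (which does not assume equal variances), the `two_sample_test` helper is hypothetical, and the two groups are synthetic samples rather than values from the project's data.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # step 3: significance level


def two_sample_test(a, b, alpha=ALPHA):
    """Welch's t-test. H0: both groups share the same mean. H1: the means differ."""
    # Step 4: compute the test statistic and p-value.
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    # Step 5: compare the p-value with alpha.
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return t_stat, p_value, decision


# Synthetic groups for illustration only (assumption, not project data).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=30000, scale=8000, size=100)
group_b = rng.normal(loc=9000, scale=5000, size=100)

t_stat, p_value, decision = two_sample_test(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, decision: {decision}")
```

The same helper shape would fit any of the two-group questions, such as comparing BMI between genders or charges between smokers and non-smokers.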

### **🔹 Key Hypothesis Tests**
📌 **Hypothesis 1:** Does BMI differ significantly between males and females?
📌 **Hypothesis 2:** Does **age** influence **insurance charges** (ANOVA Test)?
📌 **Hypothesis 3:** Is there a correlation between **BMI** and **charges**?
📌 **Hypothesis 4:** Do **smokers pay significantly higher premiums** than non-smokers?
📌 **Hypothesis 5:** Does **BMI vary based on the number of children** (ANOVA Test)?
📌 **Chi-Square Test:** Is the **proportion of smokers different across regions**?
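
Hypotheses 1 and 4 are two-group comparisons that a t-test can cover. The remaining questions could be approached with SciPy roughly as sketched below. The data is synthetic, the column names and age bands are assumptions, and the printed p-values are not the project's findings.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (assumption); replace with the cleaned dataset.
rng = np.random.default_rng(7)
n = 400
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "children": rng.integers(0, 4, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
    "region": rng.choice(
        ["northeast", "northwest", "southeast", "southwest"], size=n
    ),
})
df["charges"] = 250 * df["age"] + 400 * df["bmi"] + rng.normal(0, 4000, size=n)

# Hypothesis 2: one-way ANOVA of charges across age bands (bands are assumed).
df["age_band"] = pd.cut(
    df["age"], bins=[17, 30, 45, 64], labels=["18-30", "31-45", "46-64"]
)
age_groups = [
    g["charges"].to_numpy() for _, g in df.groupby("age_band", observed=True)
]
f_age, p_age = stats.f_oneway(*age_groups)

# Hypothesis 3: Pearson correlation between BMI and charges.
r, p_corr = stats.pearsonr(df["bmi"], df["charges"])

# Hypothesis 5: one-way ANOVA of BMI across number of children.
child_groups = [g["bmi"].to_numpy() for _, g in df.groupby("children")]
f_child, p_child = stats.f_oneway(*child_groups)

# Chi-square test of independence: smoker status versus region.
table = pd.crosstab(df["region"], df["smoker"])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

print(f"H2 ANOVA age bands: F = {f_age:.2f}, p = {p_age:.3g}")
print(f"H3 correlation:     r = {r:.2f}, p = {p_corr:.3g}")
print(f"H5 ANOVA children:  F = {f_child:.2f}, p = {p_child:.3g}")
print(f"Chi-square:         chi2 = {chi2:.2f}, dof = {dof}, p = {p_chi:.3g}")
```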

---

## **📌 Predictive Modeling**
We build **three machine learning models** to predict insurance charges:

✅ **Elastic Net Regression** – A regularized linear model combining L1 (Lasso) & L2 (Ridge) penalties
✅ **Random Forest Regressor** – An ensemble learning model using multiple decision trees
✅ **Gradient Boosting Regressor** – An ensemble that builds decision trees sequentially, each one correcting the errors of the trees before it

All models undergo **hyperparameter tuning** using **GridSearchCV**.
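
A minimal sketch of tuning the three models with scikit-learn's `GridSearchCV` is shown below. The parameter grids, the 5-fold cross-validation, the R² scoring, and the synthetic features are all assumptions chosen to keep the example small; the grids used in the notebook may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (assumption): age, bmi, smoker flag.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.integers(18, 65, size=n),
    rng.normal(30, 6, size=n),
    rng.integers(0, 2, size=n),
])
y = 250 * X[:, 0] + 300 * X[:, 1] + 20000 * X[:, 2] + rng.normal(0, 2000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each entry pairs an estimator with a small, illustrative parameter grid.
candidates = {
    "elastic_net": (
        # Elastic Net is scale-sensitive, so standardize inside the pipeline.
        make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000)),
        {
            "elasticnet__alpha": [0.01, 0.1, 1.0],
            "elasticnet__l1_ratio": [0.2, 0.5, 0.8],
        },
    ),
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
    "gradient_boosting": (
        GradientBoostingRegressor(random_state=0),
        {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        # score() evaluates the refitted best estimator on held-out data.
        "test_r2": search.score(X_test, y_test),
    }
    print(name, results[name])
```

Keeping the scaler inside the pipeline means it is refit on each training fold, which avoids leaking information from the validation fold into the preprocessing.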

---

## **📌 Conclusion**
This study identifies the **factors affecting insurance charges**, tests the statistical significance of their relationships, and builds **predictive models** for premium estimation.

Key takeaways include:
✅ **Smoking has the highest impact on insurance costs**
✅ **BMI and age significantly influence charges**
✅ **Predictive models help forecast costs, improving risk assessment**

---

## **📌 Deployment with Streamlit**
The final analysis is deployed using **Streamlit**, allowing interactive exploration of the results.

💡 **To run the app locally:**
```bash
streamlit run app.py
```

---

## **📌 Next Steps**
🔹 Expand the dataset to include more policyholders for improved generalization
🔹 Incorporate additional features like **pre-existing medical conditions**
🔹 Fine-tune predictive models with **ensemble learning and deep learning approaches**

---

## **🚀 Get Started**
Clone this repository and explore the notebook:
```bash
git clone https://github.com/Tolumie/Statistics_Hypothesis_AB_Testing.git
cd Statistics_Hypothesis_AB_Testing
```
