Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/pradipece/insurance_data_analysis_ml

This project approach defines the terms machine learning and linear regression ML algorithm in the context of real-time problem-solving.
https://github.com/pradipece/insurance_data_analysis_ml

data-science data-visualization database machine-learning matplotlib numpy pandas python python3

Last synced: 12 days ago
JSON representation

This project approach defines the terms machine learning and linear regression ML algorithm in the context of real-time problem-solving.

Awesome Lists containing this project

README

        

### Overview

This project consists of the following topics:

- Understanding the machine learning algorithms exploring the dataset
- Linear regression using Scikit-learn and adding some of the multiple features
- Using categorical features for machine learning and analysis of the data
- Regression coefficients and feature importance
- Other models and techniques for regression model using Scikit-learn
- Applying linear regression to other datasets

### Problem Statement

This approach defines the terms _machine learning_ and _linear regression_ in the context of a problem, and later generalizes their definitions:

> **QUESTION**: ACME Insurance Inc. offers affordable health insurance to thousands of customers in the United States. As the lead data scientist at ACME, **This tasked with creating an automated system to estimate the annual medical expenditure for new customers**, using information such as their age, sex, BMI, children, smoking habits, and region of residence.
>
> Estimates from the system regulatory requirements will determine the annual insurance premium (per month amount paid) offered to the customer.
>
> You're given a [CSV file](https://raw.githubusercontent.com/JovianML/opendatasets/master/data/medical-charges.csv) containing verified historical data, consisting of the aforementioned information and the actual medical charges incurred by over 1300 customers.
>
>
> Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets

### Machine Learning

Congratulations, you've just trained your first _machine learning model!_ Machine learning is simply the process of computing the best parameters to model the relationship between some feature and targets.

Machine learning problem has three components:

1. **Model**

2. **Cost Function**

3. **Optimizer**

We'll look at several examples of each of the above in future tutorials. Here's how the relationship between these three components can be visualized:

### Categorical Features ML

Using only numeric columns, and perform computations with numbers. The categorical columns such as "smoker", train a single model for the entire dataset.
For the conversion three common techniques are:

1. If a categorical column has just two categories (it's called a binary category), then we can replace their values with 0 and 1.
2. If a categorical column has more than 2 categories, we can perform one-hot encoding i.e. create a new column for each category with 1s and 0s.
3. If the categories have a natural order (e.g. cold, neutral, warm, hot), then they can be converted to numbers (e.g. 1, 2, 3, 4) preserving the order is called ordinals

### One-hot Encoding

The "region" column contains 4 values, so we'll need to use hot encoding and create a new column for each region.

![](https://i.imgur.com/n8GuiOO.png)

### How to Approach a Machine Learning Problem

Here's a strategy you can apply to approach any machine learning problem:

1. Explore the data and find correlations between inputs and targets
2. Pick the right model, loss functions and optimizer for the problem at hand
3. Scale numeric variables and one-hot encode categorical data
4. Set aside a test set (using a fraction of the training set)
5. Train the model
6. Make predictions on the test set and compute the loss

Finally, Apply this process to several problems for ML.

## Conclusions

Covered the following topics in this project

- A typical problem statement for machine learning
- Downloading and exploring a dataset for machine learning
- Linear regression with one variable using Scikit-learn
- Linear regression with multiple variables
- Using categorical features for machine learning
- Regression coefficients and feature importance
- Creating a training and test set for reporting results

### Reference

Apply the techniques of ML

- https://www.kaggle.com/vikrishnan/boston-house-prices
- https://www.kaggle.com/sohier/calcofi
- https://www.kaggle.com/budincsevity/szeged-weather

Check out the following links to learn more about linear regression:

- https://jovian.ai/aakashns/02-linear-regression
- https://www.kaggle.com/hely333/eda-regression