{"id":22561984,"url":"https://github.com/pradipece/insurance_data_analysis_ml","last_synced_at":"2026-04-11T13:04:00.033Z","repository":{"id":264827338,"uuid":"894392712","full_name":"pradipece/Insurance_Data_Analysis_ML","owner":"pradipece","description":"This project approach defines the terms machine learning and linear regression ML algorithm in the context of real-time problem-solving.","archived":false,"fork":false,"pushed_at":"2024-11-30T15:45:14.000Z","size":1292,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-02T13:15:42.693Z","etag":null,"topics":["data-science","data-visualization","database","machine-learning","matplotlib","numpy","pandas","python","python3"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pradipece.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-26T09:25:59.000Z","updated_at":"2024-11-30T16:08:26.000Z","dependencies_parsed_at":"2024-11-26T18:06:37.202Z","dependency_job_id":null,"html_url":"https://github.com/pradipece/Insurance_Data_Analysis_ML","commit_stats":null,"previous_names":["pradipece/insurance_data_analysis_ml"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pradipece%2FInsurance_Data_Analysis_ML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pradipece%2FInsurance_Data_Analysis_ML/tags","releases_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/pradipece%2FInsurance_Data_Analysis_ML/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pradipece%2FInsurance_Data_Analysis_ML/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pradipece","download_url":"https://codeload.github.com/pradipece/Insurance_Data_Analysis_ML/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246034232,"owners_count":20712851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","data-visualization","database","machine-learning","matplotlib","numpy","pandas","python","python3"],"created_at":"2024-12-07T22:11:08.054Z","updated_at":"2025-12-30T23:20:33.453Z","avatar_url":"https://github.com/pradipece.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"### Overview\n\nThis project consists of the following topics:\n\n- Understanding machine learning algorithms and exploring the dataset\n- Linear regression with one or more features using Scikit-learn\n- Using categorical features for machine learning and data analysis\n- Regression coefficients and feature importance\n- Other models and techniques for regression using Scikit-learn\n- Applying linear regression to other datasets\n\n### Problem Statement\n\nThis approach defines the terms _machine learning_ and _linear regression_ in the context of a problem, and later generalizes their definitions:\n\n\u003e **QUESTION**: ACME Insurance Inc. offers affordable health insurance to thousands of customers in the United States. As the lead data scientist at ACME, **you're tasked with creating an automated system to estimate the annual medical expenditure for new customers**, using information such as their age, sex, BMI, children, smoking habits, and region of residence.\n\u003e\n\u003e Estimates from the system, subject to regulatory requirements, will determine the annual insurance premium (amount paid every month) offered to the customer.\n\u003e\n\u003e You're given a [CSV file](https://raw.githubusercontent.com/JovianML/opendatasets/master/data/medical-charges.csv) containing verified historical data, consisting of the aforementioned information and the actual medical charges incurred by over 1300 customers.\n\u003e \u003cimg src=\"https://i.imgur.com/87Uw0aG.png\" width=\"480\"\u003e\n\u003e\n\u003e Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets\n\n### Machine Learning\n\nCongratulations, you've just trained your first _machine learning model!_ Machine learning is simply the process of computing the best parameters to model the relationship between some features and targets.\n\nEvery machine learning problem has three components:\n\n1. **Model**\n\n2. **Cost Function**\n\n3. **Optimizer**\n\nWe'll look at several examples of each of the above in future tutorials. Here's how the relationship between these three components can be visualized:\n\n\u003cimg src=\"https://www.deepnetts.com/blog/wp-content/uploads/2019/02/SupervisedLearning.png\" width=\"480\"\u003e\n\n### Categorical Features ML\n\nComputations can only be performed with numbers, so only numeric columns can be used directly. Converting categorical columns such as \"smoker\" to numbers makes it possible to train a single model for the entire dataset.\n\nThree common techniques for this conversion are:\n\n1. If a categorical column has just two categories (it's called a binary category), then we can replace their values with 0 and 1.\n2. If a categorical column has more than 2 categories, we can perform one-hot encoding, i.e. create a new column for each category with 1s and 0s.\n3. If the categories have a natural order (e.g. cold, neutral, warm, hot), then they can be converted to numbers (e.g. 1, 2, 3, 4) preserving the order. These are called ordinals.\n\n### One-hot Encoding\n\nThe \"region\" column contains 4 values, so we'll need to use one-hot encoding and create a new column for each region.\n\n![](https://i.imgur.com/n8GuiOO.png)\n\n### How to Approach a Machine Learning Problem\n\nHere's a strategy you can apply to approach any machine learning problem:\n\n1. Explore the data and find correlations between inputs and targets\n2. Pick the right model, loss function, and optimizer for the problem at hand\n3. Scale numeric variables and one-hot encode categorical data\n4. Set aside a test set (using a fraction of the training set)\n5. Train the model\n6. Make predictions on the test set and compute the loss\n\nFinally, apply this process to several machine learning problems.\n\n### Conclusions\n\nThis project covered the following topics:\n\n- A typical problem statement for machine learning\n- Downloading and exploring a dataset for machine learning\n- Linear regression with one variable using Scikit-learn\n- Linear regression with multiple variables\n- Using categorical features for machine learning\n- Regression coefficients and feature importance\n- Creating a training and test set for reporting results\n\n### Reference\n\nApply these machine learning techniques to the following datasets:\n\n- https://www.kaggle.com/vikrishnan/boston-house-prices\n- https://www.kaggle.com/sohier/calcofi\n- https://www.kaggle.com/budincsevity/szeged-weather\n\nCheck out the following links to learn more about linear regression:\n\n- https://jovian.ai/aakashns/02-linear-regression\n- https://www.kaggle.com/hely333/eda-regression\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpradipece%2Finsurance_data_analysis_ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpradipece%2Finsurance_data_analysis_ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpradipece%2Finsurance_data_analysis_ml/lists"}