{"id":20332430,"url":"https://github.com/zeroby0/diabetes-prediction","last_synced_at":"2025-07-26T09:37:33.899Z","repository":{"id":175198138,"uuid":"345744755","full_name":"zeroby0/diabetes-prediction","owner":"zeroby0","description":"Diabetes Prediction","archived":false,"fork":false,"pushed_at":"2021-03-08T19:34:51.000Z","size":20,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-15T06:06:42.926Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://www.kaggle.com/c/widsdatathon2021","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zeroby0.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-08T17:49:01.000Z","updated_at":"2021-03-08T19:36:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"e53672ae-751a-4c2f-ab56-63b431e0eb46","html_url":"https://github.com/zeroby0/diabetes-prediction","commit_stats":null,"previous_names":["zeroby0/diabetes-prediction"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zeroby0/diabetes-prediction","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zeroby0%2Fdiabetes-prediction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zeroby0%2Fdiabetes-prediction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zeroby0%2Fdiabetes-prediction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zeroby0%2Fdiabetes-prediction/m
anifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zeroby0","download_url":"https://codeload.github.com/zeroby0/diabetes-prediction/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zeroby0%2Fdiabetes-prediction/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267145948,"owners_count":24042657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-26T02:00:08.937Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T20:26:34.898Z","updated_at":"2025-07-26T09:37:33.878Z","avatar_url":"https://github.com/zeroby0.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Diabetes Prediction\n\n## Why?\n\nSomeone asked me for help with predicting Diabetes using a dataset, and I later found that it's the [Kaggle WiDS Datathon 2021](https://www.kaggle.com/c/widsdatathon2021).\n\nThis was my first major foray into Data Science and Machine Learning, and I had a lot of fun!\nI'm sharing this so that I can refer to it later, and so that it can help others getting started.\n\nI'm leaving a basic outline of my process in the Readme, but you should refer to the attached [iPython Notebook](WIDS2021.ipynb) if you have the time.\n\n* [Preprocessing](#pre-processing)\n  - Remove features with too much missing data.\n  - 
Encode category columns with LabelEncoder.\n  - Impute missing values with IterativeImputer.\n  - Remove outliers with IsolationForest.\n* [Feature Engineering](#feature-engineering)\n  - Generate new features.\n  - Remove correlated features.\n* [Training](#training)\n  - Estimate number of iterations required for LGBM.\n  - Train LGBM.\n\n## Pre-processing\n\nYou can find details about the dataset and the metrics for evaluation at the Kaggle competition linked above.\nThe gist is, there are about 180 feature columns, and a Target column with boolean values indicating if the patient has Diabetes Mellitus or not. There were 130157 samples of labelled data, and 10234 samples of unlabelled data.\n\nMost of the samples have missing values. Of the 130k samples in labelled data, only about 3000 had no missing values.\nI dropped feature columns with too many missing values (more than 30,000 in the labelled data), and columns which are unlikely to be\ngood indicators of Diabetes.\n\nThere are some categorical features like Ethnicity, which I encoded using `sklearn.preprocessing.LabelEncoder`.\nIn hindsight, I should have left the categorical features [to lgbm](https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support).\n\nThen I imputed the missing values in the dataset using `sklearn.impute.IterativeImputer`.\n\nSince the dimensionality of the dataset is high, I could not use my usual, naive approach to outlier detection: distance from the mean.\nAfter some Google-fu, I found out that Isolation Forests work well with high-dimensionality datasets.\n`sklearn.ensemble.IsolationForest` to the rescue!\n\n## Feature Engineering\n\nI generated new feature columns by binning bmi, height, weight, and age.\n\nThen I generated several new features like:\n- Difference between max and min columns of a lab measurement. 
Example: d1_glucose_max and d1_glucose_min.\n- Whether the daily maximum and hourly maximum of a lab measurement are the same.\n- Whether the daily minimum and hourly minimum of a lab measurement are the same.\n- Difference of a feature's value from the mean of the feature.\n- Difference of a feature's value from the mean of the feature for other people in the same BMI/Height/Weight/Age bin.\n- Difference of a feature's value from the mean of the feature for other Diabetes Positive people in the same BMI/Height/Weight/Age bin.\n- Difference of a feature's value from the mean of the feature for other Diabetes Negative people in the same BMI/Height/Weight/Age bin.\n\nThen I removed highly correlated features.\nI used the SULOV method from [auto-viml](https://github.com/AutoViML/Auto_ViML) to find correlated features.\nYou can find it in the [uncorr.py](uncorr.py) file.\n\n## Training\n\nI estimated the number of iterations to use for LGBM by cross-validation, as suggested in https://sites.google.com/view/lauraepp/parameters.\n\nThen I created an ensemble of 5 LGBM classifiers which predict the probability of Diabetes, and combined them with equal weight.\n\nHere are my parameters for LGBM:\n- Boosting: Goss\n- Metric: Area Under RoC curve.\n- Learning Rate: 0.01\n- Number of Iterations: about 7000.\n\n\n## Results\n\nMy AuC RoC was 0.86791. 
There were around 800 teams, and the winning team had an AuC of 0.87804.\n\nThe difference in AuC is about 0.01, but my rank was 170 (at the time of writing).\nHere is what I would do differently if I tried to close the gap:\n\n- Generate more features.\n- Put back features that the correlation finder dropped but that (I think) should stay.\n- Pass the indexes of categorical features to LGBM so that it can process them better.\n- Actually learn how to properly [tune hyperparameters for LGBM](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html).\n- Train more ensembles with other classification algorithms (catboost?).\n\nOne could probably close the 0.01 gap with these, but it would take a [lot more time and effort](https://www.google.com/search?q=pareto+principle) than I'm willing to spend.\nI had already spent the better part of a week on this, and the tiny improvement would not have an impact meaningful enough to warrant indulging in it more.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzeroby0%2Fdiabetes-prediction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzeroby0%2Fdiabetes-prediction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzeroby0%2Fdiabetes-prediction/lists"}