https://github.com/itancio/diabetesclassification

Classify a diabetes diagnosis based on patients' data.
https://github.com/itancio/diabetesclassification

binary-classification machine-learning

Last synced: about 1 month ago
JSON representation

Classify a diabetes diagnosis based on patients' data.

Host: GitHub
URL: https://github.com/itancio/diabetesclassification
Owner: itancio
Created: 2024-06-13T11:14:29.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-06-18T07:40:40.000Z (almost 2 years ago)
Last Synced: 2025-02-21T22:41:24.405Z (over 1 year ago)
Topics: binary-classification, machine-learning
Language: Jupyter Notebook
Homepage:
Size: 29.8 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Diabetes Binary Classification

## Objective:
The goal is to classify diabetes diagnosis based on various patient data. The dataset includes demographics, lifestyle, medical history, clinical measurements, medications, symptoms, and more. The aim is to identify factors strongly correlated with diabetes and use these features to train a classification model.

## Process Overview:

### Import Libraries and Datasets:

Libraries such as Pandas, NumPy, TensorFlow, Scikit-learn, and Seaborn were used for data handling, visualization, and model building.
The dataset was read from a CSV file containing data for 1,879 patients.

This dataset contains comprehensive health data for 1,879 patients, uniquely identified with IDs ranging from 6000 to 7878. The data includes demographic details, lifestyle factors, medical history, clinical measurements, medication usage, symptoms, quality of life scores, environmental exposures, and health behaviors. Each patient is associated with a confidential doctor in charge, ensuring privacy and confidentiality.

[Data source: Diabetes Health Dataset Analysis](https://www.kaggle.com/datasets/rabieelkharoua/diabetes-health-dataset-analysis/data)

### Data Preprocessing:

* Inspected and cleaned the dataset, dropping unnecessary columns like PatientID and DoctorInCharge.
* Visualized the data to understand distributions and correlations.

* Using pairwise plotting, we are able to visualize the a pairwise correlations of each feature from our dataset.

* A different way of seeing this relationship is this 3d plot that shows the relationships of three features. The most notable features include _Fasting Blood Sugar_, _Hb1Ac_ and their diagnoses.

* Selected relevant features such as Age, BMI, SystolicBP, DiastolicBP, FastingBloodSugar, HbA1c, and SerumCreatinine.
* Scaled the data using MinMaxScaler to normalize feature values.

Before: The data show the info from the original dataset.

After: Here shows the scaled versions of each features.

### Model Training:
* Split the data into training (60%), cross-validation (20%), and test sets (20%).
* Trained multiple models, including Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, AdaBoost, Extra Trees, XGBoost, and LightGBM.
* Evaluated models using accuracy metrics on the cross-validation set.
* Used polynomial features to improve model performance but found the optimal degree for minimal cross-validation mean square error.

### Model Selection and Evaluation:
The RandomForestClassifier was chosen as the optimal model based on cross-validation accuracy.
Adjusted the decision boundary to optimize performance, achieving an accuracy of 87.77%.

## Choices Made and Their Justification:

### Feature Selection:
Focused on clinically relevant features with strong correlation to diabetes, ensuring a meaningful input for models.

Here, we are showing the performance of different degrees of polynomials added in our linear model. (Blue line) We can see the the cross-validation 2nd degree polynomial can help data fit our model better.

### Data Scaling:
Used MinMaxScaler to ensure all features were on a similar scale, which is crucial for the performance of many machine learning models.

### Model Selection:
Evaluated multiple models to find the best performer, ensuring robustness in the classification task. RandomForest was chosen for its high accuracy and general performance in classification problems.

### Cross-Validation:
Used to tune hyperparameters and select the best model, providing a reliable estimate of model performance.

## Improvements and Next Steps:

* **Feature Engineering**: Explore more sophisticated feature engineering techniques or interactions between features.
Consider domain-specific knowledge to create new features.
* Hyperparameter Tuning**: Perform a more extensive hyperparameter tuning for the selected models using grid search or Bayesian optimization.
* **Model Ensemble**: Combine multiple models to create an ensemble that could potentially improve performance over individual models.
* **Regularization**: Implement regularization techniques to prevent overfitting, especially for complex models like neural networks.
* **Further Validation**: Use additional validation techniques such as bootstrapping or k-fold cross-validation to ensure the model's robustness.

By following these steps, the model can be further refined and its performance potentially improved, ensuring better accuracy and reliability in diagnosing diabetes based on patient data.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/itancio/diabetesclassification

Awesome Lists containing this project

README