# Diabetes Health Prediction and Analysis 🎉

![Diabetes Health Prediction](https://miro.medium.com/v2/resize:fit:828/format:webp/1*KkQbSEI9sT44_yxR9vscJA.gif)

---

Welcome to the **Diabetes Health Prediction and Analysis** project! This repository contains an end-to-end pipeline for predicting a diabetes diagnosis with machine learning models (Logistic Regression, Random Forest, and XGBoost), together with exploratory data analysis and feature engineering.

## 🚀 Project Overview

This project aims to provide a thorough analysis of diabetes-related health data, develop predictive models, and evaluate their performance. The key components of the project include:

- 📊 Data Preprocessing
- 🔍 Exploratory Data Analysis (EDA)
- 🛠️ Feature Engineering
- 🧠 Model Training
- 📈 Model Evaluation
- 📑 Comprehensive Reports

## 📂 Project Structure

Here's an overview of the project directory structure:

```plaintext
Diabetes_Health_Prediction_and_Analysis/
├── data/
│   ├── raw/
│   │   └── diabetes_data.csv
│   └── processed/
│       ├── X_train.csv
│       ├── X_train_engineered.csv
│       ├── X_test.csv
│       ├── X_test_engineered.csv
│       ├── y_train.csv
│       └── y_test.csv
├── app/
│   ├── app.py
│   ├── templates/
│   │   └── index.html
│   └── static/
│       └── styles.css
├── models/
│   ├── logistic_regression.pkl
│   ├── random_forest.pkl
│   └── xgboost.pkl
├── notebooks/
│   └── exploratory_data_analysis.ipynb
├── scripts/
│   ├── plots/
│   ├── reports/
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   ├── model_training.py
│   ├── model_evaluation.py
│   └── model_performance_report.py
├── tests/
│   ├── models/
│   ├── test_data_preprocessing.py
│   ├── test_feature_engineering.py
│   └── test_model_training.py
├── requirements.txt
└── README.md
```

## 🔧 Setup and Installation

To get started with this project, follow the steps below:

1. **Clone the repository:**

```sh
git clone https://github.com/ThecoderPinar/Diabetes_Health_Prediction_and_Analysis.git
cd Diabetes_Health_Prediction_and_Analysis
```

2. **Create and activate a virtual environment:**

```sh
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```

3. **Install the required packages:**

```sh
pip install -r requirements.txt
```

4. **Run the data preprocessing script:**

```sh
python scripts/data_preprocessing.py
```

5. **Run the feature engineering script:**

```sh
python scripts/feature_engineering.py
```

6. **Train the models:**

```sh
python scripts/model_training.py
```

7. **Evaluate the models:**

```sh
python scripts/model_evaluation.py
```

8. **Generate comprehensive model performance reports:**

```sh
python scripts/comprehensive_model_report.py
```
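
Steps 4 to 7 can also be chained from a single Python entry point. The following is a minimal sketch, not part of the repository: the names `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical, and the script paths are copied from the steps above, so adjust them if your checkout differs.

```python
import subprocess
import sys
from pathlib import Path

# Script order taken from steps 4-7 above; paths are relative to the repo root.
PIPELINE_STEPS = [
    "scripts/data_preprocessing.py",
    "scripts/feature_engineering.py",
    "scripts/model_training.py",
    "scripts/model_evaluation.py",
]


def build_commands(steps=PIPELINE_STEPS, python=sys.executable):
    """Return one argv list per step, using the current interpreter by default."""
    return [[python, step] for step in steps]


def run_pipeline(repo_root="."):
    """Run each step in order and stop at the first failure."""
    root = Path(repo_root)
    for command in build_commands():
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, cwd=root, check=True)
```

Calling `run_pipeline()` from the repository root, with the virtual environment active, runs the four scripts in sequence. The report script from step 8 can be appended to the list in the same way.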

## 🚀 Usage

- **Exploratory Data Analysis**: Check the `notebooks/exploratory_data_analysis.ipynb` notebook for detailed data analysis and visualizations.
- **Scripts**: All scripts for data preprocessing, feature engineering, model training, and evaluation are located in the `scripts/` directory.
- **Tests**: To ensure code quality and correctness, tests are included in the `tests/` directory. Run them with `pytest`.

## 📊 Models

The following models are trained and evaluated in this project:

---

### Logistic Regression

#### ROC Curve:
![Logistic Regression ROC Curve](/scripts/plots/Logistic%20Regression_roc_curve.png)

*The ROC curve illustrates the true positive rate (sensitivity) versus the false positive rate (1-specificity) for different threshold settings. A higher area under the curve (AUC) indicates better model performance.*
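
To make the AUC concrete, the sketch below computes it from scratch using its pairwise interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not project code or project data; in practice, scikit-learn's `roc_auc_score` does the same job.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy data for illustration only (not taken from the project dataset).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(f"AUC = {roc_auc(y_true, y_score):.2f}")  # prints: AUC = 0.75
```

Three of the four positive/negative pairs are ordered correctly here, hence 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.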

#### Confusion Matrix:
![Logistic Regression Confusion Matrix](/scripts/plots/Logistic%20Regression_confusion_matrix.png)

*The confusion matrix provides a summary of the prediction results on the classification problem. It shows the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.*
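
As a small illustration of how these four counts arise, the sketch below tallies them from a list of actual and predicted labels. The helper `confusion_counts` and the toy labels are hypothetical; the `[[TN, FP], [FN, TP]]` layout is chosen to match the convention used by scikit-learn's `confusion_matrix` for binary labels.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels (1 = positive, 0 = negative).

    Returns the counts laid out as [[TN, FP], [FN, TP]].
    """
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # positive case, correctly flagged
        elif actual == 0 and predicted == 0:
            tn += 1  # negative case, correctly cleared
        elif actual == 0 and predicted == 1:
            fp += 1  # negative case, wrongly flagged
        else:
            fn += 1  # positive case, missed
    return [[tn, fp], [fn, tp]]


# Toy labels for illustration only (not taken from the project dataset).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # prints: [[3, 1], [1, 3]]
```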

---

### Random Forest

#### ROC Curve:
![Random Forest ROC Curve](/scripts/plots/Random%20Forest_roc_curve.png)

*The ROC curve illustrates the true positive rate (sensitivity) versus the false positive rate (1-specificity) for different threshold settings. A higher area under the curve (AUC) indicates better model performance.*

#### Confusion Matrix:
![Random Forest Confusion Matrix](/scripts/plots/Random%20Forest_confusion_matrix.png)

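*The confusion matrix provides a summary of the prediction results on the classification problem. It shows the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.*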

## 🎯 Performance Metrics

The performance of the models is evaluated using the following metrics:

- **Accuracy**
- **Precision**
- **Recall**
- **F1 Score**
- **ROC AUC Score**
- **Confusion Matrix**

### Logistic Regression

- **Accuracy:** 78.99%
- **Precision:** 73.19%
- **Recall:** 70.63%
- **F1 Score:** 71.89%
- **ROC AUC:** 83.86%

**Confusion Matrix:**
```plaintext
[[196 37]
[ 42 101]]
```
Model file:
```sh
models/logistic_regression.pkl
```
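
A saved model can be reloaded for inference roughly as follows. This is a sketch under assumptions: the `.pkl` extension suggests the standard `pickle` format, but the README does not say how the files were written, and `load_model` is a hypothetical helper. Unpickling a scikit-learn or XGBoost model also requires that library to be installed, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Load a pickled model object from disk.

    Assumes the file was written with the standard `pickle` module; if the
    project used `joblib.dump`, `joblib.load(path)` would be the counterpart.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Hypothetical usage, run from the repository root:
# model = load_model("models/logistic_regression.pkl")
# predictions = model.predict(features)  # `features` must match the training columns
```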

### Random Forest

- **Accuracy:** 91.22%
- **Precision:** 94.35%
- **Recall:** 81.82%
- **F1 Score:** 87.64%
- **ROC AUC:** 97.69%

**Confusion Matrix:**
```plaintext
[[226 7]
[ 26 117]]
```
Model file:
```sh
models/random_forest.pkl
```
##### Explanations:

1. [x] **_Accuracy:_** The ratio of correctly predicted instances to the total instances.
2. [x] **_Precision:_** The ratio of true positive predictions to the total predicted positives. It measures the accuracy of positive predictions.
3. [x] **_Recall:_** The ratio of true positive predictions to the actual positives. It measures the model's ability to identify positive instances.
4. [x] **_F1 Score:_** The harmonic mean of precision and recall. It provides a balance between precision and recall.
5. [x] **_ROC AUC:_** The area under the ROC curve. It summarizes the model's ability to distinguish between classes.

**Confusion Matrix:**

* True Positive (TP): 117 - The number of actual positive cases correctly identified by the model.
* True Negative (TN): 226 - The number of actual negative cases correctly identified by the model.
* False Positive (FP): 7 - The number of actual negative cases incorrectly identified as positive by the model.
* False Negative (FN): 26 - The number of actual positive cases incorrectly identified as negative by the model.

### XGBoost

- **Accuracy:** 91.76%
- **Precision:** 93.08%
- **Recall:** 84.62%
- **F1 Score:** 88.64%
- **ROC AUC:** 98.41%

**Confusion Matrix:**
```plaintext
[[224 9]
[ 22 121]]
```
Model file:
```sh
models/xgboost.pkl
```
##### Explanations:

1. [x] **_Accuracy:_** The ratio of correctly predicted instances to the total instances.
2. [x] **_Precision:_** The ratio of true positive predictions to the total predicted positives. It measures the accuracy of positive predictions.
3. [x] **_Recall:_** The ratio of true positive predictions to the actual positives. It measures the model's ability to identify positive instances.
4. [x] **_F1 Score:_** The harmonic mean of precision and recall. It provides a balance between precision and recall.
5. [x] **_ROC AUC:_** The area under the ROC curve. It summarizes the model's ability to distinguish between classes.

**Confusion Matrix:**

* True Positive (TP): 121 - The number of actual positive cases correctly identified by the model.
* True Negative (TN): 224 - The number of actual negative cases correctly identified by the model.
* False Positive (FP): 9 - The number of actual negative cases incorrectly identified as positive by the model.
* False Negative (FN): 22 - The number of actual positive cases incorrectly identified as negative by the model.
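
The accuracy, precision, recall, and F1 values reported for the three models can be re-derived from their confusion matrices. The sketch below does this with a hypothetical helper, `metrics_from_confusion`, assuming the matrices are laid out as `[[TN, FP], [FN, TP]]`, which is consistent with the TP/TN/FP/FN breakdowns listed in this section. ROC AUC is not included because it depends on the predicted scores, not only on the counts.

```python
def metrics_from_confusion(matrix):
    """Derive accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]].

    No guard against empty classes: a matrix with TP + FP == 0 or
    TP + FN == 0 would raise ZeroDivisionError.
    """
    (tn, fp), (fn, tp) = matrix
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Confusion matrices as reported in this README.
reported = {
    "Logistic Regression": [[196, 37], [42, 101]],
    "Random Forest": [[226, 7], [26, 117]],
    "XGBoost": [[224, 9], [22, 121]],
}

for name, matrix in reported.items():
    metrics = metrics_from_confusion(matrix)
    summary = ", ".join(f"{key}={value:.2%}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

Each matrix sums to 376 test samples, and the printed percentages agree with the figures listed for each model.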

## 📈 Results

Model performance reports and evaluation metrics are generated by the `comprehensive_model_report.py` script and shown in its output.

## 💡 Future Work

- Implement more advanced deep learning models (e.g., Neural Networks, LSTM).
- Perform hyperparameter tuning to optimize model performance.
- Explore feature selection techniques to improve model accuracy.
- Integrate additional health datasets for broader analysis.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Whether it's improving the documentation, adding new features, or fixing bugs, your contributions are highly appreciated. Let's make this project better together! 🚀

### How to Contribute:

1. **Fork the Repository**: Click on the 'Fork' button at the top right corner of this page to create a copy of this repository in your GitHub account.

2. **Clone the Forked Repository**:
```bash
git clone https://github.com/your-username/Diabetes_Health_Prediction_and_Analysis.git
```

3. **Create a New Branch**:
```bash
git checkout -b feature/your-feature-name
```

4. **Make Your Changes**: Implement your feature, bug fix, or improvement.

5. **Commit Your Changes**:
```bash
git commit -m "Add your commit message here"
```

6. **Push to Your Forked Repository**:
```bash
git push origin feature/your-feature-name
```

7. **Open a Pull Request**: Go to the original repository on GitHub and click on the 'New Pull Request' button. Compare changes from your forked repository and submit the pull request.

---

Thank you for your contributions! Together, we can build a more robust and efficient Diabetes Health Prediction and Analysis tool. 🌟

## 📄 License

This project is licensed under the MIT License.

## 📬 Contact

If you have any questions or suggestions, feel free to open an issue or contact me directly. I am always open to feedback and would love to hear from you!

---

### How to Reach Me:

- **Email:** [[email protected]](mailto:[email protected])
- **GitHub Issues:** [Open an Issue](https://github.com/ThecoderPinar/Diabetes_Health_Prediction_and_Analysis/issues)
- **LinkedIn:** [LinkedIn Profile](https://www.linkedin.com/in/piinartp/)

---

Thank you for your interest in the Diabetes Health Prediction and Analysis project! Your feedback and suggestions are invaluable in making this project better and more useful for everyone. 🌟

![Contact Us](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgcxnmPWgrukdZFkZONlQ4vUIKWJakRLZqvQUfzkDUbS2nAbQyIxR23-OwOis99pE6UQSxXmxwwuugHQWmwRFfZdw4QKGnk9S_n4yFrfPFTSbKIL6sKUKTwFUyG-8no5Y_9dCLI0LUJIo/s1600/welovehearingfromu.png!)

---

⭐️ Don't forget to give this project a star if you found it useful! ⭐️