# Predicting Patient No-Shows in Healthcare Appointments

## Business Problem

A significant issue in medical settings is patients failing to attend scheduled doctor appointments despite receiving instructions (no-shows). Our client, a medical ERP solutions provider, seeks to tackle this by introducing a machine learning model into their software. The model predicts patient attendance, enabling medical providers to optimize appointment management.



## Dataset Description

The dataset used in this project, from [Kaggle](https://www.kaggle.com/datasets/joniarroba/noshowappointments), comprises appointment records from medical institutions, capturing attributes related to patients and their appointments. Key features include:

- **Patient demographics**: Age and gender.

- **Health characteristics**: The presence of conditions such as diabetes or hypertension.

- **Appointment-specific details**: Scheduled and appointment dates, and whether the patient received a reminder SMS.

- **Target**: Binary indicator representing whether a patient was a no-show or attended their appointment.

| No | Column Name | Description |
| --- | --- | --- |
| 01 | PatientId | Identification of a patient |
| 02 | AppointmentID | Identification of each appointment |
| 03 | Gender | Male or Female. Females account for the larger share of appointments in the dataset. |
| 04 | ScheduledDay | The day the appointment was scheduled (registered), which is on or before the appointment day. |
| 05 | AppointmentDay | The day of the actual appointment, when the patient is expected to visit the doctor. |
| 06 | Age | The patient's age. |
| 07 | Neighbourhood | Where the appointment takes place. |
| 08 | Scholarship | True or False. Indicates whether the patient is enrolled in the Brazilian welfare program Bolsa Família. |
| 09 | Hipertension | True or False. Indicates if the patient has hypertension. |
| 10 | Diabetes | True or False. Indicates if the patient has diabetes. |
| 11 | Alcoholism | True or False. Indicates if the patient is an alcoholic. |
| 12 | Handcap | True or False. Indicates if the patient is handicapped. |
| 13 | SMS_received | True or False. Indicates whether one or more reminder messages were sent to the patient. |
| 14 | No-show | True or False (Target variable). Indicates if the patient missed their appointment. |
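
A quick way to inspect the dataset described above is with pandas. The snippet below is a minimal sketch, assuming the Kaggle CSV has been downloaded into `data/input/` under its original file name (both the location and the file name are assumptions; adjust the path to match your copy):

```python
import pandas as pd

# Path and file name are assumptions; point this at your downloaded copy of the Kaggle CSV.
df = pd.read_csv(
    "data/input/KaggleV2-May-2016.csv",
    parse_dates=["ScheduledDay", "AppointmentDay"],
)

print(df.shape)                                     # expect 14 columns, as listed above
print(df["No-show"].value_counts(normalize=True))   # class balance of the target ("Yes" = missed)
print(df["Age"].describe())                         # age distribution, useful for spotting invalid values
```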

## Technical Highlights

Solving the challenge of predicting patient no-shows involved a comprehensive workflow, covering both the development of a predictive model and its practical application within an existing system. Here's an overview of the approach taken:

- **Data Storage and Initial Analysis**: Utilized Snowflake for secure data storage and conducted exploratory data analysis (EDA) to understand the dataset's characteristics and identify potential predictive features.

- **Data Loading and Preprocessing**: Initial steps involved loading the data from Snowflake, followed by preprocessing tasks such as handling missing values, encoding categorical variables, and normalizing features to prepare the dataset for modeling.

- **Feature Engineering and Selection**: Engineered meaningful features from the raw data, such as the time interval between scheduling and appointment dates (illustrated in the sketches after this list). Features were selected based on their importance in logistic regression and decision tree models, retaining those with at least 1% importance from either model.

- **Dataset and Model Selection**: Various machine learning algorithms, including Logistic Regression, Decision Tree, Random Forest, and XGBoost, were evaluated across different datasets (original, upsampled, downsampled, and SMOTE-enhanced) to identify the best-performing model. XGBoost emerged as the optimal choice, particularly when trained on the original dataset, after adjusting the `scale_pos_weight` parameter to address class imbalance effectively.

- **API Development for Model Deployment**: Created an API for the model, facilitating its integration into the client's ERP system. This step involved deploying the model to AWS SageMaker, setting up an AWS Lambda function for model invocation, and configuring an Amazon API Gateway to expose the model as a RESTful service.

- **Testing and Validation**: Conducted thorough testing of the deployed model using Postman, validating its functionality and ensuring its readiness for real-world application.
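
The sketches below illustrate two pieces of this workflow. They are illustrative only and are not the project's actual scripts; names, table identifiers, and hyperparameters are placeholders. First, pulling the appointment records from Snowflake into pandas with `snowflake-connector-python` (the table name and connection parameters are assumptions; the project keeps real credentials in `src/snowflake_creds.py`):

```python
import snowflake.connector

# Connection parameters are placeholders; real values live in src/snowflake_creds.py.
conn = snowflake.connector.connect(
    user="<user>", password="<password>", account="<account>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)

# Pull the appointment records into pandas for EDA and preprocessing.
# The table name is an assumption for this sketch.
cur = conn.cursor()
cur.execute("SELECT * FROM APPOINTMENTS")
df = cur.fetch_pandas_all()   # requires snowflake-connector-python[pandas]
conn.close()
```

Second, the lead-time feature and the `scale_pos_weight` adjustment used to counter class imbalance. This is a simplified sketch, assuming the date columns are already parsed as datetimes and the target uses Kaggle's "Yes"/"No" encoding; the feature subset and hyperparameters are placeholders, not the tuned values from Hyperopt:

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score


def add_lead_time(df: pd.DataFrame) -> pd.DataFrame:
    """Add the number of days between scheduling and the appointment itself."""
    out = df.copy()
    out["LeadTimeDays"] = (
        out["AppointmentDay"].dt.normalize() - out["ScheduledDay"].dt.normalize()
    ).dt.days
    return out


def train_sketch(df: pd.DataFrame) -> xgb.XGBClassifier:
    df = add_lead_time(df)
    y = (df["No-show"] == "Yes").astype(int)   # 1 = missed appointment
    X = df[["Age", "Scholarship", "Hipertension", "Diabetes",
            "Alcoholism", "Handcap", "SMS_received", "LeadTimeDays"]]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Weight the minority (no-show) class by the negative/positive ratio
    # instead of resampling the training data.
    scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
    model = xgb.XGBClassifier(
        n_estimators=200, max_depth=5, learning_rate=0.1,   # placeholder values
        scale_pos_weight=scale_pos_weight, eval_metric="logloss")
    model.fit(X_train, y_train)
    print("F1 on held-out data:", f1_score(y_test, model.predict(X_test)))
    return model
```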

## Project Structure

The project is organized into several directories and files, each serving a specific purpose in the development, deployment, and documentation of the machine learning model.

Below is an overview of the project structure and the contents of each component:

```markdown
Medical-Appointment-No-Show-Prediction
├── data/
│   ├── input/                  # Raw data files.
│   ├── processed/              # Data files that have been cleaned and preprocessed.
│   ├── output/                 # Output data files, including model predictions.
│   ├── features/               # Contains the important features used for filtering the data.
│   └── hyperparameters/        # Contains the best hyperparameters obtained from Hyperopt tuning.
├── src/
│   ├── data_loader.py          # Script for loading and preprocessing data.
│   ├── preprocessing.py        # Script containing data preprocessing functions.
│   ├── feature_engineering.py  # Script for feature engineering tasks.
│   ├── modeling.py             # Contains model training, evaluation, and prediction scripts.
│   ├── train.py                # Main script for training the model.
│   ├── predict.py              # Script for making predictions using the trained model.
│   ├── requirements.txt        # Lists the Python dependencies required for the project.
│   └── snowflake_creds.py      # Contains credentials for Snowflake database access.
├── model/                      # Trained model files and artifacts.
├── deployment_assets/          # Files and scripts used for deploying the model.
├── Snowflake_assets/           # Original data file for database creation and SQL queries for exploratory analysis.
├── Project Notebook.ipynb      # Jupyter notebook detailing the model development process.
├── Project Documentation.pdf   # Comprehensive documentation of the project.
└── Model Deployment.ipynb      # Jupyter notebook detailing the model deployment process.
```

## Usage

1. **Clone the Repository**

Clone the project repository to your local machine.

```bash
git clone https://github.com/TimKong21/Medical-Appointment-No-Show-Prediction.git
cd Medical-Appointment-No-Show-Prediction
```

2. **Set Up a Virtual Environment**

Create and activate a virtual environment to manage the project's dependencies.

```bash
# Create a virtual environment
python -m venv env
```
Activate the virtual environment.
```bash
# On Windows
env\Scripts\activate

# On MacOS/Linux
source env/bin/activate
```

3. **Install Dependencies**

Install the required Python dependencies.

```bash
pip install -r src/requirements.txt
```

4. **Model Training**

Train the model and make predictions.

```bash
cd src
python train.py
python predict.py
```

5. **Model Deployment**

To deploy the model to AWS SageMaker and set up the AWS services needed for model invocation and API exposure, follow steps 1 through 6 in `Model Deployment.ipynb`. The notebook walks through deploying the model to SageMaker, creating an AWS Lambda function, and configuring an Amazon API Gateway to expose the model as a RESTful service.
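
As a rough illustration of the Lambda piece of that setup, the handler below forwards a CSV-formatted feature row received from API Gateway to a SageMaker endpoint. It is a sketch only; the endpoint name, environment variable, payload shape, and response format are assumptions and may differ from what the notebook actually configures:

```python
import json
import os

import boto3

# Client for invoking a deployed SageMaker endpoint.
runtime = boto3.client("sagemaker-runtime")

# The endpoint name and environment variable are assumptions for this sketch.
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "no-show-xgboost-endpoint")


def lambda_handler(event, context):
    # With a Lambda proxy integration, API Gateway passes the request body as a string.
    body = json.loads(event["body"])
    payload = body["features"]   # e.g. "62,0,1,0,0,0,1,14": one CSV row of model features

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )
    score = float(response["Body"].read().decode("utf-8"))

    return {
        "statusCode": 200,
        "body": json.dumps({"no_show_probability": score}),
    }
```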

6. **Testing and Validating with Postman**

After deployment, follow step 7 in `Model Deployment.ipynb` to test and validate the deployed model using Postman. This involves sending requests to the model's API endpoint and verifying the responses to ensure it operates as expected.
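
An equivalent check can also be scripted instead of (or alongside) Postman. Below is a minimal sketch with `requests`, assuming the placeholder URL and JSON shape from the Lambda sketch above (substitute your own API Gateway invoke URL and payload):

```python
import requests

# Placeholder; replace with the invoke URL shown in the API Gateway console.
url = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/predict"

payload = {"features": "62,0,1,0,0,0,1,14"}   # illustrative feature row
response = requests.post(url, json=payload, timeout=30)

print(response.status_code)
print(response.json())   # e.g. {"no_show_probability": 0.37}
```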

For a comprehensive understanding of the project, refer to:

- `Project Notebook.ipynb` for a detailed walkthrough of the model development process.

- `Model Deployment.ipynb` for a detailed walkthrough of the model deployment process.

- `Project Documentation.pdf` for comprehensive project documentation.