https://github.com/drkbluescience/wids2024_challenge2_metastaticdiagnosisregression

This notebook presents an exploratory data analysis (EDA) and regression modeling approach for the WiDS Datathon 2024 Challenge #2.

Host: GitHub
URL: https://github.com/drkbluescience/wids2024_challenge2_metastaticdiagnosisregression
Owner: drkbluescience
Created: 2024-10-26T13:32:39.000Z (6 months ago)
Default Branch: main
Last Pushed: 2024-11-07T16:13:06.000Z (6 months ago)
Last Synced: 2025-02-11T17:57:19.669Z (3 months ago)
Topics: catboost, data-visualization, ensemble-learning, exploratory-data-analysis, imputation-methods, kfold-cross-validation, machine-learning, metastatic-breast-cancer, regression-models, scikit-learn, tabular-data, women-in-data-science
Language: Jupyter Notebook
Homepage:
Size: 20.4 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# WiDS Datathon 2024 Challenge #2 - Regression of Metastatic Diagnosis Period
## Introduction
This project presents an exploratory data analysis (EDA) and regression modelling approach for the WiDS Datathon 2024 Challenge #2. The objective was to predict the metastatic diagnosis period based on various patient demographic and socio-economic characteristics. The analysis highlights the steps taken to clean, preprocess, and model the data while optimizing feature selection and modelling parameters to improve predictive performance.

## Data Exploration
EDA was conducted to inspect initial data distributions, identify key variables, and examine relationships between features. This included:

- Initial data inspection and analysis of categorical/numerical variables
- Mutual information and pairwise correlation to assess feature relevance
- Investigation of inconsistencies and missing values
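
The feature-relevance step above can be illustrated with a minimal sketch. It is not the notebook's code: the data is synthetic, the column names `signal` and `noise` are hypothetical, and it assumes scikit-learn's `mutual_info_regression` is an acceptable stand-in for whatever mutual information estimator the notebook used.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "signal": rng.normal(size=n),  # feature that drives the target
    "noise": rng.normal(size=n),   # feature unrelated to the target
})
df["target"] = 3.0 * df["signal"] + rng.normal(scale=0.5, size=n)

features = ["signal", "noise"]

# Mutual information can pick up non-linear dependence on the target.
mi = pd.Series(
    mutual_info_regression(df[features], df["target"], random_state=0),
    index=features,
)

# Pearson correlation with the target, as a linear counterpart.
corr = df[features + ["target"]].corr()["target"].drop("target")

print(mi.sort_values(ascending=False))
print(corr)
```

Comparing the two rankings is one way to spot features whose relationship with the target is non-linear: they score well on mutual information but poorly on correlation.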

## Data Cleaning and Imputation
Several imputation strategies were applied to handle missing values, including standard and group-based methods. Grouped imputations, such as mean and mode imputation by patient demographics, helped retain feature relationships while filling gaps in the dataset. Various imputation techniques were tested to identify the most effective approach based on model performance.
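
As an illustration of group-based imputation, the sketch below fills numerical gaps with the group mean and categorical gaps with the group mode. It is a simplified example under stated assumptions, not the notebook's implementation: the helper `impute_by_group` and the columns `patient_race` and `payer_type` are hypothetical, and the toy data is made up.

```python
import numpy as np
import pandas as pd

# Toy data; `patient_race` and `payer_type` are hypothetical column names.
df = pd.DataFrame({
    "patient_race": ["A", "A", "A", "B", "B", "B"],
    "bmi": [22.0, np.nan, 26.0, 30.0, 34.0, np.nan],
    "payer_type": ["X", "X", None, "Y", None, "Y"],
})


def impute_by_group(frame, group_col, num_cols, cat_cols):
    """Fill numeric gaps with the group mean and categorical gaps with the group mode."""
    out = frame.copy()
    for col in num_cols:
        group_mean = out.groupby(group_col)[col].transform("mean")
        # Fall back to the overall mean if a whole group is missing.
        out[col] = out[col].fillna(group_mean).fillna(out[col].mean())
    for col in cat_cols:
        # Most frequent value per group (NaN if the group has no observed value).
        modes = out.groupby(group_col)[col].agg(
            lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
        )
        out[col] = out[col].fillna(out[group_col].map(modes))
    return out


imputed = impute_by_group(df, "patient_race", ["bmi"], ["payer_type"])
print(imputed)
```

Filling within groups keeps each imputed value close to what similar patients look like, which is the sense in which grouped imputation retains feature relationships.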

## Feature Engineering
Additional features were created based on domain knowledge to improve model predictions. New features were generated by grouping variables such as `bmi` and density into clusters, resulting in improved feature representation. Redundant features were removed before imputation, which made it easier to identify the most effective imputation methods.
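
One possible way to derive a cluster feature from a numeric column is sketched below. The README does not say which clustering method was used, so `KMeans` with three clusters is an assumption, as are the `bmi_cluster` column name and the `-1` sentinel for missing values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy values; the last row is missing on purpose.
df = pd.DataFrame({"bmi": [18.0, 19.5, 24.0, 25.0, 31.0, 33.5, np.nan]})

known = df["bmi"].notna()

# Assumption: three clusters; the notebook's actual choice is not stated.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(df.loc[known, ["bmi"]])

# KMeans labels are arbitrary, so relabel clusters by ascending centre
# to obtain an ordinal feature (0 = lowest bmi group).
order = np.argsort(km.cluster_centers_.ravel())
rank_of = np.argsort(order)

df["bmi_cluster"] = -1  # sentinel for rows where bmi is missing
df.loc[known, "bmi_cluster"] = rank_of[labels]

print(df)
```

Rows with a missing value keep the sentinel, so the cluster feature can be built before imputation without discarding them.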

## Modelling Approach
Several regression models were evaluated, with key models optimized using cross-validation and hyperparameter tuning. Steps included:

- Establishing baseline scores and SHAP values for feature importance
- Conducting backward feature selection for optimal feature sets
- Hyperparameter tuning
- Implementing a stacking meta-model approach with ensemble techniques
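
The stacking step can be sketched with scikit-learn's `StackingRegressor`. This is an illustrative example on synthetic data, not the notebook's ensemble: the base models (`GradientBoostingRegressor` and `Ridge`), the `Ridge` meta-model, and all hyperparameters are assumptions, and CatBoost is left out to keep the example dependency-light.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the challenge dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Base models produce out-of-fold predictions (cv=5) that the
# final estimator learns to combine.
stack = StackingRegressor(
    estimators=[
        ("gbr", GradientBoostingRegressor(random_state=0)),
        ("ridge", Ridge(alpha=1.0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_tr, y_tr)

rmse = float(np.sqrt(mean_squared_error(y_te, stack.predict(X_te))))

# Baseline score: always predict the training mean.
baseline_rmse = float(np.sqrt(np.mean((y_te - y_tr.mean()) ** 2)))

print(f"stacked RMSE: {rmse:.3f}  baseline RMSE: {baseline_rmse:.3f}")
```

Comparing against a constant-prediction baseline mirrors the first step in the list: a stacked model is only worth its complexity if it clearly beats the baseline score.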

## Results

The modelling results highlighted the effectiveness of using CatBoost with tailored imputation techniques and StratifiedKFold validation.

Group-based imputations combined with specific models yielded the best performance. CatBoost, with Constant Categorical imputation for categorical features and No Numerical Imputation, achieved the lowest RMSE score of **80.225** using a 9-fold StratifiedKFold.

Additionally, the average RMSE for the 5th and 7th folds was **80.154**, marking the best scores across the private validation.

The second-best score of **80.182** was obtained by using Constant Categorical imputation for categorical features and KNN for numerical features.

Using a 9-fold StratifiedKFold stratified on `breast_cancer_diagnosis_desc` further enhanced the model’s ability to capture categorical group structures, resulting in these optimal RMSE scores.
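
For a regression target, `StratifiedKFold` can be driven by a categorical column instead of the target itself. The sketch below shows that pattern under stated assumptions: the data is synthetic, the category values and the features `x1` and `x2` are hypothetical, and `GradientBoostingRegressor` stands in for CatBoost.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import StratifiedKFold

# Synthetic data; only the stratification column name comes from the README.
rng = np.random.default_rng(0)
n = 450
df = pd.DataFrame({
    "breast_cancer_diagnosis_desc": rng.choice(
        ["desc_a", "desc_b", "desc_c"], size=n, p=[0.6, 0.3, 0.1]
    ),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["target"] = 50 + 20 * df["x1"] + rng.normal(scale=5, size=n)

X = df[["x1", "x2"]]
y = df["target"]
strata = df["breast_cancer_diagnosis_desc"]

# Passing the categorical column as the second argument of `split` keeps
# each fold's category proportions close to those of the full dataset.
skf = StratifiedKFold(n_splits=9, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, valid_idx in skf.split(X, strata):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[valid_idx])
    fold_rmse.append(float(np.sqrt(np.mean((y.iloc[valid_idx] - pred) ** 2))))

mean_rmse = float(np.mean(fold_rmse))
print(f"mean RMSE over {len(fold_rmse)} folds: {mean_rmse:.3f}")
```

Averaging the per-fold RMSE gives a single cross-validated score that can be compared across model and imputation combinations.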

Feature selection based on SHAP values improved modelling performance by helping CatBoost identify the most predictive variables.

The notebook table presents detailed RMSE scores for each model and imputation combination, with GradientBoosting also showing promising results when using group-based imputation strategies.

## Conclusion
This study underscores the significance of selecting appropriate imputation techniques and modelling approaches for predicting metastatic diagnosis periods. Combining standard and group-based imputations was particularly effective for handling datasets with diverse missing value patterns.

CatBoost emerged as the top-performing model, particularly due to its compatibility with categorical data and its ability to work well with features selected through SHAP values.

The findings demonstrate that structured feature selection and stratified grouping improve predictive accuracy in healthcare-related regression tasks by capturing meaningful relationships within the data, especially for tree-based models.
