https://github.com/ridwansharkar/bone-marrow-transplant-study
⚕Rutgers University - NB Fall 2024 | ST486 Applied Statistics
- Host: GitHub
- URL: https://github.com/ridwansharkar/bone-marrow-transplant-study
- Owner: RidwanSharkar
- Created: 2024-12-19T04:57:24.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-12-29T05:31:55.000Z (5 months ago)
- Last Synced: 2025-02-18T12:54:50.023Z (3 months ago)
- Topics: bone-marrow, machine-learning, stem-cell-transplant, uci-dataset
- Language: R
- Homepage:
- Size: 52.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# UC-Irvine Bone Marrow Transplant Study
“This dataset describes pediatric patients with several hematologic diseases, who were subject to the unmanipulated allogeneic unrelated donor hematopoietic stem cell transplantation.” (UC Irvine)

## 1. Overview
CD34+ cells, a population that includes hematopoietic stem cells (HSCs), self-renew and produce mature blood cells, including erythrocytes, leukocytes (among them lymphocytes), and platelets. As the source of all blood lineages, CD34+ cells are critical in hematopoietic stem cell transplantation (HSCT) because they shape the immune environment after transplantation. In pediatric HSCT studies, CD34+ cell dynamics help evaluate immune recovery and treatment efficacy; this study aims to highlight the relationship between immune function and CD34+ stem cell transplantation outcomes.

**Dataset Source**: [UCI Bone Marrow Transplant Children](https://archive.ics.uci.edu/dataset/565/bone+marrow+transplant+children) (187 observations x 36 features)

The UCI Bone Marrow dataset was analyzed to predict two key outcomes:
- **Survival Status** (categorical)
- **Survival Time** (continuous)

Our main objective was to determine which variables best predict these outcomes and to compare different supervised learning models in terms of their predictive performance.
---
## 2. Methods
### 2.1 Exploratory Analysis
- Inspected missing data using visualizations (`vis_miss`, `gg_miss_var`).
- Examined correlations (`corrplot`) and outliers using the IQR rule.
- Explored distributions via histograms, density plots, and scatterplot matrices.

**Correlation Matrix:** [figure]

**Scatterplot Matrix:** [figure]
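
The missing-data count and the IQR outlier rule listed above can be sketched in base R. This is a minimal illustration, not the repository's code: the data frame `toy`, its column names, and its values are made up, and the helper `iqr_outliers` is hypothetical.

```r
# Toy data standing in for the transplant table; column names and values
# are hypothetical and exist only for illustration.
toy <- data.frame(
  donor_age = c(22, 35, 41, 29, 33, 38, 90),
  cd34_dose = c(7.2, NA, 11.8, 9.4, 10.1, 8.7, 9.9)
)

# Number of missing values per column.
missing_counts <- colSums(is.na(toy))

# IQR rule: flag values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
iqr_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

outlier_flags <- iqr_outliers(toy$donor_age)
print(missing_counts)
print(toy$donor_age[outlier_flags])
```

Flagged values are only candidates for review; in a clinical dataset an extreme value may be a genuine observation rather than an error.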
### 2.2 Modeling Approaches
1. **Survival Status (Classification)**
   - Logistic Regression
   - Random Forest
2. **Survival Time (Regression)**
   - Linear Regression
   - Lasso (L1 Regularization)
   - Random Forest

### 2.3 Variable Selection
Three main strategies were used to identify important features:
1. **Stepwise Selection** (using AIC-based forward/backward selection)
2. **Lasso Regularization** (to shrink less important coefficients to zero)
3. **Random Forest Feature Importance** (ranking variables by mean decrease in node impurity)

---
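
As an illustration of the stepwise strategy in Section 2.3, the sketch below runs an AIC-based bidirectional search with `step()` from base R on a logistic regression. It is run on simulated data, and the variable names (`relapse`, `gvhd`, `noise1`, `noise2`) are hypothetical stand-ins. The README does not state which function the study used, so treat this as one plausible way to do it.

```r
set.seed(486)  # arbitrary seed so the simulated example is repeatable
n <- 200
sim <- data.frame(
  relapse = rbinom(n, 1, 0.3),
  gvhd    = rbinom(n, 1, 0.4),
  noise1  = rnorm(n),
  noise2  = rnorm(n)
)
# Simulated binary outcome that depends on relapse and gvhd only.
sim$status <- rbinom(n, 1, plogis(-1 + 2.5 * sim$relapse - 1.5 * sim$gvhd))

# Start from the model containing every candidate predictor.
full <- glm(status ~ relapse + gvhd + noise1 + noise2,
            data = sim, family = binomial)

# Bidirectional search: terms may be dropped and later re-added,
# and a move is accepted only if it lowers the AIC.
selected <- step(
  full,
  scope = list(lower = ~ 1, upper = ~ relapse + gvhd + noise1 + noise2),
  direction = "both",
  trace = 0
)

# In-sample accuracy at a 0.5 probability cutoff (optimistic; a held-out
# test set gives a fairer estimate).
pred <- as.integer(predict(selected, type = "response") > 0.5)
accuracy <- mean(pred == sim$status)

print(formula(selected))
print(accuracy)
```

Because `step()` only accepts moves that reduce AIC, the selected model's AIC is never higher than that of the starting model.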
## 3. Results
### 3.1 Survival Status
- **Most Important Predictors** (overlap of stepwise, Lasso, and Random Forest):
  1. *Relapse*
  2. *extcGvHD*
  3. *Survival Time*
  4. *Txpostrelapse*
- **Model Comparison**
  - **Logistic Regression**: ~94.44% accuracy
  - **Random Forest**: ~94.44% accuracy (the same after rounding, both before and after tuning)
  - Because the two models tied on accuracy, **Logistic Regression** was preferred for its interpretability; tuning the Random Forest did not change this.

### 3.2 Survival Time
- **Features Identified by Each Method**:
  1. **Stepwise Selection**
     *Stemcellsource, RecipientABO, Disease, Txpostrelapse, extcGvHD, Recipientage, Rbodymass, survival_status, DosageGroup*
  2. **Lasso**
     *Donorage, CD34kgx10d6, CD3dCD34, CD3dkgx10d8, Rbodymass, ANCrecovery, PLTrecovery, time_to_aGvHD_III_IV, survival_status*
  3. **Random Forest**
     *survival_status, extcGvHD, CD3dCD34, PLTrecovery, CD3dkgx10d8, Donorage, CD34kgx10d6, CMVstatus, Rbodymass, HLAgrI*
- **Model Comparison**
  - **Stepwise Linear Model**
    - R-squared: 0.654
    - RMSE: 494.12
    - AIC: 2814.30
  - **Lasso Model**
    - R-squared: 0.612
    - RMSE: 523.05
    - AIC: 2817.01
  - **Random Forest Model**
    - R-squared: 0.656
    - RMSE: 492.40
    - AIC: 2815.03
- **Best Model**
  - **Random Forest** had the highest R-squared (0.656) and the lowest RMSE (492.40), though only narrowly ahead of the stepwise linear model (0.654 and 494.12), which had the lowest AIC (2814.30). On R-squared and RMSE, Random Forest was the best regressor for survival time in this comparison.

---
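
For reference, the three metrics reported in Section 3.2 can be computed from observed and predicted values as sketched below, using a linear model fitted to simulated data. The variable names and coefficients are made up. The README does not say how AIC was obtained for the Lasso and Random Forest fits, so the Gaussian formula shown here is an assumption that matches R's `AIC()` for `lm` objects only.

```r
set.seed(486)  # arbitrary seed so the simulated example is repeatable
n <- 120
sim <- data.frame(
  body_mass = rnorm(n, mean = 35, sd = 10),
  donor_age = rnorm(n, mean = 33, sd = 8)
)
# Simulated survival time in days; the coefficients are arbitrary.
sim$survival_time <- 300 + 12 * sim$body_mass - 4 * sim$donor_age +
  rnorm(n, sd = 150)

fit  <- lm(survival_time ~ body_mass + donor_age, data = sim)
obs  <- sim$survival_time
pred <- fitted(fit)

# Residual and total sums of squares.
rss <- sum((obs - pred)^2)
tss <- sum((obs - mean(obs))^2)

rmse      <- sqrt(rss / n)
r_squared <- 1 - rss / tss

# Gaussian AIC: -2 * logLik + 2 * k, where k counts the regression
# coefficients plus the error variance.
k <- length(coef(fit)) + 1
aic_manual <- n * (log(2 * pi) + log(rss / n) + 1) + 2 * k

print(c(r_squared = r_squared, rmse = rmse, aic = aic_manual))
```

These are in-sample values; comparing models on held-out data would use the same RMSE and R-squared formulas with predictions for the test rows.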
## 4. Analysis
1. **Survival Status** depends primarily on:
   - *Relapse*, *extcGvHD*, *Survival Time*, and *Txpostrelapse*
   - *CD34+ dosage* did not appear as a crucial determinant for survival status in the final models.
   - **Logistic Regression** was the preferred classifier.
2. **Survival Time** is strongly influenced by:
   - *Survival Status*, *extcGvHD*, *CD3dCD34*, *PLTrecovery*, *CD3dkgx10d8*, *Donorage*, *CD34kgx10d6*, *CMVstatus*, *Rbodymass*, and *HLAgrI*
   - **CD34+ dosage** (*CD34kgx10d6*) was selected by both Lasso and Random Forest as a predictor of survival time, but does not alone guarantee survival.
3. **Interaction Between Outcomes**
   - *Survival Status* and *Survival Time* are interdependent: each appears among the top predictors of the other.
   - Only *extcGvHD* was shared as a top predictor across both final models.

---
## 5. Conclusion
- While higher **CD34+ dosage** may prolong survival time, it does not by itself determine survival status.
- For categorical survival status predictions, Logistic Regression is recommended, while for continuous survival time predictions, Random Forest is most effective.
- The hypothesis that higher CD34+ cell dosage extends survival time is partially supported by the results, but dosage was not conclusively linked to improved survival status.