https://github.com/ridwansharkar/bone-marrow-transplant-study

⚕Rutgers University - NB Fall 2024 | ST486 Applied Statistics
bone-marrow machine-learning stem-cell-transplant uci-dataset


# UC-Irvine Bone Marrow Transplant Study
“This dataset describes pediatric patients with several hematologic diseases, who were subject to the unmanipulated allogeneic unrelated donor hematopoietic stem cell transplantation.” (UC Irvine).

## 1. Overview
CD34 is a surface marker of hematopoietic stem and progenitor cells (HSCs), whose primary roles are self-renewal and the production of mature blood cells: erythrocytes, platelets, and leukocytes (including lymphocytes). As the source of all blood lineages, CD34+ cells are critical in hematopoietic stem cell transplantation (HSCT), where they drive reconstitution of the immune system after transplantation. In pediatric HSCT studies, CD34+ cell dynamics help evaluate immune recovery and treatment efficacy. This study examines how immune function and CD34+ stem cell dosage relate to transplantation outcomes.

**Dataset Source**: [UCI Bone Marrow Transplant Children](https://archive.ics.uci.edu/dataset/565/bone+marrow+transplant+children) (187 observations x 36 features)

The UCI Bone Marrow dataset was analyzed to predict two key outcomes:

- **Survival Status** (categorical)
- **Survival Time** (continuous)

Our main objective was to determine which variables best predict these outcomes and to compare different supervised learning models in terms of their predictive performance.

---

## 2. Methods

### 2.1 Exploratory Analysis
- Inspected missing data using visualizations (`vis_miss`, `gg_miss_var`).
- Examined correlations (`corrplot`) and outliers using the IQR rule.
- Explored distributions via histograms, density plots, and scatterplot matrices:

**Correlation Matrix:**

[Figure: correlation matrix plot]

**Scatterplot Matrix:**

[Figure: scatterplot matrix]
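
The IQR rule mentioned above flags values that lie more than 1.5 × IQR below the first quartile or above the third quartile. The sketch below illustrates the rule in Python using only the standard library. The README names R functions for its exploratory work, so this is not the authors' code: the function name `iqr_outliers`, the conventional multiplier `k=1.5`, and the sample values are all assumptions made for illustration.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between order statistics,
    # treating the data as the full population range.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up numbers for illustration only (not taken from the dataset).
sample = [10, 12, 13, 14, 15, 16, 18, 95]
print(iqr_outliers(sample))  # [95]
```

Flagged values are candidates for inspection rather than automatic removal; in a small clinical dataset an extreme value may be a genuine observation.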

### 2.2 Modeling Approaches

1. **Survival Status (Classification)**
   - Logistic Regression
   - Random Forest

2. **Survival Time (Regression)**
   - Linear Regression
   - Lasso (L1 Regularization)
   - Random Forest
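
For reference, the lasso adds an L1 penalty to the least-squares objective. One common parameterization is shown below. The symbols ($n$ observations, $p$ predictors, penalty strength $\lambda$) and the $1/(2n)$ scaling are conventions assumed here, since the README does not state the exact form that was fitted.

$$
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
$$

Larger values of $\lambda$ set more coefficients exactly to zero, which is why the lasso can double as a variable-selection method.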

### 2.3 Variable Selection
Three main strategies were used to identify important features:
1. **Stepwise Selection** (using AIC-based forward/backward selection)
2. **Lasso Regularization** (to shrink less important coefficients to zero)
3. **Random Forest Feature Importance** (ranking variables by mean decrease in node impurity)
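
AIC-based stepwise selection accepts or rejects a candidate variable by comparing the AIC of the models with and without it. As a minimal sketch, assuming a least-squares fit with Gaussian errors, AIC can be computed up to an additive constant as `n * ln(RSS / n) + 2k`, where `k` is the number of estimated parameters. The function names and the residual sums of squares below are hypothetical and are not taken from the study.

```python
import math


def gaussian_aic(rss, n, k):
    """AIC of a least-squares fit, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k


def accept_forward_step(rss_small, k_small, rss_large, k_large, n):
    """Accept the larger model only if it lowers AIC."""
    return gaussian_aic(rss_large, n, k_large) < gaussian_aic(rss_small, n, k_small)


# Hypothetical residual sums of squares for n = 100 observations.
# A large drop in RSS justifies the extra parameter:
print(accept_forward_step(500.0, 3, 450.0, 4, n=100))  # True
# A negligible drop in RSS does not:
print(accept_forward_step(500.0, 3, 498.0, 4, n=100))  # False
```

The `2k` term is what stops the procedure from adding every available variable: each extra parameter must reduce the residual error enough to pay for itself.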

---

## 3. Results

### 3.1 Survival Status

- **Most Important Predictors** (overlap of stepwise, Lasso, and Random Forest):
  1. *Relapse*
  2. *extcGvHD*
  3. *Survival Time*
  4. *Txpostrelapse*

- **Model Comparison**
  - **Logistic Regression**: ~94.44% accuracy
  - **Random Forest**: ~94.44% accuracy, both before and after tuning
  - With accuracy tied, **Logistic Regression** remained the preferred model in our comparison, even after Random Forest tuning, because of its consistent predictive performance and interpretability.
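
Accuracy here is the fraction of test cases whose predicted survival status matches the observed one. The sketch below shows the calculation on made-up labels. The README does not state the test-set size or the actual predictions, so the 18 labels and the single misclassification are purely illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative labels only: 18 cases with a single misclassification.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
y_pred = list(y_true)
y_pred[3] = 0  # flip one prediction

print(f"{accuracy(y_true, y_pred):.2%}")  # 94.44%
```

On a small test set accuracy moves in coarse steps, so two models can report identical accuracy while making errors on different cases. A confusion matrix per model would show whether that is the case.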

### 3.2 Survival Time

- **Features Identified by Each Method**:

  1. **Stepwise Selection**
     *Stemcellsource, RecipientABO, Disease, Txpostrelapse, extcGvHD, Recipientage, Rbodymass, survival_status, DosageGroup*

  2. **Lasso**
     *Donorage, CD34kgx10d6, CD3dCD34, CD3dkgx10d8, Rbodymass, ANCrecovery, PLTrecovery, time_to_aGvHD_III_IV, survival_status*

  3. **Random Forest**
     *survival_status, extcGvHD, CD3dCD34, PLTrecovery, CD3dkgx10d8, Donorage, CD34kgx10d6, CMVstatus, Rbodymass, HLAgrI*

- **Model Comparison**
  - **Stepwise Linear Model**
    - R-squared: 0.654
    - RMSE: 494.12
    - AIC: 2814.30

  - **Lasso Model**
    - R-squared: 0.612
    - RMSE: 523.05
    - AIC: 2817.01

  - **Random Forest Model**
    - R-squared: 0.656
    - RMSE: 492.40
    - AIC: 2815.03

- **Best Model**
  - **Random Forest** narrowly outperformed the other models, with the highest R-squared (0.656 vs. 0.654 for the stepwise model) and the lowest RMSE (492.40 vs. 494.12), while the stepwise linear model had the lowest AIC (2814.30). The margins are small, so Random Forest is the best regressor for survival time in this comparison, but only marginally.
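
The two error metrics reported above can be computed as in the sketch below. RMSE is expressed in the same units as the outcome, and R-squared is the share of outcome variance explained by the model. The observed and predicted values are made up for illustration and do not come from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the outcome."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: share of variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(f"RMSE: {rmse(observed, predicted):.1f}")            # RMSE: 10.0
print(f"R-squared: {r_squared(observed, predicted):.3f}")  # R-squared: 0.992
```

For a fair comparison, all models should be scored on the same held-out observations, since R-squared and RMSE computed on training data favor more flexible models.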

---

## 4. Analysis

1. **Survival Status** depends primarily on:
   - *Relapse*, *extcGvHD*, *Survival Time*, and *Txpostrelapse*
   - *CD34+ dosage* did not appear as a crucial determinant of survival status in the final models.
   - **Logistic Regression** proved the most reliable for classification.

2. **Survival Time** is strongly influenced by:
   - *Survival Status*, *extcGvHD*, *CD3dCD34*, *PLTrecovery*, *CD3dkgx10d8*, *Donorage*, *CD34kgx10d6*, *CMVstatus*, *Rbodymass*, and *HLAgrI*
   - **CD34+ dosage** (*CD34kgx10d6*) was selected as a predictor of survival time, but on its own it does not guarantee survival.

3. **Interaction Between Outcomes**
   - *Survival Status* and *Survival Time* are interdependent: each was selected as a predictor of the other.
   - Apart from the outcomes themselves, only *extcGvHD* was shared as a top predictor across both final models.
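
The overlap between the predictor lists reported in Section 3 can be checked with plain set intersections. The sketch below copies the variable names from the lists above; only the Python variable names are introduced here.

```python
# Predictors of survival status (Section 3.1).
status_predictors = {"Relapse", "extcGvHD", "Survival Time", "Txpostrelapse"}

# Predictors of survival time selected by each method (Section 3.2).
time_stepwise = {
    "Stemcellsource", "RecipientABO", "Disease", "Txpostrelapse", "extcGvHD",
    "Recipientage", "Rbodymass", "survival_status", "DosageGroup",
}
time_lasso = {
    "Donorage", "CD34kgx10d6", "CD3dCD34", "CD3dkgx10d8", "Rbodymass",
    "ANCrecovery", "PLTrecovery", "time_to_aGvHD_III_IV", "survival_status",
}
time_rf = {
    "survival_status", "extcGvHD", "CD3dCD34", "PLTrecovery", "CD3dkgx10d8",
    "Donorage", "CD34kgx10d6", "CMVstatus", "Rbodymass", "HLAgrI",
}

# Shared between the status predictors and the final (Random Forest) time model.
print(sorted(status_predictors & time_rf))  # ['extcGvHD']

# Selected by all three methods for survival time.
print(sorted(time_stepwise & time_lasso & time_rf))  # ['Rbodymass', 'survival_status']
```

Only *Rbodymass* and *survival_status* were chosen by all three selection methods for survival time, which shows how much the selected feature set depends on the method.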

---

## 5. Conclusion
- While higher **CD34+ dosage** may prolong survival time, it does not by itself ensure survival.
- For categorical survival status predictions, Logistic Regression is recommended; for continuous survival time predictions, Random Forest performed best, though only by a narrow margin.
- The hypothesis that higher CD34+ cell dosage extends survival time is therefore only partially supported: dosage was selected as a predictor of survival time but was not linked to improved survival status.