https://github.com/duan-lab1/spapheno
Spatially-Informed Phenotype Modeling
- Host: GitHub
- URL: https://github.com/duan-lab1/spapheno
- Owner: Duan-Lab1
- License: other
- Created: 2025-08-04T10:48:32.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2026-04-17T00:58:52.000Z (about 1 month ago)
- Last Synced: 2026-04-17T02:40:33.674Z (about 1 month ago)
- Topics: spatial-transcriptomics
- Language: R
- Homepage: https://duan-lab1.github.io/SpaPheno/
- Size: 12 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
---
output:
md_document:
variant: gfm
html_preview: false
---
# SpaPheno: Linking Spatial Transcriptomics to Clinical Phenotypes with Interpretable Machine Learning
```{r, echo=FALSE, results="hide", message=FALSE}
# Helper that renders a Bioconductor package name as a Markdown link
Biocpkg <- function(pkg) {
  sprintf("[%s](http://bioconductor.org/packages/%s)", pkg, pkg)
}
library(conflicted)
conflicted::conflict_prefer("filter", "dplyr")
knitr::opts_chunk$set(fig.path = "inst/figures/README-")
```
## Overview
Linking spatial transcriptomic patterns to clinically relevant phenotypes is a critical step toward spatially informed precision oncology. Here, we introduce SpaPheno, an interpretable machine learning framework that integrates spatial transcriptomics with clinically annotated bulk RNA-seq data to uncover spatially resolved biomarkers predictive of patient outcomes. Leveraging Elastic Net regression combined with SHAP-based attribution, SpaPheno uniquely identifies spatial features at multiple scales (from tissue regions to specific cell types and individual spatial spots) that are associated with patient survival, tumor stage, and immunotherapy response. We demonstrate the robustness and generalizability of SpaPheno through comprehensive simulations and applications spanning primary liver cancer, clear cell renal cell carcinoma, breast cancer, and melanoma. Across these diverse settings, SpaPheno achieves high predictive accuracy while providing biologically meaningful and spatially precise interpretations. Our framework offers a powerful and extensible approach for translating complex spatial omics data into actionable clinical insights, accelerating the development of precision oncology strategies grounded in tumor spatial architecture.
```{r, echo=FALSE, out.width="80%", out.height="80%", dpi=600, fig.align="center", fig.cap="The Overview of SpaPheno"}
knitr::include_graphics("./man/figures/workflow.jpg")
```
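To make the attribution idea above concrete: for a linear predictor such as an Elastic Net model, the SHAP value of feature `j` for sample `i` reduces (assuming independent features) to `beta_j * (x_ij - mean(x_j))`. The sketch below is not SpaPheno's implementation. It uses ordinary least squares (`lm`) in place of Elastic Net, and the simulated data and feature names (`tumor`, `tcell`, `stroma`) are hypothetical, chosen only to illustrate the formula.

```r
# Illustrative only: OLS stands in for the Elastic Net fit, data are simulated.
set.seed(1)
n <- 200
X <- cbind(tumor = runif(n), tcell = runif(n), stroma = runif(n))  # hypothetical features
y <- 2 * X[, "tumor"] - 1.5 * X[, "tcell"] + rnorm(n, sd = 0.1)

fit  <- lm(y ~ X)
beta <- coef(fit)[-1]              # drop the intercept
names(beta) <- colnames(X)

# Linear SHAP: phi[i, j] = beta[j] * (X[i, j] - mean(X[, j]))
X_centered <- sweep(X, 2, colMeans(X), "-")
phi        <- sweep(X_centered, 2, beta, "*")

# Additivity: baseline + sum of attributions reproduces each prediction
baseline <- mean(fitted(fit))
stopifnot(isTRUE(all.equal(unname(baseline + rowSums(phi)),
                           unname(fitted(fit)))))

# Global importance: mean absolute attribution per feature
importance <- colMeans(abs(phi))
print(round(importance, 3))
```

In this toy setting, features with larger coefficients and wider spread receive larger mean absolute attributions, which is the property that lets per-sample attributions be aggregated at coarser scales.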
## :sunny: Key Features
- **Integration of spatial transcriptomics with clinically annotated bulk RNA-seq data**
- **Multi-scale interpretable machine learning framework**
- **Robust applicability across diverse cancer types and clinical endpoints**
## :arrow_double_down: Installation
```r
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

## Install suggested packages (uncomment as needed)
# BiocManager::install(c(
#   "glmnet",
#   "FNN",
#   "survival"
# ))

## Install SpaDo from GitHub
# install.packages("devtools")
# devtools::install_github("bm2-lab/SpaDo")

## Install SpaPheno from GitHub
# devtools::install_github("Duan-Lab1/SpaPheno", dependencies = c("Depends", "Imports", "LinkingTo"))

## Load SpaPheno and the packages used alongside it
library(SpaPheno)
library(tidyverse)
library(ggplot2)
library(reshape2)
library(stringr)
library(survival)
```
Alternatively, download the pre-built source package from the [GitHub releases page](https://github.com/Duan-Lab1/SpaPheno/releases) and install it locally:
```r
install.packages("SpaPheno_0.0.1.tar.gz", repos = NULL, type = "source")
```
## Quick Start
### Data availability
All data required to run the demos are available in the following Google Drive folder: [SpaPheno Demo Data](https://drive.google.com/drive/folders/1tiSgMjhzvIsirvJwFDIAQIEIhR7qixUW?usp=drive_link).
```
├── BRCAsurvival.RData
├── HCC_stage.RData
├── HCC_survival.RData
├── KIRC_survival.RData
├── Melanoma_ICB.RData
├── Simulation_osmFISH.RData
└── Simulation_STARmap.RData
```
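Because the names of the objects stored in each `.RData` file are not listed here, it can help to load a file into its own environment and inspect what it contains before running an analysis. The following is a minimal sketch; the local folder `SpaPheno_demo` and the helper `load_demo()` are hypothetical and not part of SpaPheno.

```r
# Hypothetical local path: adjust to wherever the demo data were downloaded.
demo_file <- file.path("SpaPheno_demo", "HCC_survival.RData")

# Load an .RData file into a dedicated environment so that its objects
# do not overwrite anything in the global workspace.
load_demo <- function(path) {
  stopifnot(file.exists(path))
  env <- new.env(parent = emptyenv())
  loaded <- load(path, envir = env)  # load() returns the restored object names
  message("Loaded: ", paste(loaded, collapse = ", "))
  env
}

if (file.exists(demo_file)) {
  hcc <- load_demo(demo_file)
  print(ls(hcc))  # list the objects shipped in the demo file
}
```

Individual objects can then be pulled out with `hcc$<name>` once their names are known.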
In addition to the demonstration datasets above, we provide standardized pan-cancer bulk and single-cell reference resources to support cross-cohort and multi-omic applications of SpaPheno in the [SpaPheno TCGA-scRNARef-Dataset](https://drive.google.com/drive/folders/1g8Uj1bSprGMGG0Qitl5TMXVnEr6f3Ix0) folder. The table below maps each TCGA cancer type to the name used in the original single-cell data:
| No. | TCGA Standard Cancer Type | Corresponding Single-Cell Data Original Naming |
|:---:|:--------------------------|:-----------------------------------------------|
| 1 | BLCA | BLCA |
| 2 | BRCA | BRCA / Breast |
| 3 | CESC | CESC |
| 4 | CHOL | CHOL |
| 5 | COAD | CRC |
| 6 | ESCA | ESCA |
| 7 | HNSC | HNSC / HNSCC / Oral |
| 8 | KICH | KICH |
| 9 | KIRC | KIRC |
| 10 | LIHC | LIHC / Liver |
| 11 | LUAD | LUAD |
| 12 | LUSC | LSCC |
| 13 | OV | OV / Ovary |
| 14 | PAAD | PAAD |
| 15 | PRAD | PRAD |
| 16 | SKCM | SKCM |
| 17 | STAD | STAD |
| 18 | THCA | THCA |
| 19 | UCEC | UCEC |
| 20 | UVM | UVM |
- **TCGA Pan-Cancer Bulk Expression and Clinical Data**:
  Processed RNA-seq gene expression profiles (raw counts) and corresponding clinical annotations (including survival outcomes and tumor stage) are available for **20 common cancer types** from The Cancer Genome Atlas (TCGA). The included cancer types are listed in the table above, with unified gene symbols and standardized phenotype annotations to facilitate direct use with SpaPheno:
```
TCGA-n20PanCaner_Dataset
├── TCGA-BLCA
│   ├── BLCA_summary.csv
│   ├── BLCA_expression_by_gene_name.tsv
│   ├── BLCA_expression.tsv
│   ├── BLCA_phenotype_with_survival.csv
│   └── BLCA_phenotype.csv
├── TCGA-BRCA
├── TCGA-CESC
├── TCGA-CHOL
├── TCGA-COAD
├── TCGA-ESCA
├── TCGA-HNSC
├── TCGA-KICH
├── TCGA-KIRC
├── TCGA-LIHC
├── TCGA-LUAD
├── TCGA-LUSC
├── TCGA-OV
├── TCGA-PAAD
├── TCGA-PRAD
├── TCGA-SKCM
├── TCGA-STAD
├── TCGA-THCA
├── TCGA-UCEC
└── TCGA-UVM
```
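The layout above suggests a simple way to locate the files for one cancer type. The sketch below assumes that every `TCGA-<CODE>` folder follows the same file naming as the `TCGA-BLCA` example, which is only shown for BLCA here; the helper `tcga_files()` is hypothetical and the column layout of the files is not assumed.

```r
# Build the expected file paths for one TCGA cancer type.
# Assumption: all folders mirror the naming shown for TCGA-BLCA.
tcga_files <- function(cancer, root = "TCGA-n20PanCaner_Dataset") {
  dir <- file.path(root, paste0("TCGA-", cancer))
  c(
    expression = file.path(dir, paste0(cancer, "_expression_by_gene_name.tsv")),
    phenotype  = file.path(dir, paste0(cancer, "_phenotype_with_survival.csv"))
  )
}

paths <- tcga_files("BLCA")
print(paths)

# Read the tables only if they have been downloaded locally.
if (all(file.exists(paths))) {
  expr  <- read.delim(paths[["expression"]], check.names = FALSE)
  pheno <- read.csv(paths[["phenotype"]], check.names = FALSE)
  str(expr[, 1:min(5, ncol(expr))])  # peek at the first few columns
  str(pheno)
}
```

Inspecting the result with `str()` first is a cautious way to confirm which column holds gene symbols and which columns hold sample identifiers before aligning expression with phenotype data.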
- **TabulaTIME Single-Cell Reference Data**:
Matched single-cell RNA-seq reference datasets for the above cancer types, derived from the TabulaTIME database, are provided as preprocessed `Seurat` objects. These datasets include cell type annotations, enabling direct integration with spatial transcriptomics data for cell type deconvolution and spatially resolved interpretation in SpaPheno.
```
TabulaTIME_scRNA_ref/
├── TabulaTIME_reference_summary.csv
├── BLCA_ref.rds
├── BRCA_ref.rds
├── CESC_ref.rds
├── CHOL_ref.rds
├── CRC-COAD_ref.rds
├── ESCA_ref.rds
├── HNSC_ref.rds
├── KICH_ref.rds
├── KIRC_ref.rds
├── LIHC_ref.rds
├── LSCC-LUSC_ref.rds
├── LUAD_ref.rds
├── OV_ref.rds
├── PAAD_ref.rds
├── PRAD_ref.rds
├── SKCM_ref.rds
├── STAD_ref.rds
├── THCA_ref.rds
├── UCEC_ref.rds
└── UVM_ref.rds
```
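Two reference files in the listing above carry a combined prefix (`CRC-COAD`, `LSCC-LUSC`) rather than the plain TCGA code. A small lookup can resolve the file for a given TCGA cancer type; the helper `tabulatime_ref_path()` below is a hypothetical convenience, not a SpaPheno function, and reading the object is guarded so the snippet runs even without the data.

```r
# TCGA codes whose reference file uses a combined prefix in the listing above.
ref_prefix <- c(COAD = "CRC-COAD", LUSC = "LSCC-LUSC")

# Resolve the reference .rds path for a TCGA cancer code.
tabulatime_ref_path <- function(cancer, root = "TabulaTIME_scRNA_ref") {
  prefix <- if (cancer %in% names(ref_prefix)) ref_prefix[[cancer]] else cancer
  file.path(root, paste0(prefix, "_ref.rds"))
}

ref_file <- tabulatime_ref_path("COAD")
print(ref_file)

# The files are described as preprocessed Seurat objects, so the Seurat
# package should be installed before working with the loaded reference.
if (file.exists(ref_file)) {
  ref <- readRDS(ref_file)
  print(class(ref))
}
```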
### Deconvolution Strategy
To enable consistent and comparable phenotype association analysis across data types, SpaPheno performs **cell-type deconvolution** on both bulk RNA-seq and spatial transcriptomics (ST) data using a shared single-cell RNA-seq reference dataset.
In the current implementation, **cell2location** is used to estimate cell-type abundance profiles, ensuring that downstream phenotype modeling is built on unified, biologically interpretable features.
> ### Parameter Selection for cell2location
>
> When performing deconvolution with cell2location, two key parameters should be carefully adjusted based on the input data modality:
>
> #### 1. N_cells_per_location
>
> This parameter specifies the expected number of cells contributing to each measured profile.
>
> - **Spatial transcriptomics (e.g., 10x Visium)**
>
> Each spot captures a mixture of multiple cells.
>
> A reasonable range is:
>
> `N_cells_per_location = 10–30`
>
> Default setting in SpaPheno:
>
> `N_cells_per_location = 20`
>
> - **Bulk RNA-seq**
>
> Each sample represents a large aggregate of cells.
>
> Following cell2location recommendations, a moderate-to-large value is used:
>
> `N_cells_per_location = 1–100`
>
> Default setting in SpaPheno:
>
> `N_cells_per_location = 100`
>
> #### 2. detection_alpha
>
> This parameter controls regularization strength for per-sample normalization, accounting for technical variation in RNA detection efficiency.
>
> - **Lower values (e.g., 20)**
>
> → Stronger normalization and adaptation to technical noise
>
> → More suitable for **spatial transcriptomics**, which typically exhibits higher technical heterogeneity
>
> - **Higher values (e.g., 200)**
>
> → Weaker normalization, assuming more stable detection sensitivity
>
> → More suitable for **bulk RNA-seq**, where technical variation is relatively modest
>
> Default settings used in SpaPheno:
>
> - Visium spatial transcriptomics: `detection_alpha = 20`
> - Bulk RNA-seq: `detection_alpha = 200`
>
> ### Practical Recommendations
>
> Parameter choice should reflect both biological structure and technical characteristics:
>
> - **Spot-based spatial data**
>
> → Use relatively low `N_cells_per_location`
>
> → Use moderate or low `detection_alpha`
>
> - **Bulk or bulk-like profiling data**
>
> → Use higher `N_cells_per_location`
>
> → Use higher `detection_alpha`
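The defaults listed above can be kept in one place so that the same values are passed consistently to whatever script runs cell2location. The sketch below only records and overrides the two parameters; `c2l_defaults` and `c2l_params()` are hypothetical names and do not call cell2location itself.

```r
# Default cell2location settings per data modality, as listed above.
c2l_defaults <- list(
  spatial = list(N_cells_per_location = 20,  detection_alpha = 20),
  bulk    = list(N_cells_per_location = 100, detection_alpha = 200)
)

# Return the defaults for a modality, optionally overriding single values.
c2l_params <- function(modality = c("spatial", "bulk"), ...) {
  modality <- match.arg(modality)
  utils::modifyList(c2l_defaults[[modality]], list(...))
}

str(c2l_params("bulk"))
# Denser tissue: raise the expected cell count per spot within the 10-30 range
str(c2l_params("spatial", N_cells_per_location = 30))
```

Keeping overrides explicit in the call makes it easy to see when an analysis departs from the defaults.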
### Tutorial
For more information and documentation, please visit the **[SpaPheno website](https://duan-lab1.github.io/SpaPheno/)**.
## :book: Vignette
Run the following command and choose the `html` version for more details.
```r
utils::browseVignettes(package = "SpaPheno")
```
## :sparkling_heart: Contributing
Contributions and comments are welcome; please file them
[here](https://github.com/Duan-Lab1/SpaPheno/issues).
## :trophy: Acknowledgement
Thanks to all the developers of the methods integrated into **SpaPheno**.
## :eight_pointed_black_star: Citation
If **SpaPheno** is useful in your work, please cite it using `citation("SpaPheno")`, or cite the paper directly: **Duan, B., Cheng, X. & Zou, H. SpaPheno: linking spatial transcriptomics to clinical phenotypes with interpretable machine learning. Genome Med (2026). https://doi.org/10.1186/s13073-026-01645-7**
## :writing_hand: Authors
+ [Bin Duan](mailto:binduan@sjtu.edu.cn)
+ [Hua Zou](mailto:zouhua1@outlook.com)