https://github.com/msk-access/kreview
Advanced cfDNA Fragmentomics Core Evaluation Engine
- Host: GitHub
- URL: https://github.com/msk-access/kreview
- Owner: msk-access
- License: other
- Created: 2026-04-08T15:21:03.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-06-01T21:53:26.000Z (6 days ago)
- Last Synced: 2026-06-01T22:21:25.046Z (6 days ago)
- Language: Python
- Homepage: https://msk-access.github.io/kreview/
- Size: 4.19 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
---
## 🧬 Overview
`kreview` is a production-grade, notebook-first (`nbdev`) evaluation engine designed for high-throughput cancer liquid biopsy fragmentomics feature analysis. Developed at Memorial Sloan Kettering (MSKCC), it processes cohorts containing tens of thousands of samples using an embedded DuckDB query engine with chunked I/O and automatic retry logic.
**[Full Documentation](https://msk-access.github.io/kreview/)**
## Features
- **5-Tier ctDNA Taxonomy**: MSK-IMPACT paired-inference to label `True ctDNA+`, `Possible ctDNA+`, `Possible ctDNA−`, `Healthy Normal`, and `Insufficient Data`. Optional CH hotspot demotion via `--ch-hotspot-maf`.
- **DuckDB Dynamic Data Lake**: In-memory `read_parquet` bindings with chunked I/O and exponential backoff retry. Builds a merged SQL-queryable `kreview_lake.duckdb` on demand.
- **Multi-Model Evaluation**: Logistic Regression, Random Forest, and XGBoost (CPU) plus TabPFN and TabICL (GPU) with Stratified K-Fold CV, SHAP explainability, and subgroup analysis.
- **Feature Selection**: [mRMR](https://github.com/smazzanti/mrmr) (Minimum Redundancy Maximum Relevance) is the default strategy: it iteratively selects features that maximize target relevance while minimizing inter-feature redundancy. The legacy `hybrid_union` strategy (AUC ∪ MI) is also available.
- **Multimodal Stacking**: Cross-evaluator fusion via super-matrix with Mutual Information or [Boruta-SHAP](https://github.com/Ekeany/Boruta-Shap) selection, followed by stacking ensemble + ablation analysis.
- **Interactive Dashboards**: Plotly-native HTML reports with ROC curves, violin plots, SHAP beeswarm/waterfall, mRMR scatter plots, per-cancer-type sensitivity tables, and Decision Curve Analysis.
- **Nextflow HPC Integration**: Decomposed multistage DAG for SLURM-based HPC execution with per-evaluator parallelism, GPU scheduling, and automatic retry logic.
- **26 Built-In Evaluators**: Modular extractors covering fragment sizes (FSC, FSD, FSR), nucleosome protection (WPS, TFBS), cleavage motifs (EndMotif, BreakPointMotif), chromatin accessibility (ATAC), motif divergence (MDS), and orientation (OCF).
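
The chunked I/O with exponential backoff mentioned in the feature list can be pictured with the minimal sketch below. It is not kreview's implementation: `chunked`, `with_backoff`, the retry count, and the delays are illustrative assumptions, and `OSError` stands in for whatever transient failure the real loader retries on.

```python
import random
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_backoff(fn: Callable[[], T], retries: int = 4, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(); on OSError wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except OSError:
            if attempt == retries:
                raise  # out of retries: surface the original error
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")


# Hypothetical file names, processed two at a time.
paths = [f"sample_{i}.parquet" for i in range(5)]
for batch in chunked(paths, size=2):
    with_backoff(lambda: print("loading", list(batch)))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.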
## 🏗️ Pipeline Architecture
```mermaid
graph LR
A[Label] --> B["Extract ×N"]
B --> C[Select]
C --> D["Eval CPU"]
C --> E["Eval GPU"]
C --> F[Fuse]
D --> G[Scoreboard]
E --> G
D --> I["Eval Multimodal"]
E --> I
F --> I
G --> H[Report]
I --> J["Report Multimodal"]
```
The pipeline supports two modes:
| Mode | Command | Use Case |
|------|---------|----------|
| **Monolithic** | `kreview run` | Single-machine, sequential execution |
| **Multistage** | `nextflow run ... -profile iris` | HPC parallelism, per-evaluator scatter |
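
To see which stages of the diagram above could run side by side in multistage mode, the edges can be written down as a dependency map and grouped into waves. This is only an illustrative sketch using the standard-library `graphlib`; the stage names are copied from the diagram, while `DEPENDENCIES` and `execution_waves` are hypothetical names and do not reflect how the Nextflow workflow is actually scheduled.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (edges copied from the diagram).
DEPENDENCIES = {
    "Extract": {"Label"},
    "Select": {"Extract"},
    "Eval CPU": {"Select"},
    "Eval GPU": {"Select"},
    "Fuse": {"Select"},
    "Scoreboard": {"Eval CPU", "Eval GPU"},
    "Eval Multimodal": {"Eval CPU", "Eval GPU", "Fuse"},
    "Report": {"Scoreboard"},
    "Report Multimodal": {"Eval Multimodal"},
}


def execution_waves(deps):
    """Group stages into waves; stages in one wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


for i, wave in enumerate(execution_waves(DEPENDENCIES), start=1):
    print(f"wave {i}: {', '.join(wave)}")
```

In this reading, the CPU evaluation, GPU evaluation, and fusion stages all become ready once selection finishes, which is where per-evaluator parallelism pays off.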
## Quick Start
### Installation
> [!IMPORTANT]
> **Quarto is strictly required** for programmatic dashboard generation. Because `quarto-cli` wrapper packages are unreliable across Python environments, `kreview` does not install Quarto for you; it assumes the Quarto executable is already installed on your OS or in your container.
#### Option 1: Docker (Recommended "Batteries-Included" Method)
The easiest way to run `kreview` without managing external dependencies is to use our pre-built Docker containers (hosted on GHCR). They ship with `Python 3.12`, all ML libraries, and `quarto`:
```bash
# CPU image (~1.5 GB): for all standard pipeline processes
docker pull ghcr.io/msk-access/kreview:latest
# GPU image (~8-10 GB): adds PyTorch, TabPFN, TabICL (requires NVIDIA drivers)
docker pull ghcr.io/msk-access/kreview:latest-gpu
# Run
docker run -v /your/data:/data ghcr.io/msk-access/kreview:latest \
  kreview run --cancer-samplesheet /data/cancer.csv ...
```
#### Option 2: Local Install (Pip)
If you install via pip, you **must install Quarto separately**, for example with your OS package manager:
1. **Install Quarto:** Follow the [official Quarto Installation Guide](https://quarto.org/docs/get-started/) (e.g. `brew install quarto` on macOS).
2. **Install kreview:**
```bash
git clone https://github.com/msk-access/kreview.git
cd kreview
pip install -e . # CPU models only
pip install -e ".[gpu]" # + TabPFN, TabICL (requires CUDA)
```
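
Because Quarto is installed separately from the Python package, it can be worth checking that the executable is visible before running the pipeline. The snippet below is a small standalone sketch, not part of kreview; `find_executable` and `describe_quarto` are hypothetical helper names, and the only external call used is the Quarto CLI's `--version` flag.

```python
import shutil
import subprocess
from typing import Optional


def find_executable(name: str = "quarto") -> Optional[str]:
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def describe_quarto(name: str = "quarto") -> str:
    """Report where the executable lives and which version it announces."""
    path = find_executable(name)
    if path is None:
        return f"{name} not found on PATH; install it before generating dashboards."
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return f"Found {name} {result.stdout.strip()} at {path}"


print(describe_quarto())
```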
### Running the Pipeline
#### Local (Single Machine)
```bash
kreview run \
--cancer-samplesheet "/path/to/cancer/samplesheet.csv" \
--healthy-xs1-samplesheet "/path/to/healthy/xs1/samplesheet.csv" \
--healthy-xs2-samplesheet "/path/to/healthy/xs2/samplesheet.csv" \
--cbioportal-dir "/path/to/cBioPortal_MAF_CNA_SV/" \
--krewlyzer-dir "/path/to/unified_krewlyzer_results" \
--output output/ \
--strategy mrmr \
--top-percentile 10 \
--compute-univariate-auc \
--ch-hotspot-maf "/path/to/ch_hotspots.maf" \
--export-duckdb
```
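
The `--compute-univariate-auc` and `--top-percentile` flags are not described further in this README. As a rough illustration of the general idea only, a univariate AUC scores each feature on its own by how well it separates positives from negatives, and a percentile cut keeps the strongest few. The sketch below assumes binary 0/1 labels; the feature names and helper functions are hypothetical, and this is not kreview's implementation.

```python
from math import ceil
from typing import Dict, List, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as P(positive score > negative score); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_percentile(features: Dict[str, Sequence[float]], labels: Sequence[int],
                   percentile: float = 10.0) -> List[str]:
    """Rank features by |AUC - 0.5| and keep the top `percentile` percent (at least one)."""
    strength = {name: abs(auc(vals, labels) - 0.5) for name, vals in features.items()}
    keep = max(1, ceil(len(strength) * percentile / 100))
    return sorted(strength, key=strength.get, reverse=True)[:keep]


# Toy data with hypothetical feature names: 3 positives followed by 3 negatives.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "short_frag_ratio": [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],  # separates perfectly
    "noise": [0.5, 0.1, 0.9, 0.4, 0.8, 0.2],             # barely informative
    "inverse": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],           # separates, reversed sign
}
print(top_percentile(features, labels, percentile=50))
```

Ranking by distance from 0.5 treats a feature that is consistently lower in positives as just as useful as one that is consistently higher. The pairwise comparison is quadratic in cohort size, so a rank-based formula would be preferable for large cohorts.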
#### HPC (Nextflow + SLURM)
```bash
nextflow run /path/to/kreview/nextflow/main.nf \
--cancer_samplesheet /path/to/cancer.csv \
--healthy_xs1_samplesheet /path/to/healthy_xs1.csv \
--healthy_xs2_samplesheet /path/to/healthy_xs2.csv \
--cbioportal_dir /path/to/cbioportal/ \
--krewlyzer_dir /path/to/manifest.txt \
--outdir /path/to/output/ \
--pipeline_mode multistage \
--run_gpu_eval true \
--gpu_models "tabpfn,tabicl" \
--run_multimodal_eval true \
-profile iris
```
### Dashboard Access
Once finished, open the generated HTML reports:
```bash
open output/reports/ATAC_dashboard.html
```
## 🧪 Feature Selection
| Strategy | Scope | Method | Default |
|----------|-------|--------|---------|
| `mrmr` | Single-evaluator | F-statistic relevance + Pearson redundancy penalty | ✅ |
| `hybrid_union` | Single-evaluator | Top-X% AUC ∪ Top-X% MI | Legacy |
| `mi` | Multimodal | Mutual Information top-K ranking | ✅ |
| `boruta_shap` | Multimodal | SHAP importance vs shadow variables (50 trials) | Optional |
See [Statistical Evaluation](https://msk-access.github.io/kreview/machine-learning/statistical-tests/) for full documentation.
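
The `mrmr` row above pairs an F-statistic relevance term with a Pearson redundancy penalty. The sketch below shows one way such a greedy selection can work in plain Python. It assumes the two terms are combined as a quotient (relevance divided by mean absolute correlation with the already-selected features), which is an assumption here and not confirmed by this README. The feature names and data are hypothetical, and the actual pipeline relies on the linked `mrmr` package, not on this code.

```python
from statistics import fmean
from typing import Dict, List, Sequence


def f_statistic(values: Sequence[float], labels: Sequence[int]) -> float:
    """One-way ANOVA F: between-class variance over within-class variance."""
    groups: Dict[int, List[float]] = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = fmean(values)
    means = {y: fmean(g) for y, g in groups.items()}
    between = sum(len(g) * (means[y] - grand) ** 2 for y, g in groups.items())
    within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)
    between /= len(groups) - 1
    within /= len(values) - len(groups)
    return between / within if within > 0 else float("inf")


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def mrmr(features: Dict[str, Sequence[float]], labels: Sequence[int], k: int) -> List[str]:
    """Greedily pick k features by relevance / mean |correlation| with those already picked."""
    relevance = {name: f_statistic(vals, labels) for name, vals in features.items()}
    selected: List[str] = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(name: str) -> float:
            if not selected:
                return relevance[name]  # first pick: relevance only
            redundancy = fmean([abs(pearson(features[name], features[s])) for s in selected])
            return relevance[name] / max(redundancy, 1e-3)  # floor avoids division by ~0
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): 4 positives followed by 4 negatives.
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]
FEATURES = {
    "frag_short": [3, 4, 3, 4, 1, 2, 1, 2],      # most relevant
    "frag_short_dup": [3, 4, 3, 5, 1, 2, 1, 2],  # nearly a copy of frag_short
    "motif_div": [4, 2, 4, 2, 2, 0, 2, 0],       # less relevant, far less redundant
}
print(mrmr(FEATURES, LABELS, k=2))
```

On this toy data a pure relevance ranking would keep both fragment-size features, while the redundancy penalty makes the second pick `motif_div` instead of the near-duplicate.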
## nbdev Architecture
This project is an `nbdev` repository. Do **not** edit the generated `.py` modules in `kreview/` by hand. Make changes in the Jupyter notebooks under `nbs/`, then regenerate the modules with:
```bash
nbdev_export
```
## Resources
- **[Documentation](https://msk-access.github.io/kreview/)**: Full user and developer guide
- **[Contributing](CONTRIBUTING.md)**: How to contribute
- **[Changelog](https://msk-access.github.io/kreview/changelog/)**: Version history