https://github.com/jinia2801/mrd-detection-flow
Unsupervised MRD detection in flow cytometry data using Variational AutoEncoder (VAE) and Gaussian Mixture Model (GMM).
https://github.com/jinia2801/mrd-detection-flow
cancer-detection gmm mrd pytorch sklearn vae
Last synced: about 2 months ago
JSON representation
Unsupervised MRD detection in flow cytometry data using Variational AutoEncoder (VAE) and Gaussian Mixture Model (GMM).
- Host: GitHub
- URL: https://github.com/jinia2801/mrd-detection-flow
- Owner: Jinia2801
- Created: 2025-11-25T05:39:45.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2026-04-07T04:18:42.000Z (2 months ago)
- Last Synced: 2026-04-07T06:24:25.595Z (2 months ago)
- Topics: cancer-detection, gmm, mrd, pytorch, sklearn, vae
- Language: Jupyter Notebook
- Homepage:
- Size: 3.66 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# MRD Detection in Flow Cytometry using VAE and GMM
This repository presents two complementary approaches for detecting **Minimal Residual Disease (MRD)** in flow cytometry data:
- [`VAE/`](./VAE/): Deep learning-based anomaly detection using **Variational Autoencoders**
- [`GMM/`](./GMM/): Probabilistic modeling with **Gaussian Mixture Models**
---
## Dataset Overview
We work with flow cytometry data collected from **12 patients**:
- **Healthy Patients (P1–P6)**:
- ~27 million cells
- Used for model training
- **Unhealthy Patients (P7–P12)**:
- ~20 million cells
- Used for evaluation and MRD prediction
Each cell is represented by **14 features**. The models are trained to learn the healthy distribution and detect anomalous cells in new patient data, which may indicate MRD.
---
## Objective
To accurately identify cells that are anomalous (i.e., likely cancerous) using only **unsupervised learning** methods trained on healthy patient data. These anomalies collectively form an estimate of MRD (%).
---
## Methods Used
### 1. Variational Autoencoder (VAE)
- Learns latent representations via probabilistic encoding/decoding
- Detects anomalies based on **reconstruction error (MSE)**
- Explored different latent dimensions (2 and 4) and β values
- Uses **Leave-One-Patient-Out (LOPO)** validation with **progressive fine-tuning**
- Produces per-cell MSE scores → used to estimate MRD
📎 [Explore VAE Approach](./VAE/README.md)
📎 [VAE Best Model](./VAE/vae_4dim_part2.ipynb)
---
### 2. Gaussian Mixture Model (GMM)
- Trains a mixture of Gaussians on healthy cell data
- Evaluates likelihood of each new cell under the model
- Low-likelihood cells are flagged as anomalies
- Tried multiple component counts (4, 6, 16)
- Compared `full` vs `tied` covariance structures
- Final threshold: **1.5th percentile of healthy scores**
📎 [Explore GMM Approach](./GMM/README.md)
📎 [GMM Best Model](./GMM/GMM_s_complete_tied_4.ipynb)
---
## Evaluation Metrics
- **Mean Squared Error (MSE)** between predicted and actual MRD %
- **Mean Absolute Error (MAE)**
- **MRD Estimation** for each patient based on anomaly scores
Both models approximate expert-annotated MRD scores with high accuracy.
---
## Main libraries:
- `pytorch`
- `scikit-learn`
- `numpy, pandas`
- `matplotlib, seaborn`
- `joblib`
## References
- [PyTorch VAE Tutorial](https://pytorch.org/tutorials/beginner/vae.html)
- [Uncovering Anomalies with Variational Autoencoders – Towards Data Science](https://towardsdatascience.com/uncovering-anomalies-with-variational-autoencoders-vae-a-deep-dive-into-the-world-of-1b2bce47e2e9/)
- [Hands-On Anomaly Detection with Variational Autoencoders – Medium](https://medium.com/data-science/hands-on-anomaly-detection-with-variational-autoencoders-d4044672acd5)
- scikit-learn GMM: [https://scikit-learn.org/stable/modules/mixture.html](https://scikit-learn.org/stable/modules/mixture.html)
- [Understanding Gaussian Mixture Models – Number Analytics Blog](https://www.numberanalytics.com/blog/understanding-gaussian-mixture-models-data-analysis)
- [PMC Article on GMM & MRD](https://pmc.ncbi.nlm.nih.gov/articles/PMC11659572/)