# Code for Multistain Pretraining for Slide Representation Learning in Pathology (ECCV'24)
[arXiv](https://arxiv.org/pdf/2408.02859) | [HuggingFace](https://huggingface.co/MahmoodLab/madeleine) | [Proceedings](https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/4788_ECCV_2024_paper.php)

Welcome to the official GitHub repository of our ECCV 2024 paper, "Multistain Pretraining for Slide Representation Learning in Pathology". This project was developed at the [Mahmood Lab](https://faisal.ai/) at Harvard Medical School and Brigham and Women's Hospital.
## Abstract
Developing self-supervised learning (SSL) models that can learn universal and transferable representations of H&E gigapixel whole-slide images (WSIs) is becoming increasingly valuable in computational pathology. These models hold the potential to advance critical tasks such as few-shot classification, slide retrieval, and patient stratification. Existing approaches for slide representation learning extend the principles of SSL from small images (e.g., 224x224 patches) to entire slides, usually by aligning two different augmentations (or views) of the slide. Yet the resulting representation remains constrained by the limited clinical and biological diversity of the views. Instead, we postulate that slides stained with multiple markers, such as immunohistochemistry, can be used as different views to form a rich task-agnostic training signal. To this end, we introduce MADELEINE, a multimodal pretraining strategy for slide representation learning. MADELEINE is trained with a dual global-local cross-stain alignment objective on large cohorts of breast cancer samples (N=4,211 WSIs across five stains) and kidney transplant samples (N=12,070 WSIs across four stains). We demonstrate the quality of slide representations learned by MADELEINE on various downstream evaluations, ranging from morphological and molecular classification to prognostic prediction, comprising 21 tasks using 7,299 WSIs from multiple medical centers.
## Updates
- 02.2025: MADELEINE has been integrated into [Trident](https://github.com/mahmoodlab/TRIDENT). MADELEINE features can be extracted with:

```bash
python run_batch_of_slides.py --task all --wsi_dir pngs/ --job_dir ./trident_processed8 --slide_encoder madeleine --mag 10 --patch_size 256
```

## Installation
```bash
# Clone repo
git clone https://github.com/mahmoodlab/MADELEINE
cd MADELEINE

# Create conda env
conda create --name madeleine python=3.9
conda activate madeleine
pip install -r requirements.txt
```

## Preprocessing: tissue segmentation, patching, and patch feature extraction
We extract [CONCH](https://github.com/mahmoodlab/CONCH) features at 10x magnification on 256x256-pixel patches. The script uses a deep learning-based tissue segmentation model that provides off-the-shelf H&E and IHC tissue detection (deprecated, please use Trident instead; see Updates).
```bash
cd ./bin
python extract_patch_embeddings.py \
--slide_dir <path_to_slides> \
--local_dir ../results/BCNB \
--patch_mag 10 \
--patch_size 256
```
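Once extraction completes, you can sanity-check a slide's patch embeddings before slide encoding. Below is a minimal sketch, assuming the HDF5 layout used by related Mahmood Lab pipelines (`features` and `coords` datasets); the file name is hypothetical, so check the actual output of `extract_patch_embeddings.py`:

```python
# Minimal sketch for inspecting extracted patch features.
# Assumes an HDF5 layout with 'features' and 'coords' datasets (common in
# Mahmood Lab pipelines); the file name below is hypothetical.
import h5py

with h5py.File("../results/BCNB/patch_embeddings/slide_001.h5", "r") as f:
    features = f["features"][:]  # (num_patches, feature_dim) CONCH patch embeddings
    coords = f["coords"][:]      # (num_patches, 2) patch coordinates in the WSI
print(features.shape, coords.shape)
```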
## Extracting MADELEINE slide embeddings

You can extract MADELEINE slide embeddings (trained on 10x breast samples) using the [HuggingFace](https://huggingface.co/MahmoodLab/madeleine) checkpoint with:
```bash
cd ./bin
python extract_slide_embeddings.py --local_dir ../results/BCNB/
```
Deprecated: please use Trident instead (see Updates).
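If you prefer to fetch the checkpoint manually rather than letting the script download it, a minimal sketch using `huggingface_hub` follows; the file layout inside the `MahmoodLab/madeleine` model repo is an assumption, and `extract_slide_embeddings.py` remains the reference for how the checkpoint is loaded into the model:

```python
# Minimal sketch: download the MADELEINE checkpoint from HuggingFace.
# snapshot_download fetches the whole model repo; the exact file names
# inside it are an assumption -- see extract_slide_embeddings.py for
# how the checkpoint is actually loaded.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="MahmoodLab/madeleine")
print("Checkpoint downloaded to:", local_path)
```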
## Linear probe for molecular status prediction

To run a linear probe with MADELEINE on BCNB molecular status prediction, run:

```bash
cd ./bin
python run_linear_probing.py --slide_embedding_pkl ../results/BCNB/madeleine_slide_embeddings.pkl --label_path ../dataset_csv/BCNB/BCNB.csv
```

BCNB slide embeddings can also be downloaded from [here](). The command performs linear probing for `k=1,10,25` labeled slides per class, testing the data efficiency of the slide embeddings.
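For intuition, a k-shot linear probe can be sketched with scikit-learn as below; this is illustrative only (`run_linear_probing.py` is the reference implementation), and the binary-label setup and array names are assumptions:

```python
# Minimal k-shot linear-probe sketch (illustrative; run_linear_probing.py is
# the reference implementation). Assumes binary labels and precomputed slide
# embeddings X of shape (n_slides, dim) with labels y of shape (n_slides,).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def kshot_probe_auc(X_train, y_train, X_test, y_test, k, seed=0):
    rng = np.random.default_rng(seed)
    # Sample k labeled slides per class for the probe's training set.
    idx = np.concatenate([
        rng.choice(np.where(y_train == c)[0], size=k, replace=False)
        for c in np.unique(y_train)
    ])
    clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```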
## MADELEINE slide embeddings against the state-of-the-art
All models are evaluated with linear probing and no hyper-parameter tuning. MADELEINE slide embeddings outperform various baselines, including GigaPath (Xu et al., *Nature*, 2024), on molecular status prediction:
| Model | ER (k=1) | PR (k=1) | HER2 (k=1) | ER (k=10) | PR (k=10) | HER2 (k=10) | ER (k=25) | PR (k=25) | HER2 (k=25) |
|---|---|---|---|---|---|---|---|---|---|
| **CONCH (mean of patch embs)** | 0.575 | 0.528 | 0.509 | 0.759 | 0.678 | 0.603 | 0.785 | 0.724 | 0.647 |
| **GigaPath (mean of patch embs)** | 0.568 | 0.523 | 0.501 | 0.718 | 0.657 | 0.588 | 0.762 | 0.710 | 0.637 |
| **GigaPath (slide encoder)** | 0.555 | 0.514 | 0.498 | 0.691 | 0.636 | 0.577 | 0.741 | 0.689 | 0.618 |
| **MADELEINE (slide encoder)** | **0.664** | **0.537** | **0.545** | **0.818** | **0.756** | **0.662** | **0.838** | **0.791** | **0.706** |

# How to train your version of MADELEINE
## Train MADELEINE on Breast tissue using ACROBAT
```bash
cd ./bin

# launch pretraining without stain encodings
bash ../scripts/launch_pretrain_withoutStainEncodings.sh

# launch pretraining with stain encodings
bash ../scripts/launch_pretrain_withStainEncodings.sh

# launch both experiments
bash ../scripts/master.sh
```
NOTE: By default, the pretraining script also extracts the slide embeddings of the BCNB dataset used for downstream evaluation.

TIP: Place the data directory on an SSD for faster I/O during training. We train MADELEINE on 3x 24GB RTX 3090 Ti GPUs; training takes ~1 h.
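For readers who want a feel for the objective before diving into the scripts: the global part of MADELEINE's dual global-local cross-stain alignment can be sketched as a symmetric, InfoNCE-style contrastive loss between slide embeddings of the same case in two stains. This is a simplified illustration, not the repo's implementation; the full objective also includes a local, token-level alignment term (see the paper):

```python
# Simplified sketch of a global cross-stain alignment term (symmetric,
# InfoNCE-style). Illustration only: MADELEINE's full objective pairs a
# global loss like this with a local (token-level) alignment term.
import torch
import torch.nn.functional as F

def global_alignment_loss(z_he, z_ihc, temperature=0.07):
    """z_he, z_ihc: (B, D) slide embeddings of the same B cases in two stains."""
    z_he = F.normalize(z_he, dim=-1)
    z_ihc = F.normalize(z_ihc, dim=-1)
    logits = z_he @ z_ihc.t() / temperature  # (B, B) cross-stain similarities
    targets = torch.arange(z_he.size(0), device=z_he.device)
    # Same-case pairs (the diagonal) are positives; all others are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```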
## Issues
- The preferred mode of communication is via GitHub issues.
- If GitHub issues are inappropriate, email avaidya@mit.edu (and cc gjaume@bwh.harvard.edu).
- Immediate response to minor issues may not be available.
- We cannot provide access to CONCH weights. Please refer to the instructions on the [CONCH GitHub page](https://github.com/mahmoodlab/CONCH).

## Cite
If you find our work useful in your research, please consider citing:

```bibtex
@inproceedings{jaume2024multistain,
  title={Multistain Pretraining for Slide Representation Learning in Pathology},
  author={Jaume, Guillaume and Vaidya, Anurag Jayant and Zhang, Andrew and Song, Andrew H. and Chen, Richard J. and Sahai, Sharifa and Mo, Dandan and Madrigal, Emilio and Le, Long Phi and Mahmood, Faisal},
  booktitle={European Conference on Computer Vision},
  year={2024},
  organization={Springer}
}
```

## License
This repository is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license. You are free to download and share the work with proper attribution, but commercial use and modifications are not allowed. Please note that Creative Commons provides this license "as-is" without any warranties or liabilities.