{"id":20756119,"url":"https://github.com/mahmoodlab/hipt","last_synced_at":"2025-04-04T23:08:17.780Z","repository":{"id":37265064,"uuid":"500459271","full_name":"mahmoodlab/HIPT","owner":"mahmoodlab","description":"Hierarchical Image Pyramid Transformer - CVPR 2022 (Oral)","archived":false,"fork":false,"pushed_at":"2024-03-19T15:50:10.000Z","size":776265,"stargazers_count":550,"open_issues_count":23,"forks_count":95,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-03-28T22:13:08.709Z","etag":null,"topics":["computational-pathology","cvpr","cvpr2022","deep-learning","hierarchical-attention-networks","high-resolution","histopathology","pretrained-weights","pytorch","self-supervised-learning","transfer-learning","unsupervised-learning","vision-transformer","weakly-supervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mahmoodlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-06T14:09:27.000Z","updated_at":"2025-03-26T08:45:52.000Z","dependencies_parsed_at":"2024-03-19T17:06:25.700Z","dependency_job_id":null,"html_url":"https://github.com/mahmoodlab/HIPT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoodlab%2FHIPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoodlab%2FHIPT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/mahmoodlab%2FHIPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoodlab%2FHIPT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mahmoodlab","download_url":"https://codeload.github.com/mahmoodlab/HIPT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247261604,"owners_count":20910108,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computational-pathology","cvpr","cvpr2022","deep-learning","hierarchical-attention-networks","high-resolution","histopathology","pretrained-weights","pytorch","self-supervised-learning","transfer-learning","unsupervised-learning","vision-transformer","weakly-supervised-learning"],"created_at":"2024-11-17T09:29:14.668Z","updated_at":"2025-04-04T23:08:17.749Z","avatar_url":"https://github.com/mahmoodlab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning\n===========\n\u003cdetails\u003e\n\u003csummary\u003e\n  \u003cb\u003eScaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning\u003c/b\u003e, CVPR 2022.\n  \u003ca href=\"https://openaccess.thecvf.com/content/CVPR2022/html/Chen_Scaling_Vision_Transformers_to_Gigapixel_Images_via_Hierarchical_Self-Supervised_Learning_CVPR_2022_paper.html\" target=\"blank\"\u003e[HTML]\u003c/a\u003e\n  \u003ca href=\"https://arxiv.org/abs/2206.02647\" target=\"blank\"\u003e[arXiv]\u003c/a\u003e\n  
\u003ca href=\"https://www.youtube.com/watch?v=cABkB1J-GTA\" target=\"blank\"\u003e[Oral]\u003c/a\u003e\n\t\u003cbr\u003e\u003cem\u003e\u003ca href=\"http://richarizardd.me\"\u003eRichard. J. Chen\u003c/a\u003e, \u003ca href=\"https://www.kuanchchen.com\"\u003eChengkuan Chen\u003c/a\u003e, \u003ca href=\"https://www.linkedin.com/in/yicong-jackson-li/\"\u003eYicong Li\u003c/a\u003e, \u003ca href=\"https://twitter.com/tiffanyytchen\"\u003eTiffany Y. Chen\u003c/a\u003e, \u003ca href=\"https://www.gatesfoundation.org/about/leadership/andrew-trister\"\u003eAndrew D. Trister\u003c/a\u003e, \u003ca href=\"http://www.cs.toronto.edu/~rahulgk/index.html\"\u003eRahul G. Krishnan*\u003c/a\u003e, \u003ca href=\"https://faisal.ai/\"\u003eFaisal Mahmood*\u003c/a\u003e\u003c/em\u003e\u003c/br\u003e\n\u003c/summary\u003e\n\n```bash\n@inproceedings{chen2022scaling,\n    author    = {Chen, Richard J. and Chen, Chengkuan and Li, Yicong and Chen, Tiffany Y. and Trister, Andrew D. and Krishnan, Rahul G. and Mahmood, Faisal},\n    title     = {Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning},\n    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n    month     = {June},\n    year      = {2022},\n    pages     = {16144-16155}\n}\n```\n\u003c/details\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg width=\"100%\" alt=\"HIPT Illustration\" src=\".github/HIPT Architecture.gif\"\u003e\n\u003c/div\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e\n\t  \u003cb\u003eKey Ideas \u0026 Main Findings\u003c/b\u003e\n  \u003c/summary\u003e\n\n1. **Hierarchical Image Pyramid Transformer (HIPT) Architecture:** Three-stage hierarchical ViT that formulates gigapixel whole-slide images (WSIs) as a disjoint set of nested sequences. 
HIPT unrolls the WSI into non-overlapping ```[4096 × 4096]``` image regions, followed by unrolling each region into non-overlapping ```[256 × 256]``` image patches, and lastly each patch into non-overlapping ```[16 × 16]``` cell tokens. Our method is analogous to that of hierarchical attention networks in long document modeling, in which word embeddings within sentences are aggregated to form sentence-level embeddings and subsequently aggregated into document-level embeddings. Inference in HIPT is performed via bottom-up aggregation of ```[16 × 16]``` visual tokens in their respective ```[256 × 256]``` and ```[4096 × 4096]``` windows via Transformer attention to compute a slide-level representation.\n2. **Learning Context-Aware Token Dependencies in WSIs:** Note that Transformer attention is computed only in local windows (instead of across the entire WSI), which makes learning long-range dependencies tractable. Though representation learning for ```[4096 × 4096]``` image regions may seem expensive, also note that the patch size at this level is ```[256 × 256]```, and thus has complexity similar to applying ViTs to ```[256 × 256]``` image patches with ```[16 × 16]``` tokens.\n3. **Hierarchical Pretraining:** Since encoding ```[4096 x 4096]``` images is the same subproblem as encoding ```[256 x 256]``` images, we hypothesize that ViT pretraining techniques can generalize to higher resolutions with little modification. DINO is used to not only pretrain ViT-16 in HIPT, but also ViT-256 via [6 x 6] local and [14 x 14] global crops on a 2D grid-of-features (obtained by using ViT-16 as a patch tokenizer for ViT-256).\n4. **Self-Supervised Slide-Level Representation Learning:** HIPT is evaluated via pretraining + freezing the ViT-16 / ViT-256 stages, with the ViT-4K stage finetuned with slide-level labels, assessed on cancer subtyping and survival prediction tasks in TCGA. 
We also perform self-supervised KNN evaluation of HIPT embeddings via computing the mean [CLS]-4K tokens extracted from ViT-256, as a proxy for the slide-level embedding. On Renal Cell Carcinoma subtyping, we report that averaged, pretrained HIPT-4K embeddings without any labels perform as well as CLAM-SB.\n\u003c/details\u003e\n\n## Updates / TODOs\nPlease follow this GitHub for more updates.\n- [ ] Removing dead code in HIPT_4K library.\n- [X] Better documentation on interpretability code example.\n- [x] Add pretrained models + instructions for hierarchical visualization.\n- [X] Add pre-extracted slide-level embeddings, and code for K-NN evaluation.\n- [X] Add weakly-supervised results for Tensorboard.\n\n## Pre-Reqs + Installation\nThis repository includes not only the code base for HIPT, but also saved HIPT checkpoints and pre-extracted HIPT slide embeddings with ~4.08 GiB of storage, which we version control via [Git LFS](https://git-lfs.github.com/).\n\nTo clone this repository without large files initially:\n```bash\nGIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/mahmoodlab/HIPT.git \t# Pulls just the codebase\ngit lfs pull --include \"*.pth\"\t\t\t\t\t\t# Pulls the pretrained checkpoints\ngit lfs pull --include \"*.pt\"\t\t\t\t\t\t# Pulls pre-extracted slide embeddings\ngit lfs pull --include \"*.pkl\"\t\t\t\t\t\t# Pulls pre-extracted patch embeddings\ngit lfs pull --include \"*.png\"\t\t\t\t\t\t# Pulls demo images (required for 4K x 4K visualization)\n```\nTo clone all files:\n```bash\ngit clone https://github.com/mahmoodlab/HIPT.git\n```\n\nTo install Python dependencies:\n```bash\npip install -r requirements.txt\n```\n\n## HIPT Walkthrough\n\n### How HIPT Works\nBelow is a snippet of a standalone two-stage HIPT model architecture that can load fully self-supervised weights for nested [16 x 16] and [256 x 256] token aggregation, defined in [./HIPT_4K/hipt_4k.py](https://github.com/mahmoodlab/HIPT/blob/master/HIPT_4K/hipt_4k.py). 
Via a few ```einops``` operations, you can put together multiple ViT encoders and have them scale to large resolutions. HIPT_4K was used for feature extraction of non-overlapping [4096 x 4096] image regions across the TCGA.\n\n```python\nimport torch\nfrom einops import rearrange, repeat\nfrom HIPT_4K.hipt_model_utils import get_vit256, get_vit4k\n\nclass HIPT_4K(torch.nn.Module):\n    \"\"\"\n    HIPT Model (ViT_4K-256) for encoding non-square images (with [256 x 256] patch tokens), with \n    [256 x 256] patch tokens encoded via ViT_256-16 using [16 x 16] patch tokens.\n    \"\"\"\n    def __init__(self, \n        model256_path: str = 'path/to/Checkpoints/vit256_small_dino.pth',\n        model4k_path: str = 'path/to/Checkpoints/vit4k_xs_dino.pth', \n        device256=torch.device('cuda:0'), \n        device4k=torch.device('cuda:1')):\n\n        super().__init__()\n        self.model256 = get_vit256(pretrained_weights=model256_path).to(device256)\n        self.model4k = get_vit4k(pretrained_weights=model4k_path).to(device4k)\n        self.device256 = device256\n        self.device4k = device4k\n\t\n    def forward(self, x):\n        \"\"\"\n        Forward pass of HIPT (given an image tensor x), outputting the [CLS] token from ViT_4K.\n        1. x is center-cropped such that the W / H is divisible by the patch token size in ViT_4K (e.g. - 256 x 256).\n        2. x then gets unfolded into a \"batch\" of [256 x 256] images.\n        3. A pretrained ViT_256-16 model extracts the CLS token from each [256 x 256] image in the batch.\n        4. These batch-of-features are then reshaped into a 2D feature grid (of width \"w_256\" and height \"h_256\".)\n        5. 
This feature grid is then used as the input to ViT_4K-256, outputting [CLS]_4K.\n\n        Args:\n          - x (torch.Tensor): [1 x C x W' x H'] image tensor.\n\n        Return:\n          - features_cls4k (torch.Tensor): [1 x 192] cls token (d_4k = 192 by default).\n        \"\"\"\n        batch_256, w_256, h_256 = self.prepare_img_tensor(x)                    # 1. [1 x 3 x W x H].\n        batch_256 = batch_256.unfold(2, 256, 256).unfold(3, 256, 256)           # 2. [1 x 3 x w_256 x h_256 x 256 x 256] \n        batch_256 = rearrange(batch_256, 'b c p1 p2 w h -\u003e (b p1 p2) c w h')    # 2. [B x 3 x 256 x 256], where B = (1*w_256*h_256)\n\n\n        features_cls256 = []\n        for mini_bs in range(0, batch_256.shape[0], 256):                       # 3. B may be too large for ViT_256. We further take minibatches of 256.\n            minibatch_256 = batch_256[mini_bs:mini_bs+256].to(self.device256, non_blocking=True)\n            features_cls256.append(self.model256(minibatch_256).detach().cpu()) # 3. Extracting ViT_256 features from [256 x 3 x 256 x 256] image batches.\n\n        features_cls256 = torch.vstack(features_cls256)                         # 3. [B x 384], where 384 == dim of ViT-256 [ClS] token.\n        features_cls256 = features_cls256.reshape(w_256, h_256, 384).transpose(0,1).transpose(0,2).unsqueeze(dim=0) \n        features_cls256 = features_cls256.to(self.device4k, non_blocking=True)  # 4. [1 x 384 x w_256 x h_256]\n        features_cls4k = self.model4k.forward(features_cls256)                  # 5. 
[1 x 192], where 192 == dim of ViT_4K [CLS] token.\n        return features_cls4k\n```\n\n### Using the HIPT_4K API\nYou can use the HIPT_4K model out of the box and plug it into any of your downstream tasks (example below).\n```python\nfrom PIL import Image\nfrom HIPT_4K.hipt_4k import HIPT_4K\nfrom HIPT_4K.hipt_model_utils import eval_transforms\n\nmodel = HIPT_4K()\nmodel.eval()\n\nregion = Image.open('HIPT_4K/image_demo/image_4k.png')\nx = eval_transforms()(region).unsqueeze(dim=0)\nout = model.forward(x)\n```\n\n### Hierarchical Interpretability\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg width=\"100%\" alt=\"DINO illustration\" src=\".github/HIPT_attention.jpg\"\u003e\n\u003c/div\u003e\n\nFor hierarchical interpretability, please see the [following notebook](https://github.com/mahmoodlab/HIPT/blob/master/HIPT_4K/HIPT_4K%20Inference%20%2B%20Attention%20Visualization.ipynb), which uses the functions in [./HIPT_4K/hipt_heatmap_utils.py](https://github.com/mahmoodlab/HIPT/blob/master/HIPT_4K/hipt_heatmap_utils.py).\n\n\n\n## Downloading + Preprocessing + Organizing TCGA Data\nUsing the [NIH Genomic Data Commons Data Portal](https://portal.gdc.cancer.gov/) and the [cBioPortal](https://www.cbioportal.org/), we downloaded diagnostic whole-slide images (WSIs) for 28 cancer types with the [GDC Data Transfer Tool](https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/). We then used the publicly-available [CLAM library](https://github.com/mahmoodlab/CLAM) for tissue segmentation, tissue patching and feature extraction, which we modified for extracting both ResNet-50 features (pretrained on ImageNet) and ViT-16 features (pretrained on the TCGA). For patching at `[256 × 256]` resolution, we used default tissue segmentation parameters. 
For patching at `[4096 × 4096]` resolution, we additionally saved each `[4096 × 4096]` image region, which we used for ViT_256-16 and ViT_4096-256 pretraining (`-16` suffix == using [16 × 16]-sized tokens in a ViT model, `-256` suffix == using [256 × 256]-sized tokens in a ViT model). Extracted TCGA features are organized in the following directories:\n\u003cdetails\u003e\n\u003csummary\u003e\nExample Directory\n\u003c/summary\u003e\n  \n```bash\nTCGA_ROOT_DIR/\n    └──tcga_acc/\n        ├── ...\n    └──tcga_blca/\n        ├── ...\n    └──tcga_brca/\n        └── WSIs/\n            ├── slide_1.svs\n            ├── slide_2.svs\n            └── ...\n        └── extracted_mag20x_patch256_fp/\n            └── masks/\n                ├── slide_1.png\n                ├── slide_2.png\n                └── ...\n            └── patches/\n                ├── slide_1.h5\n                ├── slide_2.h5\n                └── ...\n            └── stitches/\n                ├── slide_1.png\n                ├── slide_2.png\n                └── ...\n            └── resnet50_trunc_pt_patch_features/\n                ├── slide_1.pt\n                ├── slide_2.pt\n                └── ...\n            └── vits_tcga_pancancer_dino_pt_patch_features/\n                ├── slide_1.pt\n                ├── slide_2.pt\n                └── ...\n            └── process_list_autogen.csv\n        └── extracted_mag20x_patch4096_fp/\n            └── masks/\n                ├── slide_1.png\n                ├── slide_2.png\n                └── ...\n            └── patches/\n                ├── slide_1.h5\n                ├── slide_2.h5\n                └── ...\n            └── stitches/\n                ├── slide_1.png\n                ├── slide_2.png\n                └── ...\n            └── tar_patch_4096/\n                ├── slide_1.tar\n                ├── slide_2.tar\n                └── ...\n            └── vits_tcga_pancancer_dino_pt_patch_features/\n                ├── slide_1.pt\n             
   ├── slide_2.pt\n                └── ...\n            └── process_list_autogen.csv\n    └──tcga_coadread/\n        ├── ...\n    ...\n    └──tcga_ucec/\n        ├── ...\n```\n\u003c/details\u003e\n\nEach cancer type is organized as its own folder in `TCGA_ROOT_DIR`, which additionally contains the subfolders listed below.\nWhen extracting patches at 20X magnification with non-overlapping patch sizes of 256, we create a results directory called `extracted_mag20x_patch256_fp` that contains the following files / folders:\n\u003cdetails\u003e\n  \u003csummary\u003e\n    Folder Structure\n  \u003c/summary\u003e\n  \n1. `WSIs/`: Raw `*.svs` WSIs for that cancer type.\n2. `extracted_mag20x_patch256_fp`: Extracted features at 20× magnification for `[256 × 256]` patches (performed only for BRCA, COADREAD, LUAD, LUSC, CCRCC, CHRCC, PRCC, and STAD studies in TCGA). The `_fp` suffix represents the use of \"fast patching\" as performed in CLAM, in which coordinates instead of raw patches are saved. This folder contains the following subfolders:\n    - `masks/`: Directory of segmented tissue-containing regions (one image per WSI).\n    - `patches/`: Directory of extracted image patches (one .h5 file per WSI, where each entry corresponds to the coordinates of the top-left corner of a patch).\n    - `stitches/`: Directory of downsampled visualizations of stitched tissue patches, used as a sanity check to inspect whether we patched correctly (one image per WSI). \n    - `resnet50_trunc_pt_patch_features/`: Directory of pre-extracted ResNet-50 features (pretrained on ImageNet) for each patch within each WSI (with patches read via OpenSlide using coordinates in `patches/`), saved in a `*.pt` format. 
Each `*.pt` file is a `[M × 1024]`-sized Tensor containing extracted 1024-dim embeddings for `M` patches in the WSI.\n    - `vits_tcga_pancancer_dino_pt_patch_features/`: Directory of pre-extracted ViT-16 features (pretrained on TCGA) for each patch within each WSI (with patches read via OpenSlide using coordinates in `patches/`), saved in a `*.pt` format. Each `*.pt` file is a `[M × 384]`-sized Tensor containing extracted 384-dim embeddings for `M` patches in the WSI.\n    - `process_list_autogen.csv`: An auto-generated CSV file that contains a list of all slides processed, along with the segmentation/patching parameters used.\n3. `extracted_mag20x_patch4096_fp`: Extracted features at 20× magnification for `[4096 × 4096]` image regions, containing the following subfolders:\n    - `masks/`: Same as `[256 × 256]` setting.\n    - `patches/`: Same as `[256 × 256]` setting.\n    - `stitches/`: Same as `[256 × 256]` setting.\n    - `tar_patch_4096/`: Directory of saved `[4096 × 4096]` image regions for each WSI, stored in a `*.tar` format using the [WebDataset](https://github.com/webdataset/webdataset) API.\n    - `vits_tcga_pancancer_dino_pt_patch_features/`: Directory of pre-extracted ViT-16 features (pretrained on TCGA) for each `[4096 × 4096]` region within each WSI (with regions read via OpenSlide using coordinates in `patches/`), saved in a `*.pt` format. Each `*.pt` file is a `[M × 256 × 384]`-sized Tensor containing extracted 384-dim embeddings for `M` regions in the WSI, with each region represented as a 256-length sequence of `[256 × 256]` patch embeddings.\n    - `process_list_autogen.csv`: An auto-generated CSV file that contains a list of all slides processed, along with the segmentation/patching parameters used. 
Note that, because a large image resolution is used for patching, not all WSIs are used in `[4096 × 4096]` evaluation.\n\u003c/details\u003e\n\nOrganizing the folders and subfolders for all of these different cancer types (with different feature types too) made it easy to run classification experiments.\n \n## Hierarchical Pretraining for ViT-16/256 Models + Pretrained Models\n\u003cdetails\u003e\n\u003csummary\u003e\nExample Directory\n\u003c/summary\u003e\n  \n```bash\nTCGA_PRETRAINING_DIR/\n  └──patch_256_pretraining/\n      ├── patch_1.png\n      ├── patch_2.png\n      └── ...\n  └──region_4096_pretraining/\n      ├── slide_1_1.pt\n      ├── slide_1_2.pt\n      └── ...\n  └──ckpts/\n      └── pretrain/\n          └── vit256_s_dino.pth\n          └── vit4k_xs_dino.pth\n ```\n \u003c/details\u003e\n \n We set up the following directories for ViT_256-16 and ViT_4K-256 pretraining respectively:\n  - `.../path/to/patch_256_pretraining/`: Directory of raw `[256 × 256]` patches (in `*.png` format) extracted from the `tar_patch_4096/` subdirectories of each cancer type, used to pretrain ViT_256-16.\n  - `.../path/to/region_4096_pretraining/`: Directory of pre-extracted ViT_256-16 features for each `[4096 × 4096]` region across all WSIs (in total: 433779 regions). Each `*.pt` file is a `[256 × 384]`-sized Tensor, which is a 256-length sequence of pre-extracted ViT_256-16 features for each `[256 × 256]` patch. This folder is used to pretrain ViT_4K-256.\n  - `./HIPT_4K/Checkpoints/`: Directory for holding the pretrained weights, which we use for feature extraction. Our pretraining method largely follows the original [DINO](https://github.com/facebookresearch) framework for conventional `[256 × 256]` image pretraining using ViT_256-16, which we extend to the `[4096 × 4096]` setting. Again, note that the `-16` suffix refers to using [16 × 16]-sized tokens in a ViT model, and the `-256` suffix to using [256 × 256]-sized tokens in a ViT model. 
The following commands are used for pretraining.\n\n```bash\npython -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch vit_small --data_path /path/to/TCGA_PRETRAINING_DIR/patch_256_pretraining/ --output_dir /path/to/TCGA_PRETRAINING_DIR/ckpts/pretrain/ --epochs 100\npython -m torch.distributed.launch --nproc_per_node=8 main_dino4k.py --arch vit_xs --data_path /path/to/TCGA_PRETRAINING_DIR/region_4096_pretraining/ --output_dir /path/to/TCGA_PRETRAINING_DIR/ckpts/pretrain/ --epochs 100\n```\n\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eSSL Strategy\u003c/th\u003e\n    \u003cth\u003eViT SSL\u003c/th\u003e\n    \u003cth\u003eDataset\u003c/th\u003e\n    \u003cth\u003eIteration\u003c/th\u003e\n    \u003cth\u003eBatch Size\u003c/th\u003e\n    \u003cth\u003eArch\u003c/th\u003e\n    \u003cth\u003eImage Size\u003c/th\u003e\n    \u003cth\u003eToken Size\u003c/th\u003e\n    \u003cth\u003eDim\u003c/th\u003e\n    \u003cth\u003eDownload\u003c/th\u003e\n  \u003c/tr\u003e\n  \n  \u003ctr\u003e\n    \u003ctd rowspan=\"2\"\u003eHierarchical Pretraining\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/facebookresearch/dino\"\u003eDINO\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003eTCGA\u003c/td\u003e\n    \u003ctd\u003e400,000\u003c/td\u003e\n    \u003ctd\u003e256\u003c/td\u003e\n    \u003ctd\u003eViT-S/16\u003c/td\u003e\n    \u003ctd\u003e256\u003c/td\u003e\n    \u003ctd\u003e16\u003c/td\u003e\n    \u003ctd\u003e384\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/mahmoodlab/HIPT/blob/master/HIPT_4K/Checkpoints/vit256_small_dino.pth\"\u003eBackbone\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\t\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/facebookresearch/dino\"\u003eDINO\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003eTCGA\u003c/td\u003e\n    \u003ctd\u003e200,000\u003c/td\u003e\n    \u003ctd\u003e256\u003c/td\u003e\n    \u003ctd\u003eViT-XS/256\u003c/td\u003e\n    
\u003ctd\u003e4096\u003c/td\u003e\n    \u003ctd\u003e256\u003c/td\u003e\n    \u003ctd\u003e192\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/mahmoodlab/HIPT/blob/master/HIPT_4K/Checkpoints/vit4k_xs_dino.pth\"\u003eBackbone\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n\n\n## Weakly-Supervised Training + Evaluation\nFollowing ViT-16/256 pretraining and pre-extracting instance-level `[256 × 256]` features using ViT-16, we extend the publicly-available CLAM scaffold code for running 10-fold cross-validation experiments as well as implement several of the current weakly-supervised baselines. Our main method is `hipt_lgp` (abbreviated for HIPT with Local-Global Pretraining). We make available our [saved results directory](https://github.com/mahmoodlab/HIPT/tree/master/2-Weakly-Supervised-Subtyping/results_cvpr2022_class), [evaluation code](https://github.com/mahmoodlab/HIPT/blob/master/2-Weakly-Supervised-Subtyping/Evaluation-Classification.ipynb), and a [Jupyter Notebook](https://github.com/mahmoodlab/HIPT/blob/master/2-Weakly-Supervised-Subtyping/Model%20Walkthrough.ipynb) containing a walkthrough of our method.\n\n\u003cdetails\u003e\n\u003csummary\u003e\nFull List of Training Classification Commands\n\u003c/summary\u003e\n\n```python\nGPU=0\nDATAROOT=/path/to/TCGA_ROOT_DIR/\nTASK=tcga_brca_subtype\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 
--pretrain_4k vit4k_xs_dino --freeze_4k\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino --freeze_4k\nTASK=tcga_kidney_subtype\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino --freeze_4k\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino --freeze_4k\nTASK=tcga_lung_subtype\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 0.25 --pretrain_4k vit4k_xs_dino --freeze_4k\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --model_type hipt_lgp --task $TASK --prop 1.0 --pretrain_4k vit4k_xs_dino --freeze_4k\n```\n\u003c/details\u003e\n \nAnalogously, we also use the [MCAT](https://github.com/mahmoodlab/MCAT) 
scaffold code for survival prediction, and make available our [saved results directory / tensorboard logs](https://github.com/mahmoodlab/HIPT/tree/master/2-Weakly-Supervised-Survival/results_2022_surv/5foldcv) and [evaluation code](https://github.com/mahmoodlab/HIPT/blob/master/2-Weakly-Supervised-Survival/Evaluation-Survival.ipynb).\n\u003cdetails\u003e\n\u003csummary\u003e\nFull List of Training Survival Commands\n\u003c/summary\u003e\n \n```python\nDATAROOT=/path/to/TCGA_ROOT_DIR/\nGPU=0\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_brca --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_coadread --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_kirc --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_kirp --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_luad --mode pyramid --model_type hipt_lgp --pretrain_4k vit4k_xs_dino --freeze_4k\nCUDA_VISIBLE_DEVICES=$GPU python main.py --data_root_dir $DATAROOT --which_splits 5foldcv --split_dir tcga_stad --mode pyramid --model_type hipt_n --pretrain_4k vit4k_xs_dino --freeze_4k\n```\n\u003c/details\u003e\n\n## Understanding Baselines, Clarifications, and Future Work\nIn making the pretrained weights for HIPT fully-available, we hope that HIPT can be plugged-and-played in your experiments, and you would find the same level of improvement :). 
In building off of this work, we clarify a few details:\n- As slide-level tasks in the TCGA do not have official benchmarks, reported AUC performance may vary with different train-test splits. The results in this work use the following 10-fold CV and 5-fold CV train-test splits, which have been used consistently in prior works. Though the comparisons of MIL architecture performance are equivalent (all methods using same pretrained patch-level embeddings), general comparisons with MIL performance of prior works cannot be made, as: 1) different patch-level embeddings are used for training MIL methods (ImageNet ResNet-50 vs. SSL ViT-16), 2) a number of WSIs were excluded in each cohort, due to the lack of tissue content in patching at [4096 x 4096] resolution. To reproduce the results of this paper, you must use the exact train-test splits with the same pretrained embedding type.\n- Despite average ViT_4K-256 performing well in KNN evaluation, average ViT_256-16 embeddings did not perform as well as mean ResNet-50 (transferred from ImageNet) embeddings on some of the downstream tasks. Since Hierarchical Pretraining of ViT_4K-256 depends on pre-extracted ViT_256-16 embeddings, there is (of course) considerable room for improvement in boosting unsupervised and weakly-supervised slide-level performance in refining the ViT_256-16 encoder.\n\n\n## Issues\n- Please open new threads or report issues directly to richardchen@g.harvard.edu.\n\n## Acknowledgements, License \u0026 Usage \n- We thank Felix Yu, Ming Y. 
Lu, Chunyuan Li, and the BioML group at Microsoft Research New England for their insightful feedback.\n- Code for Weakly-Supervised Subtyping + Survival Classification was largely adapted from [CLAM](https://github.com/mahmoodlab/CLAM) and [MCAT](https://github.com/mahmoodlab/MCAT).\n- Code for Hierarchical Pretraining was largely adapted from [DINO](https://github.com/facebookresearch/dino), with modifications.\n- Code for self-supervised evaluation was built on our previous [NeurIPS workshop paper](https://github.com/Richarizardd/Self-Supervised-ViT-Path).\n- If you found our work useful in your research, please consider citing our work(s):\n\n```bash\n@article{chen2022self,\n    author    = {Chen, Richard J and Krishnan, Rahul G},\n    title     = {Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology},\n    journal   = {Learning Meaningful Representations of Life, NeurIPS 2021},\n    year      = {2021},\n}\n\n@inproceedings{chen2022scaling,\n    author    = {Chen, Richard J. and Chen, Chengkuan and Li, Yicong and Chen, Tiffany Y. and Trister, Andrew D. and Krishnan, Rahul G. 
and Mahmood, Faisal},\n    title     = {Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning},\n    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n    month     = {June},\n    year      = {2022},\n    pages     = {16144-16155}\n}\n```\nAny work that cites HIPT should also cite the [original Vision Transformer](https://arxiv.org/abs/2010.11929) and [DINO](https://github.com/facebookresearch/dino).\n\n\n© This code is made available under the Commons Clause License and is available for non-commercial academic purposes.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmahmoodlab%2Fhipt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmahmoodlab%2Fhipt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmahmoodlab%2Fhipt/lists"}