{"id":20756181,"url":"https://github.com/mahmoodlab/sish","last_synced_at":"2025-10-17T20:36:13.808Z","repository":{"id":41557291,"uuid":"372622885","full_name":"mahmoodlab/SISH","owner":"mahmoodlab","description":"Fast and scalable search of whole-slide images via self-supervised deep learning - Nature Biomedical Engineering","archived":false,"fork":false,"pushed_at":"2023-06-09T19:29:42.000Z","size":986,"stargazers_count":100,"open_issues_count":0,"forks_count":27,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-30T11:51:12.528Z","etag":null,"topics":["bioimage-analysis","bioimage-informatics","deep-learning","fish","histology","histopathology","image-retrieval","image-search-engine","mahmoodlab","pathology","vqvae","wsi-images"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mahmoodlab.png","metadata":{"files":{"readme":"docs/README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-31T20:34:56.000Z","updated_at":"2025-03-19T07:34:44.000Z","dependencies_parsed_at":"2024-11-17T09:29:40.297Z","dependency_job_id":"ff0c6e13-d8bc-4d88-b119-e870ab79cc13","html_url":"https://github.com/mahmoodlab/SISH","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoodlab%2FSISH","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoodlab%2FSISH/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoodlab%2FSISH/releases",
"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahmoodlab%2FSISH/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mahmoodlab","download_url":"https://codeload.github.com/mahmoodlab/SISH/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251366461,"owners_count":21578126,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioimage-analysis","bioimage-informatics","deep-learning","fish","histology","histopathology","image-retrieval","image-search-engine","mahmoodlab","pathology","vqvae","wsi-images"],"created_at":"2024-11-17T09:29:30.013Z","updated_at":"2025-10-17T20:36:08.753Z","avatar_url":"https://github.com/mahmoodlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"SISH \n===========\nFast and scalable search of whole-slide images via self-supervised deep learning\n\n*Nature Biomedical Engineering*\n\n[Read Link](https://t.co/nEd5wulzHh) | [Journal Link](https://www.nature.com/articles/s41551-022-00929-8) | [Preprint](https://arxiv.org/abs/2107.13587) | [Cite](#reference)\n\n***TL;DR:** SISH is a histology whole slide image search pipeline that scales with O(1) and maintains constant search speed regardless of the size of the database. SISH uses self-supervised deep learning to encode meaningful representations from WSIs and a Van Emde Boas tree for fast search, followed by an uncertainty-based ranking algorithm to retrieve similar WSIs. 
We evaluated SISH on multiple tasks and datasets with over 22,000 patient cases spanning 56 disease subtypes. We additionally demonstrate that SISH can be used to assist with the diagnosis of rare cancer types where sufficient cases may not be available to train traditional deep models.*\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"../assets/FISH_github_front_long.gif\" alt=\"Teaser\" width=\"600\"\u003e\n\u003c/p\u003e\n\n## Pre-requisites:\n* Linux (Tested on Ubuntu 18.04)\n* NVIDIA GPU (NVIDIA GeForce 2080 Ti) \n* Python (3.7.0), OpenCV (3.4.0), Openslide-python (1.1.1) and PyTorch (1.5.0)\n\nFor more details, please refer to the [installation guide](INSTALLATION.md).\n\n## Usage\nThe steps below show how to build the SISH pipeline on your own dataset. To reproduce the results in our paper, please refer to the Reproducibility section.\n### Preprocessing\n#### Step 1: Slide preparation\nMake the `./DATA` folder, download whole slide images there, and then organize them into the following structure. Note that we ignore slides without a specific resolution. \n```bash\nDATA\n└── WSI\n    ├── SITE\n    │   ├── DIAGNOSIS\n    │   │   ├── RESOLUTION\n    │   │   │   ├── slide_1\n    │   │   │   ├── slide_2\n    │   │   │   ├── slide_3\n```\n#### Step 2: Segmentation and Patching\nWe use the [CLAM toolbox](https://github.com/mahmoodlab/CLAM/blob/master/docs/README.md) to segment and patch whole slide images. Simply run:\n```\npython create_patches_fp.py --source ./DATA/WSI/SITE/DIAGNOSIS/RESOLUTION/ --step_size STEP_SIZE --patch_size PATCH_SIZE --seg --patch --save_dir ./DATA/PATCHES/SITE/DIAGNOSIS/RESOLUTION\n```\nWe set `PATCH_SIZE` and `STEP_SIZE` to 1024 for 20x slides and to 2048 for 40x slides. 
After segmentation and patching, the `DATA` directory will look like the following:\n```bash\nDATA/\n├── PATCHES\n│   └── SITE\n│       └── DIAGNOSIS\n│           └── RESOLUTION\n│               ├── masks\n│               ├── patches\n│               ├── process_list_autogen.csv\n│               └── stitches\n└── WSI\n\n```\n#### Step 3: Mosaic generation\nThe following script generates the mosaics for each whole slide image (please download the checkpoint `trash_lgrlbp.pkl` from the link in the Reproducibility section):\n```\npython extract_mosaic.py --slide_data_path ./DATA/WSI/SITE/DIAGNOSIS/RESOLUTION --slide_patch_path ./DATA/PATCHES/SITE/DIAGNOSIS/RESOLUTION/patches/ --save_path ./DATA/MOSAICS/SITE/DIAGNOSIS/RESOLUTION\n```\n\nOnce mosaic generation finishes, in rare cases the resulting mosaics contain artifacts (i.e., pure white patches). Run the following script to remove them:\n```\npython artifacts_removal.py --site_slide_path ./DATA/WSI/SITE/  --site_mosaic_path ./DATA/MOSAICS/SITE\n```\nThe `DATA` directory should then look like the following. We only use the mosaics in the `coord_clean` folder for all experiments in the paper.\n```bash\nDATA/\n├── MOSAICS\n│   └── SITE\n│       └── DIAGNOSIS\n│           └── RESOLUTION\n│               ├── coord\n│               ├── coord_clean\n├── PATCHES\n└── WSI\n```\n#### Step 4: SISH database construction\nTo build the database for each anatomic site, run `build_index.py` as below:\n```\npython build_index.py --site SITE\n```\nAfter the script completes, it creates a database folder organized as follows:\n```bash\nDATABASES/\n└── SITE\n    ├── index_meta\n    │   └── meta.pkl\n    └── index_tree\n        └── veb.pkl\n```\nThe `index_meta/meta.pkl` file stores the metadata of each integer key in `index_tree/veb.pkl`. 
\nIt also creates a `LATENT` folder that stores the mosaic latent codes from the VQ-VAE and the texture features from DenseNet, with the structure below:\n```bash\n\nDATA/LATENT/\n├── SITE\n│   ├── DIAGNOSIS\n│   │   ├── RESOLUTION\n│   │   │   ├── densenet\n│   │   │   │   ├── slide_1.pkl\n│   │   │   │   └── slide_2.pkl\n│   │   │   └── vqvae\n│   │   │       ├── slide_1.h5\n│   │   │       └── slide_2.h5\n\n```\n#### Step 5: Search the whole database\nRun the script below to get each query's results in the database.\n```\npython main_search.py --site SITE --db_index_path ./DATABASES/SITE/index_tree/veb.pkl --index_meta_path ./DATABASES/SITE/index_meta/meta.pkl\n```\n\nIt stores the results for each query and the time each query takes in two separate folders:\n```bash\nQUERY_RESULTS/\n└── SITE\n    └── results.pkl\nQUERY_SPEED/\n├── SITE\n│   └── speed_log.txt\n```\n#### Step 6: Evaluation\nRun `eval.py` to get the performance results, which are printed directly to the screen when the script finishes.\n```bash\npython eval.py --site SITE --result_path QUERY_RESULTS/SITE/results.pkl\n```\n\n### Optional: SISH for patch retrieval\nIf you would like to use SISH for the patch retrieval task, please organize your data into the structure below:\n```bash\n./DATA_PATCH/\n├── All\n├── summary.csv\n```\nwhere all patch files are in the folder `All` and the `summary.csv` file stores each patch name and label in the format below:\n```bash\npatch1,y1\npatch2,y2\npatch3,y3\n...\n```\nOnce prepared, run the following:\n\nBuild database:\n```\npython build_index_patch.py --exp_name EXP_NAME --patch_label_file ./DATA_PATCH/summary.csv --patch_data_path ./DATA_PATCH/All\n```\nwhere `EXP_NAME` is a custom name for this database. You can reproduce our kather100k results by setting `EXP_NAME=kather100k`. 
Note that if you use your own patch data, you should implement your own method, starting from line 236, to scale your patches to 1024x1024.\n\n\nSearch:\n```\npython main_search_patch.py --exp_name EXP_NAME --patch_label_file ./DATA_PATCH/summary.csv --patch_data_path ./DATA_PATCH/All --db_index_path DATABASES_PATCH/EXP_NAME/index_tree/veb.pkl --index_meta_path DATABASES_PATCH/EXP_NAME/index_meta/meta.pkl\n```\n\nEvaluation:\n```\npython eval_patch.py --result_path QUERY_RESULTS/PATCH/EXP_NAME/results.pkl\n```\n\n## Reproducibility\nTo reproduce the results in our paper, please download the checkpoints, preprocessed latent codes, and pre-built databases from the [link](https://drive.google.com/drive/folders/1HClR9ms737qx0d22ia0VQqPLuTIu1UgN?usp=sharing). The preprocessed latent codes and pre-built databases are the direct outputs of **Steps 1-4** when you start everything from scratch. Once downloaded, unzip DATABASES.zip and LATENT.ZIP under `./SISH` and `./SISH/DATA/` respectively.\nThe folder structures should look like the ones in **Step 4**. Run the commands in **Step 5** and **Step 6** to reproduce the results for each site. \n\nTo reproduce the anatomic site retrieval, run\n```\npython main_search.py --site organ --db_index_path DATABASES/organ/index_tree/veb.pkl --index_meta_path DATABASES/organ/index_meta/meta.pkl\n```\nand\n```\npython eval.py --site organ --result_path QUERY_RESULTS/organ/results.pkl\n```\nNote that the speed results may differ from those in the paper if your CPU is not equivalent to ours (AMD Ryzen Threadripper 3970X 32-Core Processor).\n\n\n## Funding\nThis work was funded by NIH NIGMS [R35GM138216](https://reporter.nih.gov/search/sWDcU5IfAUCabqoThQ26GQ/project-details/10029418).\n\n## Reference\nIf you find our work useful in your research or if you use parts of this code, please consider citing our [paper](https://www.nature.com/articles/s41551-022-00929-8):\n\nChen, C., Lu, M.Y., Williamson, D.F.K. et al. 
Fast and scalable search of whole-slide images via self-supervised deep learning. Nat. Biomed. Eng 6, 1420–1434 (2022) https://doi.org/10.1038/s41551-022-00929-8\n```\n@article{chen2022fast,\n  title={Fast and scalable search of whole-slide images via self-supervised deep learning},\n  author={Chen, Chengkuan and Lu, Ming Y and Williamson, Drew FK and Chen, Tiffany Y and Schaumberg, Andrew J and Mahmood, Faisal},\n  journal={Nature Biomedical Engineering},\n  volume={6},\n  number={12},\n  pages={1420--1434},\n  year={2022},\n  publisher={Nature Publishing Group UK London}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmahmoodlab%2Fsish","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmahmoodlab%2Fsish","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmahmoodlab%2Fsish/lists"}