{"id":13752659,"url":"https://github.com/owkin/HE2RNA_code","last_synced_at":"2025-05-09T20:34:10.213Z","repository":{"id":37634238,"uuid":"275776886","full_name":"owkin/HE2RNA_code","owner":"owkin","description":"Train a model to predict gene expression from histology slides.","archived":true,"fork":false,"pushed_at":"2022-07-06T20:53:24.000Z","size":2674,"stargazers_count":93,"open_issues_count":14,"forks_count":37,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-11-16T05:32:08.932Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/owkin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-06-29T08:39:30.000Z","updated_at":"2024-11-15T02:04:22.000Z","dependencies_parsed_at":"2022-07-12T16:34:51.634Z","dependency_job_id":null,"html_url":"https://github.com/owkin/HE2RNA_code","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/owkin%2FHE2RNA_code","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/owkin%2FHE2RNA_code/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/owkin%2FHE2RNA_code/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/owkin%2FHE2RNA_code/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/owkin","download_url":"https://codeload.github.com/owkin/HE2RNA_code/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count"
:253321711,"owners_count":21890448,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:08.974Z","updated_at":"2025-05-09T20:34:10.199Z","avatar_url":"https://github.com/owkin.png","language":"Python","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"# Gene expression prediction\n\nPredict gene expression from whole-slide images (WSIs) taken from TCGA with HE2RNA [1]. The model takes as inputs arrays of size n_tiles * 2048, where n_tiles = 100 when super-tile preprocessing is used, and n_tiles = 8,000 when all tiles are treated separately. The model is implemented as a succession of 1D convolutions (equivalent to an MLP shared among all tiles).\nAdditionally, model interpretability can be explored at: https://owkin.com/he2rna-result-visualization/.\n\n## Installation\n\nCreate a virtual environment and install the required packages (the variable CUDA_TOOLKIT_ROOT_DIR is needed to install libKMcuda):\n```bash\npython3 -m venv .env\nsource .env/bin/activate\n\nexport CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda\npip install -r requirements.txt\n```\nNOTE: the code was run with Python 3.7.4.\n\n## Data collection and preprocessing\n\nTo ensure reproducibility of the results, coordinates of the tiles used in the paper (necessary to extract tile images and features from whole-slide images) are provided in the archive tile_coordinates.gz.\n\nEDIT: due to an issue related to data quota, the file tile_coordinates.gz should instead be downloaded from https://drive.google.com/file/d/1PJsUv1SQieJs7hqtWOqW68v1K9c-mIF6/view?usp=sharing.\n\nTo 
uncompress it, run\n```bash\ntar -xzvf tile_coordinates.gz\n```\nSplits used in the paper are also provided in patient_splits.pkl.\n\n### TCGA\n\nWe originally downloaded the whole-slide images from the TCGA data portal https://portal.gdc.cancer.gov/ via the gdc-client tool. To access all of the TCGA data used in this work, follow the steps described below.\nFirst, create a folder to store data.\n```bash\nmkdir data\n```\nPaths to folders containing slides, tile features and RNAseq data should be consistent with the content of the file constant.py. If data is saved in a different location, constant.py has to be modified accordingly, as well as the example config files.\n\n#### Download TCGA slides\nGo to the dedicated folder to store files (all FFPE slides from TCGA amount to approx. 10 TB)\n```bash\ncd data\nmkdir TCGA_slides/\ncd TCGA_slides/\n```\nFor each project, create a subfolder, e.g.\n```bash\nmkdir TCGA_LIHC/\ncd TCGA_LIHC/\n```\nDownload images using the corresponding manifest:\n```bash\ngdc-client download -m gdc_manifests/gdc_manifest.2018-06-26_TCGA-LIHC.txt\n```\nFrozen slides from COAD and READ have been grouped together in CRC.\n```bash\nmkdir TCGA_CRC_frozen/\ncd TCGA_CRC_frozen/\ngdc-client download -m gdc_manifests/gdc_manifest.CRC_frozen_slides.txt\n```\n   \n#### Tile feature extraction\nThe code in extract_tile_features_from_slides.py is designed to extract ResNet features of tile images directly from whole-slide images, using the coordinates of the tiles in OpenSlide format. To extract tile features from WSIs from a given TCGA project, e.g. 
LIHC, run:\n```bash\nmkdir TCGA_tiles/\n\npython extract_tile_features_from_slides.py --path_to_tiles /TCGA_slides/TCGA_LIHC --tile_coordinates tile_coordinates/tile_coordinates_TCGA_LIHC.pkl --path_to_save_features TCGA_tiles/TCGA_LIHC\n```\n\n#### Download and preprocess RNAseq data\nCreate a folder to store RNAseq data and download transcriptomes:\n```bash\ncd data\nmkdir TCGA_transcriptome\ncd TCGA_transcriptome\ngdc-client download -m gdc_manifests/gdc_manifest.2018-03-13_alltranscriptome.txt\n```\nAt this stage, there should be one folder per sample, containing a .gz archive. Extract the archives, using for instance gunzip:\n```bash\ngunzip */*.txt.gz\n```\nFor convenience, save a file containing the transcriptomes matched to whole-slide images, using\n```bash\npython transcriptome_data.py\n```\n\n#### Supertile preprocessing\nFinally, once all previous steps have been completed, supertile preprocessing can be performed using the following command (the csv file containing the transcriptomes is used here to ensure consistency between preprocessed image samples and RNAseq data):\n```bash\npython supertile_preprocessing.py --path_to_slides data/TCGA_slides --path_to_transcriptome data/TCGA_transcriptome/all_transcriptomes.csv --path_to_save_processed_data data/TCGA_100_supertiles.h5 --n_tiles 100\n```\n\n### 100,000 histological images of human colorectal cancer and healthy tissue\nThe dataset '100,000 histological images of human colorectal cancer and healthy tissue' [2] is available from https://zenodo.org/record/1214456#.XpgF4m46--w. The file we use here is NCT-CRC-HE-100K-NONORM.zip. Download this file and unzip it. You should have a folder (e.g. 
data/NCT-CRC-HE-100K-NONORM) containing one subfolder per class (ADI, LYM, etc.).\n\nThe code in extract_tile_features.py is designed to extract ResNet features from those tile images:\n\n```bash\npython extract_tile_features.py --path_to_tiles data/NCT-CRC-HE-100K-NONORM --path_to_save_features data/NCT-CRC-HE-100K-NONORM_tiles\n```\n\n### PESO\nThe Prostate Epithelium Segmentation dataset (PESO) [3] (whole-slide images and segmentation masks) is available from https://zenodo.org/record/1485967#.Xusr2PI6--x (peso_training_wsi_x.zip and peso_training_masks.zip). Download and unzip those files in a folder (e.g. data/PESO) so that this folder contains subfolders named peso_training_wsi_x/\n\nThe code in extract_tile_features_from_slides.py can be used to extract features from the PESO dataset, using tile_coordinates_PESO.pkl:\n\n```bash\npython extract_tile_features_from_slides.py --path_to_slides data/PESO --tile_coordinates tile_coordinates/tile_coordinates_PESO.pkl --path_to_save_features data/PESO_tiles\n```\n\n## Gene expression prediction experiment\n\nTo run an experiment, first write a config file or use one of the examples available in the configs folder. 
\n\n* config_all_genes.ini: simultaneous prediction of all genes on all TCGA data, using super-tile-preprocessed data.\n* config_CD3_all_TCGA.ini: prediction of CD3 genes on all TCGA data, using super-tile-preprocessed data.\n* config_CD3_selection.ini: prediction of CD3 genes on a subset of cancers (COAD/LIHC/PRAD/LUAD/LUSC/BRCA), using all available tiles (8,000) per slide, and starting training from a previously saved checkpoint.\n\nSimilarly for CD19/CD20 genes, epithelium genes (TP63, KRT8 and KRT18) and MKI67.\n\nLaunch an experiment with a single train-test split:\n```bash\npython main.py --config \u003cconfig_file\u003e --run single_run --logdir ./exp\n```\nLaunch cross-validation:\n```bash\npython main.py --config \u003cconfig_file\u003e --run cross_validation --n_folds 5 --logdir ./exp\n```\nLaunch TensorboardX to visualize training curves:\n```bash\ntensorboard --logdir=./exp --port=6006\n```\n\nResults will be saved in the specified path as follows:\n* for a single train/valid/test split, the model will be saved as model.pt and the correlation per gene and cancer type will be saved as results_single_split.csv\n* for a cross-validation, each model will be saved in a dedicated folder model_i/model.pt, and the correlation per gene, cancer type and fold will be saved as results_per_fold.csv.\n\n### Config file options\n\n* [main]\n\t* path: Path to the directory where the model's weights will be saved.\n\t* use_saved_model (optional): Path to a previous experiment from which to reload saved models.\n\t* splits (optional): Path to a pickle file containing saved patient splits for cross-validation, useful in particular when finetuning a model on a subset of the data, to ensure consistency of the train and test sets with those used for pretraining.\n\t* single_split (optional): Path to a pickle file containing a saved patient split for a single run.\n\n* [data]\n\t* genes (optional): List of comma-separated Ensembl IDs, or path to a pickle file containing such a list. 
If None, all available genes with nonzero median expression are used.\n\t* path_to_transcriptome (optional): If None, build targets from projectname and list of genes. Otherwise, load transcriptome data from a saved csv file.\n\t* path_to_data (optional): Path to the data, saved either in a pickle file (for aggregated data) or in an hdf5 file. If None, build the dataset from .npy files.\n\n* [architecture]\n\t* layers: Integers defining the number of feature maps of the model's 1D convolutional layers\n\t* dropout: Float between 0 and 1.\n\t* ks: List of ks to sample from\n\t* nonlin: 'relu', 'sigmoid' or 'tanh'.\n\t* device: 'cpu' or 'cuda'.\n\n* [training]\n\t* max_epochs: Integer, defaults to 200.\n\t* patience: Integer, defaults to 20.\n\t* batch_size: Integer, defaults to 16.\n\t* num_workers: number of workers used for loading batches, defaults to 0 (value should be 0 when working with hdf5-stored data)\n\n* [optimization]\n\t* algo: 'sgd' or 'adam'.\n\t* lr: Float.\n\t* momentum: Float, optional\n\n## Spatialization of gene expression\n\n### Spatialization of lymphocyte genes in colorectal cancer\n\nOnce a model has been trained to predict the expression of genes specifically expressed by lymphocytes (for instance CD3), the following script can be used to compute the AUCs for distinguishing tiles labelled with lymphocytes (LYM) from other categories\n```bash\npython spatialization.py --experiment CRC --path_to_model CD3_selection --path_to_tiles data/NCT-CRC-HE-100K-NONORM_tiles\n```\n\n### Spatialization of epithelium genes in prostate adenocarcinoma\n\nOnce a model has been trained to predict the expression of genes specifically expressed by the epithelium in prostate,\nthe following script can be used to compare the average expression predicted by the model for those genes and the ground truth segmentation of epithelium\n```bash\npython spatialization.py --experiment PESO --path_to_model epithelium_selection --path_to_tiles data/PESO_tiles 
--path_to_masks data/PESO/peso_training_masks --corr pearson\n```\n\n## MSI prediction\n\nThis part is relatively independent. All that is needed here is:\n* preprocessed tiles from a dataset with MSI status: COAD (FFPE or frozen), READ (FFPE or frozen) or STAD (FFPE)\n* RNAseq data from this dataset\n```bash\npython msi_prediction.py --cancer_types COAD READ --type_of_slides FFPE --msi_l 0 --Nsplit 50 --Ncval_all 10 --Ncval 10 --n_internsplit_A 3 --n_internsplit_B 3 --n_epoch 50\n```\nNote: for this part, tile features from CRC frozen slides are expected to be located in PATH_TO_TILES/TCGA_CRC_frozen.\n\n\n## References\n\n[1] Schmauch, B., Romagnoni, A., Pronier, E., Saillard, C., Maillé, P., Calderaro, J., ... \u0026 Courtiol, P. (2019). Transcriptomic learning for digital pathology. bioRxiv, 760173.\n\n[2] Kather, J. N., et al. 100,000 histological images of human colorectal cancer and healthy tissue (Version v0.1). Zenodo. http://doi.org/10.5281/zenodo.1214456 (2018).\n\n[3] Bulten, W., et al. PESO: Prostate Epithelium Segmentation on H\u0026E-stained prostatectomy whole slide images (Version 1). Zenodo. http://doi.org/10.5281/zenodo.1485967 (2018).\n\n# License\n\nGPL v3.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fowkin%2FHE2RNA_code","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fowkin%2FHE2RNA_code","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fowkin%2FHE2RNA_code/lists"}