https://github.com/jordan2lee/classify-lab-models-and-tumors
Cancer subtype tool for tumors and their lab grown models
- Host: GitHub
- URL: https://github.com/jordan2lee/classify-lab-models-and-tumors
- Owner: jordan2lee
- Created: 2025-04-14T22:35:00.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-18T23:35:34.000Z (about 1 year ago)
- Last Synced: 2025-04-19T09:51:40.454Z (about 1 year ago)
- Topics: cancer, machine-learning, model, molecular, neural-networks, subtype, tumor
- Language: Python
- Homepage:
- Size: 49.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Subtype Classification of Tumors and Derived Lab Grown Models
Molecular subtyping using the TMP Toolkit
## Table of contents
- [Quickstart Guide](#quickstart-guide)
- [Download Data from Manifest File Using the GDC Client](#download-data-from-manifest-file-using-the-gdc-client)
- [Run Processing Pipeline](#run-processing-pipeline)
- [Sample Subtype Classification Using Gene Expression Data](#sample-subtype-classification-using-gene-expression-data)
- [Sample Subtype Classification Using DNA Methylation Data](#sample-subtype-classification-using-dna-methylation-data)
## Quickstart Guide
### Setup
Install requirements - detailed instructions are found on the [Requirements page](doc/requirements.md):
1. Install Python 3+
2. Install GDC Data Transfer Tool Client
Ensure that all steps on the [Requirements page](doc/requirements.md) are completed *(this includes creating the working environment, signing in, and manually downloading the required data)*.
## Download Data from Manifest File Using the GDC Client
Download gene expression data:
```bash
bash scripts/gdc_download.sh PAAD
```
This will create subfolders in `data-raw/_GEXP_` and place the GDC molecular matrices there.
> Options for the cancer cohort include `ALL`, `BLCA`, `BRCA`, `COADREAD`, `ESO`, `HNSC`, `KID`, `LGGGBM`, `LIHCCHOL`, `LUNG`, `OV`, `PAAD`, `SARC`, `SKCM`, `UCEC`
For more details on each cancer cohort option see [Cohort Options Page](doc/cohort_options.md)
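To download several cohorts in one go, the command above can be wrapped in a small loop. The following is a minimal sketch, not part of the repository: the function names `is_valid_cohort` and `download_cohorts` and the `DRY_RUN` variable are hypothetical, and the cohort names are copied from the options list above.

```bash
# Return 0 if $1 is one of the documented cohort names, 1 otherwise.
is_valid_cohort() {
  case "$1" in
    ALL|BLCA|BRCA|COADREAD|ESO|HNSC|KID|LGGGBM|LIHCCHOL|LUNG|OV|PAAD|SARC|SKCM|UCEC)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Download each requested cohort in turn (hypothetical helper).
# Unknown names are skipped with a warning on stderr.
# With DRY_RUN=1 the commands are printed instead of executed.
download_cohorts() {
  local cohort
  for cohort in "$@"; do
    if ! is_valid_cohort "$cohort"; then
      echo "Skipping unknown cohort: $cohort" >&2
      continue
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "bash scripts/gdc_download.sh $cohort"
    else
      bash scripts/gdc_download.sh "$cohort"
    fi
  done
}
```

For example, `DRY_RUN=1 download_cohorts PAAD SKCM` prints the two download commands without contacting the GDC, which is a cheap way to check the cohort names first.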
## Run Processing Pipeline
Example for the PAAD cohort:
```bash
bash scripts/process.sh PAAD data/prep
```
> Creates the file `data/prep/_GEXP/_GEXP_prep2_.tsv`, which is prepped for distance calculations
> Options for the cancer cohort include `ALL`, `BLCA`, `BRCA`, `COADREAD`, `ESO`, `HNSC`, `KID`, `LGGGBM`, `LIHCCHOL`, `LUNG`, `OV`, `PAAD`, `SARC`, `SKCM`, `UCEC`
For more details on each cancer cohort option see [Cohort Options Page](doc/cohort_options.md)
## Sample Subtype Classification Using Gene Expression Data
The goal of this analysis is to get cancer subtype predictions for HCMI samples (organoids, cell cultures, xenografts, etc.). To accomplish this, we use the top-performing pre-trained machine learning models (dockerized TMP models trained on pre-processed TCGA data). Specifically, we use gene expression data from the HCMI samples and eventually compare primary tumors to their corresponding models (organoids, cell cultures, xenografts, etc.).
The TMP models (pre-trained models) are specific to TCGA cancer cohorts (TCGA abbreviations); therefore, we split the HCMI data into TCGA cancer cohorts (based on sample metadata).
Run gene expression classifier pipeline:
```bash
# Arguments, in order: cancer cohort, tumor file, model file, transformed-data directory
bash scripts/run_classify_GEXP.sh \
PAAD \
data/prep/PAAD_GEXP/PAAD_GEXP_prep2_Tumor.tsv \
data/prep/PAAD_GEXP/PAAD_GEXP_prep2_Model.tsv \
data/classifier_gexp/ml_ready_qrank
```
Results can be found in `data/classifier_gexp/ml_predictions_qrank/combo/HCMI_TMPsubtype_qRank_.tsv`
*Note: The combination cohorts LUNG (LUAD and LUSC) and ESO (GEA and ESCC) are handled as their component cohorts during transformation and classification, then merged in the post-classification summary.*
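Because the classifier takes both a tumor file and a model file, it can help to confirm that both exist before starting a run. The sketch below is illustrative only: `gexp_input_path`, `require_files`, and `classify_gexp` are hypothetical helpers, and the path pattern is inferred from the PAAD example above on the assumption that other cohorts follow the same naming.

```bash
# Build an input path for a cohort, following the PAAD example above.
# Assumption: other cohorts use the same <COHORT>_GEXP naming pattern.
# $1 = cohort, $2 = sample type (Tumor or Model), $3 = prep directory (default: data/prep)
gexp_input_path() {
  printf '%s/%s_GEXP/%s_GEXP_prep2_%s.tsv\n' "${3:-data/prep}" "$1" "$1" "$2"
}

# Return 0 only if every given file exists and is non-empty.
require_files() {
  local f status=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "Missing or empty input: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Hypothetical wrapper: check the inputs, then call the classifier script
# with the same four arguments as the example above.
classify_gexp() {
  local cohort="$1" tumor model
  tumor="$(gexp_input_path "$cohort" Tumor)"
  model="$(gexp_input_path "$cohort" Model)"
  require_files "$tumor" "$model" || return 1
  bash scripts/run_classify_GEXP.sh "$cohort" "$tumor" "$model" \
    data/classifier_gexp/ml_ready_qrank
}
```

Calling `classify_gexp PAAD` would then stop early with a message naming the missing file, rather than failing partway through the pipeline.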
## Sample Subtype Classification Using DNA Methylation Data
The goal of this analysis is to get cancer subtype predictions for HCMI samples (organoids, cell cultures, xenografts, etc.). To accomplish this, we use the top-performing pre-trained machine learning models (dockerized TMP models trained on pre-processed TCGA data). Specifically, we use DNA methylation data from the HCMI samples and eventually compare primary tumors to their corresponding models (organoids, cell cultures, xenografts, etc.).
The TMP models (pre-trained models) are specific to TCGA cancer cohorts (TCGA abbreviations); therefore, we split the HCMI data into TCGA cancer cohorts (based on sample metadata).
Run DNA methylation classifier pipeline:
```bash
# Arguments, in order: cancer cohort, processed feature matrix file
bash scripts/run_classify_METHYL.sh \
SKCM \
data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_SKCM.tsv
```
Results can be found in `data/classifier_methyl/ml_predictions/combo/HCMI_METH_TMPsubtypes..tsv`
*Note: The combination cohorts LUNG (LUAD and LUSC) and ESO (GEA and ESCC) are handled as their component cohorts during transformation and classification, then merged in the post-classification summary.*
> *Second Example for Combination Cohort*
> ```bash
> bash scripts/run_classify_METHYL.sh \
> LUNG \
> data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_LUNG.tsv
> ```
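
The two examples above differ only in the cohort name, so several cohorts can be classified in a loop. This is a rough sketch under stated assumptions: `methyl_matrix_path`, `classify_methyl_cohorts`, and the `RUNNER` override are hypothetical, and the feature-matrix path (including the `20231211` prefix) is assumed to follow the SKCM and LUNG examples for every cohort.

```bash
# Build the processed feature-matrix path for a cohort.
# Assumption: every cohort follows the SKCM and LUNG examples above.
methyl_matrix_path() {
  printf 'data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_%s.tsv\n' "$1"
}

# Run the methylation classifier for each cohort and report which ones failed.
# RUNNER is a hypothetical override so the loop can be tried without the real
# script, e.g. RUNNER=echo prints the arguments instead of running anything.
classify_methyl_cohorts() {
  local cohort failed=""
  for cohort in "$@"; do
    # RUNNER is intentionally unquoted so "bash scripts/..." splits into words.
    if ! ${RUNNER:-bash scripts/run_classify_METHYL.sh} "$cohort" "$(methyl_matrix_path "$cohort")"; then
      failed="$failed $cohort"
    fi
  done
  if [ -n "$failed" ]; then
    echo "Failed cohorts:$failed" >&2
    return 1
  fi
}
```

A run such as `classify_methyl_cohorts SKCM LUNG` continues past a failing cohort and lists the failures at the end, so one bad input does not hide the results for the others.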