https://github.com/jordan2lee/classify-lab-models-and-tumors

Cancer subtype tool for tumors and their lab grown models

# Subtype Classification of Tumors and Derived Lab Grown Models


Molecular subtyping using the TMP Toolkit

## Table of contents
- [Quickstart Guide](#quickstart-guide)
- [Download Data from Manifest File Using the GDC Client](#download-data-from-manifest-file-using-the-gdc-client)
- [Run Processing Pipeline](#run-processing-pipeline)
- [Sample Subtype Classification Using Gene Expression Data](#sample-subtype-classification-using-gene-expression-data)
- [Sample Subtype Classification Using DNA Methylation Data](#sample-subtype-classification-using-dna-methylation-data)

## Quickstart Guide

### Setup

Install requirements - detailed instructions are found on the [Requirements page](doc/requirements.md):

1. Install Python 3+
2. Install GDC Data Transfer Tool Client

Ensure that all steps on the [Requirements page](doc/requirements.md) are completed *(includes creating the working environment, signing in, and manually downloading required data)*.
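
Before continuing, it can help to confirm that the two required tools are on your `PATH`. The following is a minimal sketch, not part of the repository: the helper names `have_cmd` and `check_requirements` are hypothetical, and it assumes the executables are called `python3` and `gdc-client`.

```bash
#!/usr/bin/env bash

# have_cmd NAME: succeed if NAME is found on PATH (hypothetical helper).
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# check_requirements: report each required tool and
# return non-zero if at least one is missing.
# Assumption: the tools are installed as "python3" and "gdc-client".
check_requirements() {
  local missing=0 tool
  for tool in python3 gdc-client; do
    if have_cmd "$tool"; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Source the file and run `check_requirements`; a non-zero exit status means at least one tool still needs to be installed.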

## Download Data from Manifest File Using the GDC Client
Download gene expression data:
```bash
bash scripts/gdc_download.sh PAAD
```

This creates subfolders in `data-raw/_GEXP_` and places the GDC molecular matrices there.

> Options for the cancer cohort include `ALL`, `BLCA`, `BRCA`, `COADREAD`, `ESO`, `HNSC`, `KID`, `LGGGBM`, `LIHCCHOL`, `LUNG`, `OV`, `PAAD`, `SARC`, `SKCM`, `UCEC`

For more details on each cancer cohort option see [Cohort Options Page](doc/cohort_options.md)
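
To download several cohorts in one go, the cohort name can be validated against the list above before calling the download script. This is only a sketch: `is_valid_cohort`, `download_cohorts`, and the `DRY_RUN` variable are hypothetical names introduced here, not part of the repository.

```bash
#!/usr/bin/env bash

# Cohort names copied from the options list above.
VALID_COHORTS="ALL BLCA BRCA COADREAD ESO HNSC KID LGGGBM LIHCCHOL LUNG OV PAAD SARC SKCM UCEC"

# is_valid_cohort NAME: succeed if NAME is one of the listed cohorts.
is_valid_cohort() {
  local name="$1" c
  for c in $VALID_COHORTS; do
    [ "$c" = "$name" ] && return 0
  done
  return 1
}

# download_cohorts NAME...: validate each name, then run the download script.
# Set DRY_RUN=1 (hypothetical switch) to print the commands instead of running them.
download_cohorts() {
  local cohort
  for cohort in "$@"; do
    if ! is_valid_cohort "$cohort"; then
      echo "Unknown cohort: $cohort" >&2
      return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "bash scripts/gdc_download.sh $cohort"
    else
      bash scripts/gdc_download.sh "$cohort" || return 1
    fi
  done
}
```

For example, `DRY_RUN=1 download_cohorts PAAD SKCM` prints the two download commands without contacting the GDC.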

## Run Processing Pipeline
Example shown for the PAAD cohort:
```bash
bash scripts/process.sh PAAD data/prep
```

> Creates the file `data/prep/_GEXP/_GEXP_prep2_.tsv`, which is prepared for distance calculations

> Options for the cancer cohort include `ALL`, `BLCA`, `BRCA`, `COADREAD`, `ESO`, `HNSC`, `KID`, `LGGGBM`, `LIHCCHOL`, `LUNG`, `OV`, `PAAD`, `SARC`, `SKCM`, `UCEC`

For more details on each cancer cohort option see [Cohort Options Page](doc/cohort_options.md)
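
When processing more than one cohort, it may be convenient to keep going after a failure and report the failed cohorts at the end. The sketch below assumes `scripts/process.sh` takes the cohort and output directory as shown above and exits non-zero on failure (an assumption). `process_cohorts` and the `PROCESS_RUNNER` override are hypothetical names added for illustration.

```bash
#!/usr/bin/env bash

# process_cohorts OUT_DIR COHORT...: run the processing script once per cohort,
# continue past failures, and list the cohorts that failed.
# PROCESS_RUNNER (hypothetical) lets you swap in another command, e.g. for a dry run.
process_cohorts() {
  local out_dir="$1"
  shift
  local runner="${PROCESS_RUNNER:-bash scripts/process.sh}"
  local failed=""
  local cohort
  for cohort in "$@"; do
    # $runner is intentionally unquoted so it splits into command and arguments.
    if ! $runner "$cohort" "$out_dir"; then
      failed="$failed $cohort"
    fi
  done
  if [ -n "$failed" ]; then
    echo "Processing failed for:$failed" >&2
    return 1
  fi
}
```

For example, `process_cohorts data/prep PAAD SKCM` runs both cohorts, and `PROCESS_RUNNER=echo process_cohorts data/prep PAAD SKCM` only prints the arguments that would be passed.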

## Sample Subtype Classification Using Gene Expression Data
The goal of this analysis is to get cancer subtype predictions for HCMI samples (organoids, cell cultures, xenografts, etc.). To accomplish this, we use the top-performing pre-trained machine learning models (dockerized TMP models that were trained on pre-processed TCGA data). Specifically, we are interested in using gene expression from the HCMI samples and eventually comparing primary tumors to their corresponding models (organoids, cell cultures, xenografts, etc.).

The TMP models (pre-trained models) are specific to TCGA cancer cohorts (TCGA abbreviations), so we split the HCMI data into TCGA cancer cohorts (based on sample metadata).

Run gene expression classifier pipeline:
```bash
# Arguments: cancer, tumor-file, model-file, transformed-dir
bash scripts/run_classify_GEXP.sh \
PAAD \
data/prep/PAAD_GEXP/PAAD_GEXP_prep2_Tumor.tsv \
data/prep/PAAD_GEXP/PAAD_GEXP_prep2_Model.tsv \
data/classifier_gexp/ml_ready_qrank
```

Results can be found in `data/classifier_gexp/ml_predictions_qrank/combo/HCMI_TMPsubtype_qRank_.tsv`

*Note: LUNG (which includes LUAD and LUSC) and ESO (which includes GEA and ESCC) are handled as their component cohorts during transformation and classification, then merged in the post-classification summary.*
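
Once a results file exists, a quick tally of predictions per subtype can serve as a sanity check. The sketch below assumes the file is tab-separated with a single header row. Which column holds the predicted subtype is not documented here, so the column number is passed in by the caller. `count_by_column` is a hypothetical helper.

```bash
#!/usr/bin/env bash

# count_by_column FILE COL: count how often each value occurs in column COL
# of a tab-separated file, skipping the header row.
# Assumption: FILE has exactly one header line.
count_by_column() {
  local file="$1" col="$2"
  awk -F'\t' -v col="$col" '
    NR > 1 { counts[$col]++ }
    END    { for (k in counts) print k "\t" counts[k] }
  ' "$file" | sort
}
```

For example, `count_by_column "$results_file" 2` tallies the second column of the file stored in `$results_file`; adjust the column number after inspecting the header with `head -n 1`.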

## Sample Subtype Classification Using DNA Methylation Data
The goal of this analysis is to get cancer subtype predictions for HCMI samples (organoids, cell cultures, xenografts, etc.). To accomplish this, we use the top-performing pre-trained machine learning models (dockerized TMP models that were trained on pre-processed TCGA data). Specifically, we are interested in using DNA methylation data from the HCMI samples and eventually comparing primary tumors to their corresponding models (organoids, cell cultures, xenografts, etc.).

The TMP models (pre-trained models) are specific to TCGA cancer cohorts (TCGA abbreviations), so we split the HCMI data into TCGA cancer cohorts (based on sample metadata).

Run DNA methylation classifier pipeline:
```bash
# Arguments: cancer, path to the feature matrix file
bash scripts/run_classify_METHYL.sh \
SKCM \
data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_SKCM.tsv
```

Results can be found in `data/classifier_methyl/ml_predictions/combo/HCMI_METH_TMPsubtypes..tsv`

*Note: LUNG (which includes LUAD and LUSC) and ESO (which includes GEA and ESCC) are handled as their component cohorts during transformation and classification, then merged in the post-classification summary.*

> *Second Example for Combination Cohort*
> ```bash
> bash scripts/run_classify_METHYL.sh \
> LUNG \
> data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_LUNG.tsv
> ```
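
The two examples above differ only in the cohort name, so a small loop can cover several cohorts. This is a sketch under assumptions: the feature matrix path pattern, including the `20231211` prefix, is inferred from the two examples and may differ for other runs. `methyl_matrix_path` and `classify_methyl_cohorts` are hypothetical helpers.

```bash
#!/usr/bin/env bash

# methyl_matrix_path COHORT: print the feature matrix path for COHORT.
# Assumption: every cohort follows the naming pattern of the SKCM and LUNG examples.
methyl_matrix_path() {
  echo "data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_$1.tsv"
}

# classify_methyl_cohorts COHORT...: run the methylation classifier for each
# cohort whose feature matrix exists and is non-empty; skip the rest.
classify_methyl_cohorts() {
  local cohort matrix
  for cohort in "$@"; do
    matrix="$(methyl_matrix_path "$cohort")"
    if [ ! -s "$matrix" ]; then
      echo "Skipping $cohort: $matrix is missing or empty" >&2
      continue
    fi
    bash scripts/run_classify_METHYL.sh "$cohort" "$matrix"
  done
}
```

For example, `classify_methyl_cohorts SKCM LUNG` runs both examples above in sequence, skipping any cohort whose matrix has not been prepared.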