https://github.com/bhklab/pgx_guidelines
# Drug Sensitivity Prediction From Cell Line-Based Pharmacogenomics Data: Guidelines for Developing Machine Learning Models

# Table of contents
1. [Installation](#installation)
2. [Datasets](#datasets)
3. [Experiments](#experiments)
4. [Citation](#citation)
# Installation
## Requirements
- Python 3
- Conda
To get the source files of PGx Guidelines, clone the repository:
```
git clone https://github.com/bhklab/PGx_Guidelines
```
### Conda environment
All the packages required to run the PGx Guidelines experiments are specified in the `environment` subdirectory.
To install these packages, run the following command from that subdirectory:
```
conda env create -f PGx.yml
```
This command creates the `PGxG` environment.
Once the packages are installed successfully, load the environment with `conda activate PGxG`.
# Datasets
## Download datasets
All of the datasets used in the PGx Guidelines experiments are publicly available in the `PSet` format via the ORCESTRA platform:
```
https://www.orcestra.ca/pset/stats
```
## Preprocess and load datasets
After downloading the `PSet` objects, extract the molecular and pharmacological data in `R` using the code provided in the `Preprocess data` subdirectory.
To load all datasets and the area above the dose-response curve (AAC) data, run `LoadAllPSets.R`.
To load log-transformed and truncated IC50 values, run `IC50Loading_logtruncated.R`.
The `tissueType_encoding.csv` file contains a one-hot encoding of tissue types, which is added to the molecular profiles to adjust for tissue type.
Running the `R` scripts generates the final datasets in `.tsv` format. Add them to a new subdirectory named `Data_All`:
```
mkdir Data_All
```
Once all the data files are in this subdirectory, you can re-run the PGx Guidelines experiments.
Alternatively, the preprocessed files are available on [Zenodo](https://zenodo.org/record/4642024#.YF9-FK9KiUk).
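The tissue-type adjustment described above can be illustrated with a minimal sketch. It assumes that "one-hot coding" means one 0/1 indicator column per tissue type appended to each cell line's feature vector. The function names and toy data are hypothetical and are not taken from the repository's scripts or from `tissueType_encoding.csv`.

```python
def one_hot_tissues(tissues):
    """Return (categories, rows), where each row is a 0/1 indicator vector."""
    categories = sorted(set(tissues))
    index = {name: i for i, name in enumerate(categories)}
    rows = []
    for tissue in tissues:
        row = [0] * len(categories)
        row[index[tissue]] = 1
        rows.append(row)
    return categories, rows


def append_tissue_columns(profiles, tissues):
    """Append one-hot tissue indicators to each molecular profile row."""
    _, encoded = one_hot_tissues(tissues)
    return [list(profile) + indicators
            for profile, indicators in zip(profiles, encoded)]


# Hypothetical toy data: three cell lines with two molecular features each.
profiles = [[0.2, 1.5], [0.9, 0.1], [0.4, 0.7]]
tissues = ["lung", "breast", "lung"]

augmented = append_tissue_columns(profiles, tissues)
print(augmented)
```

Each augmented row keeps the original features first and the tissue indicators last, so a model trained on these rows can learn a tissue-specific offset.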
# Experiments
## Run univariable analysis
Each `R` script loads the libraries and datasets it requires.
Run the script that matches the analysis you need:
- all tissues (solid and non-solid):
```
Rscript biomarker_analysis_alltissues.R "$@"
```
- after excluding non-solid tissues:
```
Rscript biomarker_analysis_solidonly.R "$@"
```
- after excluding non-solid tissues, with log-transformed IC50 values:
```
Rscript biomarker_analysis_log.R "$@"
```
- after excluding non-solid tissues, with truncated IC50 values:
```
Rscript biomarker_analysis_truncated.R "$@"
```
- after excluding non-solid tissues, with truncated and log-transformed IC50 values:
```
Rscript biomarker_analysis_truncated_log.R "$@"
```
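The IC50 variants listed above differ in whether values are truncated, log-transformed, or both. The sketch below shows one way these two steps can be combined. The upper cap of `10.0`, the natural logarithm, and the truncate-then-log order are assumptions chosen for illustration; the settings used in `IC50Loading_logtruncated.R` may differ.

```python
import math


def truncate_ic50(values, upper):
    """Cap each IC50 value at `upper` (a hypothetical threshold)."""
    return [min(v, upper) for v in values]


def log_transform(values):
    """Take the natural log of strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("IC50 values must be positive before taking logs")
    return [math.log(v) for v in values]


# Hypothetical IC50 values, including one extreme outlier.
ic50 = [0.5, 2.0, 150.0]

truncated = truncate_ic50(ic50, upper=10.0)   # truncated only
logged = log_transform(ic50)                  # log-transformed only
log_truncated = log_transform(truncated)      # truncated, then log-transformed
print(truncated, logged, log_truncated)
```

Truncation limits the influence of extreme values, and the log transform compresses the remaining range.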
## Within-domain
For this analysis, we have provided the `Python` scripts as follows:
- Ridge Regression: `Within-Ridge-aac.py` and `Within-Ridge-ic50.py`:
```
sbatch ridge-wjob-aac.bs
sbatch ridge-wjob-ic50.bs
```
- Elastic Net: `Within-EN-aac.py` and `Within-EN-ic50.py`:
```
sbatch en-wjob-aac.bs
sbatch en-wjob-ic50.bs
```
- Random Forest: `Within-RF-aac.py` and `Within-RF-ic50.py`:
```
sbatch rf-wjob-aac.bs
sbatch rf-wjob-ic50.bs
```
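As a rough illustration of within-domain evaluation, the sketch below partitions the samples of a single dataset into folds, so that training and test samples come from the same source. This assumes a k-fold cross-validation setup, which this README does not confirm. The helper name, fold count, and seed are hypothetical and are not taken from the `Within-*.py` scripts.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs that split range(n_samples) into k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Hypothetical dataset of 10 cell lines split into 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), "held out;", len(train_idx), "used for training")
```

Every sample is held out exactly once, so the per-fold test scores can be averaged into one within-domain estimate.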
## Cross-domain
For this analysis, we have provided Jupyter notebooks to run Ridge Regression (`Ridge.ipynb`), Elastic Net (`ElasticNet.ipynb`), and Random Forest (`RandomForest.ipynb`).
For the Deep Neural Network experiments, we have provided `Python` scripts in the `DNN` subdirectory. Before running them, create directories to store logs, models, and results, add the local paths of these directories to `PGxGRun.bs`, and then submit the job:
```
mkdir logs
mkdir models
mkdir results
sbatch PGxGRun.bs
```
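The directory setup above can also be scripted so that the absolute paths are printed, ready to be copied into `PGxGRun.bs`. This is a minimal sketch; `make_run_dirs` is a hypothetical helper, and how `PGxGRun.bs` expects the paths to be written is not specified here. The demo runs in a temporary directory; pass your own working directory instead.

```python
from pathlib import Path
import tempfile


def make_run_dirs(base, names=("logs", "models", "results")):
    """Create the run directories under `base` and return their absolute paths."""
    base = Path(base)
    created = {}
    for name in names:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created[name] = path.resolve()
    return created


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_run_dirs(tmp)
    for name, path in dirs.items():
        print(f"{name}: {path}")
    all_exist = all(path.is_dir() for path in dirs.values())
```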
We have also provided randomly generated hyperparameter settings in `filelistF10Uniquev1`.
We have provided the model objects for the best settings of DNN experiments on [Zenodo](https://zenodo.org/record/4642024#.YF9-FK9KiUk).
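To illustrate what randomly generated hyperparameter settings can look like, the sketch below draws distinct settings from a small discrete grid. The hyperparameter names, candidate values, and sampling scheme are hypothetical; they do not describe the contents or format of `filelistF10Uniquev1`.

```python
import random

# Hypothetical search space, for illustration only.
SEARCH_SPACE = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout": [0.1, 0.3, 0.5],
    "hidden_units": [64, 128, 256],
}


def sample_unique_settings(space, n, seed=0):
    """Draw n distinct settings uniformly at random from a discrete grid."""
    keys = sorted(space)
    total = 1
    for key in keys:
        total *= len(space[key])
    if n > total:
        raise ValueError("requested more settings than the grid contains")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        # Duplicates are absorbed by the set, so sampling continues until n are unique.
        seen.add(tuple(rng.choice(space[key]) for key in keys))
    return [dict(zip(keys, combo)) for combo in sorted(seen)]


settings = sample_unique_settings(SEARCH_SPACE, n=5)
for setting in settings:
    print(setting)
```

Fixing the seed makes the sampled settings reproducible across runs.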
## CTRPv2 vs. GDSCv1
For this analysis, we have provided the Jupyter notebook `GDSCv1.ipynb`.
## Impact of non-solid cell lines
For this analysis, we have provided the Jupyter notebook `SolidandnonSolid.ipynb`. To run the random subset experiment, run the `SNRidge-aac.py` script:
```
python SNRidge-aac.py
```
# Citation
```
@article{10.1093/bib/bbab294,
author = {Sharifi-Noghabi, Hossein and Jahangiri-Tazehkand, Soheil and Smirnov, Petr and Hon, Casey and Mammoliti, Anthony and Nair, Sisira Kadambat and Mer, Arvind Singh and Ester, Martin and Haibe-Kains, Benjamin},
title = "{Drug sensitivity prediction from cell line-based pharmacogenomics data: guidelines for developing machine learning models}",
journal = {Briefings in Bioinformatics},
year = {2021},
month = {08},
issn = {1477-4054},
doi = {10.1093/bib/bbab294},
url = {https://doi.org/10.1093/bib/bbab294},
note = {bbab294},
eprint = {https://academic.oup.com/bib/advance-article-pdf/doi/10.1093/bib/bbab294/39679532/bbab294.pdf},
}
```