Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/paccmann/paccmann_rl
Code pipeline for the PaccMann^RL in iScience: https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6
https://github.com/paccmann/paccmann_rl
de-novo-drug-design deep-learning drug-discovery drug-sensitivity generative-models transcriptomics
Last synced: 15 days ago
JSON representation
Code pipeline for the PaccMann^RL in iScience: https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6
- Host: GitHub
- URL: https://github.com/paccmann/paccmann_rl
- Owner: PaccMann
- License: mit
- Created: 2019-11-05T17:06:51.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2022-02-10T15:51:38.000Z (almost 3 years ago)
- Last Synced: 2023-10-20T22:14:45.463Z (about 1 year ago)
- Topics: de-novo-drug-design, deep-learning, drug-discovery, drug-sensitivity, generative-models, transcriptomics
- Homepage:
- Size: 31.3 KB
- Stars: 29
- Watchers: 6
- Forks: 9
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
[![Build Status](https://github.com/PaccMann/paccmann_rl/actions/workflows/build.yml/badge.svg)](https://github.com/PaccMann/paccmann_rl/actions/workflows/build.yml)
# paccmann_rlPipeline to reproduce the results of the [PaccMannRL paper](https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6) published in _iScience_.
## Description
In the repo we provide a conda environment and instructions to reproduce the pipeline described in the manuscript:
1. Train a multimodal drug sensitivity predictor ([source code](https://github.com/PaccMann/paccmann_predictor))
2. Train a generative model for omic profiles, also known as the PVAE ([source code](https://github.com/PaccMann/paccmann_omics))
3. Train a generative model for molecules, also known as the SVAE ([source code](https://github.com/PaccMann/paccmann_chemistry))
4. Train PaccMann^RL ([source code](https://github.com/PaccMann/paccmann_generator))## Requirements
- `conda>=3.7`
- The following data from [here](https://ibm.ent.box.com/v/paccmann-pytoda-data):
- The processed splitted data from the folder `splitted_data`
- The processed gene expression data from [GDSC](https://www.cancerrxgene.org/): `data/gene_expression/gdsc-rnaseq_gene-expression.csv`
- The processed SMILES from the drugs from [GDSC](https://www.cancerrxgene.org/): `data/smiles/gdsc.smi`
- A pickled [SMILESLanguage](https://github.com/PaccMann/paccmann_datasets/blob/master/pytoda/smiles/smiles_language.py) object (`data/smiles_language_chembl_gdsc_ccle.pkl`)
- A pickled list of genes representing the panel considered in the paper (`data/2128_genes.pkl`)
- A pickled pandas DataFrame containing expression values and metadata for the cell lines considered in the paper (`data/gdsc_transcriptomics_for_conditional_generation.pkl`)
- The git repos linked in the [previous section](#description)**NOTE:** please refer to the [README.md](https://ibm.ent.box.com/v/paccmann-pytoda-data/file/548614344106) and to the manuscript for details on the datasets used and the preprocessing applied.
## Setup
### Install the environment
Create a conda environment:
```sh
conda env create -f conda.yml
```Activate the environment:
```sh
conda activate paccmann_rl
```### Download data
Download the data reported in the [requirements section](#requirements).
From now on, we will assume that they are stored in the root of the repository in a folder called `data`, following this structure:```console
data
├── 2128_genes.pkl
├── gdsc-rnaseq_gene-expression.csv
├── gdsc.smi
├── gdsc_transcriptomics_for_conditional_generation.pkl
├── smiles_language_chembl_gdsc_ccle.pkl
└── splitted_data
├── gdsc_cell_line_ic50_test_fraction_0.1_id_997_seed_42.csv
├── gdsc_cell_line_ic50_train_fraction_0.9_id_997_seed_42.csv
├── tcga_rnaseq_test_fraction_0.1_id_242870585127480531622270373503581547167_seed_42.csv
├── tcga_rnaseq_train_fraction_0.9_id_242870585127480531622270373503581547167_seed_42.csv
├── test_chembl_22_clean_1576904_sorted_std_final.smi
└── train_chembl_22_clean_1576904_sorted_std_final.smi1 directory, 11 files
```**NOTE:** no worries, the `data` folder is in the [.gitignore](./.gitignore).
### Clone the repos
To get the scripts to run each of the component create a `code` folder and clone the repos. Simply type this:
```sh
mkdir code && cd code && \
git clone --branch 0.0.1 https://github.com/PaccMann/paccmann_predictor && \
git clone --branch 0.0.1 https://github.com/PaccMann/paccmann_omics && \
git clone --branch 0.0.1 https://github.com/PaccMann/paccmann_chemistry && \
git clone --branch 0.0.1 https://github.com/PaccMann/paccmann_generator && \
cd ..
```**NOTE:** no worries, the `code` folder is in the [.gitignore](./.gitignore).
## Pipeline
Now it's all set to run the full pipeline.
**NOTE:** the workload required to run the full pipeline is intesive and might not be straightforward to run all the steps on a desktop laptop. For this reason, we also provide [pretrained models](https://ibm.ent.box.com/v/paccmann-pytoda-data/folder/91897885403) that can be downloaded and used to run the different steps.
**NOTE:** in the following, we assume a folder `models` has been created in the root of the repository. No worries, the `models` folder is in the [.gitignore](./.gitignore).
### Multimodal drug sensitivity predictor
```console
(paccmann_rl) $ python ./code/paccmann_predictor/examples/train_paccmann.py \
./data/splitted_data/gdsc_cell_line_ic50_train_fraction_0.9_id_997_seed_42.csv \
./data/splitted_data/gdsc_cell_line_ic50_test_fraction_0.1_id_997_seed_42.csv \
./data/gdsc-rnaseq_gene-expression.csv \
./data/gdsc.smi \
./data/2128_genes.pkl \
./data/smiles_language_chembl_gdsc_ccle.pkl \
./models/ \
./code/paccmann_predictor/examples/example_params.json paccmann
```### PVAE
``` console
(paccmann_rl) $ python ./code/paccmann_omics/examples/train_vae.py \
./data/splitted_data/tcga_rnaseq_train_fraction_0.9_id_242870585127480531622270373503581547167_seed_42.csv \
./data/splitted_data/tcga_rnaseq_test_fraction_0.1_id_242870585127480531622270373503581547167_seed_42.csv \
./data/2128_genes.pkl \
./models/ \
./code/paccmann_omics/examples/example_params.json pvae
```### SVAE
``` console
(paccmann_rl) $ python ./code/paccmann_chemistry/examples/train_vae.py \
./data/splitted_data/train_chembl_22_clean_1576904_sorted_std_final.smi \
./data/splitted_data/test_chembl_22_clean_1576904_sorted_std_final.smi \
./data/smiles_language_chembl_gdsc_ccle.pkl \
./models/ \
./code/paccmann_chemistry/examples/example_params.json svae
```### PaccMann^RL
``` console
(paccmann_rl) $ python ./code/paccmann_generator/examples/train_paccmann_rl.py \
./models/svae \
./models/pvae \
./models/paccmann \
./data/smiles_language_chembl_gdsc_ccle.pkl \
./data/gdsc_transcriptomics_for_conditional_generation.pkl \
./code/paccmann_generator/examples/example_params.json \
paccmann_rl breast
```**NOTE:** this will create a `biased_model` folder containing the conditional generator and the baseline SMILES generator used. In this case: `breast_paccmann_rl` and `baseline`. No worries, the `biased_models` folder is in the [.gitignore](./.gitignore).
## References
If you use `paccmann_rl` in your projects, please cite the following:
```bib
@article{born2021paccmannrl,
title = {PaccMann\textsuperscript{RL}: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning},
journal = {iScience},
volume = {24},
number = {4},
pages = {102269},
year = {2021},
issn = {2589-0042},
doi = {https://doi.org/10.1016/j.isci.2021.102269},
url = {https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6},
author = {Born, Jannis and Manica, Matteo and Oskooei, Ali and Cadow, Joris and Markert, Greta and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a}
}
```