# ECNet
An evolutionary context-integrated deep learning framework for protein engineering

- [ECNet](#ecnet)
- [Overview](#overview)
- [Installation](#installation)
- [Dependencies](#dependencies)
- [Quick Example](#quick-example)
- [Running on your own data](#running-on-your-own-data)
- [Generate local features using HHblits and CCMPred](#generate-local-features-using-hhblits-and-ccmpred)
- [Train on dataset A and test on dataset B](#train-on-dataset-a-and-test-on-dataset-b)
- [Citation](#citation)
- [Contact](#contact)

## Overview
ECNet (evolutionary context-integrated neural network) is a deep learning model that guides protein engineering by predicting protein fitness from the sequence. It integrates local evolutionary context from homologous sequences that explicitly model residue-residue epistasis for the protein of interest with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. Please see our *Nature Communications* [paper](https://doi.org/10.1038/s41467-021-25976-8) for details.
![ECNet](doc/overview.png)
## Installation
Clone the GitHub repository and add its directory to `PYTHONPATH`:
```bash
git clone https://github.com/luoyunan/ECNet.git
cd ECNet
export PYTHONPATH=$PWD:$PYTHONPATH
```
## Dependencies
This package was tested with `Python 3.7` and `CUDA 10.1` on `Ubuntu 18.04`, with access to an Nvidia GeForce TITAN X GPU (12GB RAM) and an Intel Xeon E5-2650 v3 CPU (2.30 GHz, 512G RAM). See `requirements.txt` for the required Python dependencies, all of which can be installed with `pip` or `conda`. Because of an issue installing `pytorch 1.4.0` with `pip`, please install `pytorch` with `conda` first.
```bash
conda install pytorch==1.4.0 cudatoolkit=10.1 -c pytorch
pip install -r requirements.txt
```

## Quick Example
1. Download example data (~5.4MB) from Dropbox.
```bash
wget https://www.dropbox.com/s/nkgubuwfwiyy0ze/data.tar.gz
tar xf data.tar.gz
```
2. Run the example script. The following script trains an ECNet model using the fitness data of the second RRM domain of Pab1 ([source](https://rnajournal.cshlp.org/content/19/11/1537.long)). The script randomly splits the data into 70% training, 10% validation, and 20% test sets.
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
--train data/RRM_single.tsv \
--fasta data/RRM.fasta \
--local_feature data/RRM.braw \
--output_dir ./output/RRM_CV \
--save_prediction \
--n_ensembles 2 \
--epochs 100
```
This example typically takes no more than 15 min to run in our tested environment. The output (printed to stdout) is the correlation between the predicted and ground-truth fitness values.
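
The random 70/10/20 split mentioned above can be pictured with a short sketch. This is not ECNet's own splitting code: the function name `split_by_ratio` and the fixed seed are assumptions made purely for illustration.

```python
import random


def split_by_ratio(items, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle items and cut them into consecutive parts sized by ratios."""
    if abs(sum(ratios) - 1.0) > 1e-8:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    parts, start = [], 0
    for i, ratio in enumerate(ratios):
        if i == len(ratios) - 1:
            end = len(shuffled)  # last part takes the remainder
        else:
            end = start + round(ratio * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


train, valid, test = split_by_ratio(range(100))
print(len(train), len(valid), len(test))  # 70 10 20
```

Passing two ratios, e.g. `ratios=(0.9, 0.1)`, yields only a train and a validation part, which mirrors the case where a separate test file is supplied.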

## Running on your own data
ECNet has two required input files: 1) a FASTA file of the wild-type sequence, and 2) a TSV file that describes the fitness values of variants. Optional input files include the output of CCMPred for extracting local features and a separate test TSV file.

1. **Sequence FASTA file** (`--fasta`, required). A regular FASTA file of the wild-type sequence. This file should contain only one sequence.
2. **Fitness TSV file** (`--train`, required). Each line has two tab-separated columns, `mutation` and `score`, describing the fitness value of a variant. The `mutation` column is a string in the format `[ref][pos][alt]`, e.g., `S100T`, meaning that the 100th amino acid (index starting from 1) is mutated from `S` to `T`. If a variant has multiple mutations, `;` is used to concatenate them. The `score` column is a numerical value that quantifies the variant's fitness. Example:
```
mutation score
M1S 1.0
F12I;L30K 2.0
G89A 0.06
```
Note: This file is supplied using the `--train` argument. If no separate test data is provided through the `--test` argument, this TSV file is split into three sets (train, valid, and test) using the ratios specified by `--split_ratio` (three float numbers). If a separate test TSV file is provided, this TSV file is split into two sets (train and valid) as specified by `--split_ratio` (two float numbers).
3. **Local features** (`--local_feature`, optional). A binary file generated by CCMPred using the `-b` option. To use `-b`, you need to install CCMPred from its latest GitHub branch instead of the release; you may also need to install `libmsgpack-dev` (see the instructions [below](#generate-local-features-using-hhblits-and-ccmpred) for generating this file with HHblits and CCMPred). ECNet extracts local features from this file. If the file is not provided, add the `--no_local_feature` flag when running `run_example.py` (or, equivalently, set `use_local_features=False` for the `ECNet` class), and ECNet will not use local features.
4. **Additional test TSV file** (`--test`, optional). This file has the same format as the `--train` TSV file.
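
To make the `mutation` column format concrete, here is a minimal sketch that parses a variant string and applies it to a wild-type sequence. It is not taken from the ECNet code base: the helper names `parse_variant` and `apply_variant` and the toy sequence are hypothetical, and the regular expression assumes single uppercase letters for residues.

```python
import re

# [ref][pos][alt], e.g. S100T; assumes one uppercase letter per residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant(variant):
    """Split 'F12I;L30K' into [(ref, pos, alt), ...] with 1-based positions."""
    mutations = []
    for token in variant.split(";"):
        match = MUTATION_RE.match(token.strip())
        if match is None:
            raise ValueError(f"malformed mutation: {token!r}")
        ref, pos, alt = match.groups()
        mutations.append((ref, int(pos), alt))
    return mutations


def apply_variant(wild_type, variant):
    """Return the mutated sequence, checking each reference residue."""
    seq = list(wild_type)
    for ref, pos, alt in parse_variant(variant):
        if not 1 <= pos <= len(wild_type):
            raise ValueError(f"position {pos} is outside the sequence")
        if wild_type[pos - 1] != ref:
            raise ValueError(f"expected {ref} at position {pos}")
        seq[pos - 1] = alt  # convert the 1-based position to a 0-based index
    return "".join(seq)


# Toy wild-type sequence, made up for illustration only.
wild_type = "MKTAYIAKQR"
print(parse_variant("M1S;A4G"))
print(apply_variant(wild_type, "M1S;A4G"))  # SKTGYIAKQR
```

A check like this can catch off-by-one position errors in a fitness TSV file before training, since a mismatch between `ref` and the FASTA sequence raises an error.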

We suggest that users tune the hyperparameters for each new protein. Several hyperparameters are exposed as arguments, e.g., `d_embed`, `d_model`, `d_h`, `n_layers`, etc.
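
One simple way to organize such tuning is a small grid of command lines, sketched below. The flag spellings `--d_embed` and `--n_layers` are assumed from the argument names listed above, and the candidate values and per-run output directory names are placeholders rather than recommended settings.

```python
import itertools
import shlex

# Hypothetical search space; the values are placeholders, not recommendations.
grid = {"d_embed": [16, 32], "n_layers": [1, 2]}

# Base command taken from the quick example.
base = [
    "python", "scripts/run_example.py",
    "--train", "data/RRM_single.tsv",
    "--fasta", "data/RRM.fasta",
    "--local_feature", "data/RRM.braw",
]


def grid_commands(base, grid, output_root="./output/RRM"):
    """Build one argument list per hyperparameter combination."""
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        cmd = list(base)
        for name, value in zip(names, values):
            cmd += [f"--{name}", str(value)]
        # Give each run its own output directory so results are not overwritten.
        tag = "_".join(f"{n}{v}" for n, v in zip(names, values))
        cmd += ["--output_dir", f"{output_root}_{tag}"]
        commands.append(cmd)
    return commands


commands = grid_commands(base, grid)
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed lines can be reviewed and then run one at a time; comparing the validation correlation across runs is one way to choose a setting.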

## Generate local features using HHblits and CCMPred
1. Install [HHsuite](https://github.com/soedinglab/hh-suite) and [CCMPred](https://github.com/soedinglab/CCMpred) following their instructions. Note that CCMPred should be installed from the latest branch instead of the release, otherwise the `-b` option is not available. Also, as CCMPred uses `msgpack` to create the binary file, you may also need to install `libmsgpack-dev` on your system if it is not available. For example, on Ubuntu, you can run `sudo apt update` then `sudo apt install libmsgpack-dev`.
2. Prepare a FASTA file `example.fasta` of the wild-type sequence of the protein of interest.
3. Search for homologous sequences of the wild-type sequence using `hhblits` in HHsuite. (There are multiple ways to search for homologous sequences and format the alignment. Below we describe one way that uses hhblits. Other ways are also feasible, e.g., using jackhmmer as described in the [DeepSequence](https://www.nature.com/articles/s41592-018-0138-4) paper.)
```bash
hhblits -i example.fasta \
-d ${path_to_hhblits_database} \
-o example.hhr \
-oa3m example.a3m \
-n 3 \
-id 99 \
-cov 50 \
-cpu 8
```
4. Reformat the a3m output of hhblits to the "PSICOV" format required by CCMpred (solution modified from [here](https://github.com/soedinglab/bbcontacts/blob/master/TUTORIAL.md#step-13-reformat-the-output-alignment)). First use the `reformat.pl` script from the `hh-suite/scripts` directory to get an alignment in FASTA format, then use `convert_alignment.py` from the `CCMpred/scripts` directory to get the PSICOV format:
```bash
${path_to_hh-suite}/scripts/reformat.pl example.a3m example.fas -r
python ${path_to_CCMpred}/scripts/convert_alignment.py example.fas fasta example.psc
```
5. Run CCMPred:
```bash
ccmpred example.psc example.mat -b example.braw -d 0
```
6. Use the argument `--local_feature example.braw` to provide the local features to ECNet.
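
Steps 3 to 5 can also be collected into one script. The sketch below only assembles the commands shown above as argument lists and prints them; the function name `local_feature_commands` and the example paths are placeholders to adapt to your own installation.

```python
import shlex


def local_feature_commands(name, hhblits_db, hhsuite_dir, ccmpred_dir, cpus=8):
    """Return the commands of steps 3-5 as argument lists, in order."""
    return [
        # Step 3: search for homologous sequences.
        ["hhblits", "-i", f"{name}.fasta", "-d", hhblits_db,
         "-o", f"{name}.hhr", "-oa3m", f"{name}.a3m",
         "-n", "3", "-id", "99", "-cov", "50", "-cpu", str(cpus)],
        # Step 4: a3m -> FASTA alignment -> PSICOV format.
        [f"{hhsuite_dir}/scripts/reformat.pl", f"{name}.a3m", f"{name}.fas", "-r"],
        ["python", f"{ccmpred_dir}/scripts/convert_alignment.py",
         f"{name}.fas", "fasta", f"{name}.psc"],
        # Step 5: run CCMPred and write the binary file with -b.
        ["ccmpred", f"{name}.psc", f"{name}.mat", "-b", f"{name}.braw", "-d", "0"],
    ]


# Placeholder paths; replace them with the locations on your system.
commands = local_feature_commands(
    "example", "/path/to/hhblits_database", "/path/to/hh-suite", "/path/to/CCMpred"
)
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To execute the steps rather than print them, each list can be passed to `subprocess.run(cmd, check=True)`, which stops the pipeline at the first failing tool.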

## Train on dataset A and test on dataset B
The following examples show how to train ECNet on dataset A (passed via `--train`) and test it on another dataset B (passed via `--test`).
- Example 1: train on single-mutant fitness data of RRM ([source](https://rnajournal.cshlp.org/content/19/11/1537.long)) and predict on double mutants:
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
--train data/RRM_single.tsv \
--test data/RRM_double.tsv \
--fasta data/RRM.fasta \
--split_ratio 0.9 0.1 \
--local_feature data/RRM.braw \
--output_dir ./output/RRM \
--save_checkpoint \
--n_ensembles 2 \
--epochs 100
```
- Example 2: you can also load the trained model using the `--saved_model_dir` argument and predict on the test dataset:
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
--test data/RRM_double.tsv \
--fasta data/RRM.fasta \
--local_feature data/RRM.braw \
--n_ensembles 2 \
--output_dir ./output/RRM \
--saved_model_dir ./output/RRM
```

## Citation
> Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. *Nat Commun* **12**, 5743 (2021). https://doi.org/10.1038/s41467-021-25976-8

```
@article{luo2021ecnet,
doi = {10.1038/s41467-021-25976-8},
url = {https://doi.org/10.1038/s41467-021-25976-8},
year = {2021},
month = sep,
publisher = {Springer Science and Business Media {LLC}},
volume = {12},
number = {1},
author = {Yunan Luo and Guangde Jiang and Tianhao Yu and Yang Liu and Lam Vo and Hantian Ding and Yufeng Su and Wesley Wei Qian and Huimin Zhao and Jian Peng},
title = {{ECNet} is an evolutionary context-integrated deep learning framework for protein engineering},
journal = {Nature Communications}
}
```
## Contact
Please submit GitHub issues or contact Yunan Luo (luoyunan[at]gmail[dot]com) for any questions related to the source code.