Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dohlee/eve-pytorch
Implementation of evolutionary model of variant effect (EVE), a deep generative model of evolutionary data, in PyTorch.
https://github.com/dohlee/eve-pytorch
bioinformatics biology computational-biology deep-learning evolution reproduction reproduction-code variant-effect-prediction variational-autoencoder
Last synced: 10 days ago
JSON representation
Implementation of evolutionary model of variant effect (EVE), a deep generative model of evolutionary data, in PyTorch.
- Host: GitHub
- URL: https://github.com/dohlee/eve-pytorch
- Owner: dohlee
- Created: 2023-02-16T08:37:53.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-03-04T16:13:54.000Z (almost 2 years ago)
- Last Synced: 2024-12-12T07:48:35.747Z (about 1 month ago)
- Topics: bioinformatics, biology, computational-biology, deep-learning, evolution, reproduction, reproduction-code, variant-effect-prediction, variational-autoencoder
- Language: Python
- Homepage:
- Size: 134 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# eve-pytorch
![model](img/banner.png)
Implementation of evolutionary model of variant effect (EVE), a deep generative model of evolutionary data, in PyTorch. It's just an re-implementation of the official model for my own learning purpose, which is also implemented in PyTorch. The official implementation can be found [here](https://github.com/OATML-Markslab/EVE).
## Installation
```bash
$ pip install eve-pytorch
```## Usage
```python
import torch
from eve_pytorch import EVESEQ_LEN = 1000
ALPHABET_SIZE = 21model = EVE(seq_len=SEQ_LEN, alphabet_size=ALPHABET_SIZE)
# ... training ...
x = torch.randn(1, 4, 1000)
# If you want to get the reconstructed sequence only,
x_reconstructed = model(x, return_latent=False)# or, if you want to get the latent variables
x_reconstructed, z_mu, z_log_var = model(x, return_latent=True)
```## Training
```bash
$ python -m eve_pytorch.train \
--msa data/msa.filtered.a2m \ # Multiple sequence alignment.
--output ckpts/best_checkpoint.pt \
--use-wandb # Optional, for logging
```## Computing evolutionary index
The authors defined an **evolutionary index** of a protein variant as the relative fitness of mutated sequence $\mathbf{s}$ compared with that of a wildtype sequence $\mathbf{w}$. Since computing the exact log-likelihood is intractable, the authors approximated the log-likelihood ratio of the two sequences as the difference between the ELBO of the two sequences:$$ELBO(\mathbf{w}) - ELBO(\mathbf{s})$$
In this reproduction, I implemented `EVE.compute_evolutionary_index()` method to compute the evolutionary index of a protein variant. The method takes two sequences as input, and returns the evolutionary index of the variant. Optionally, you can tweak `num_samples` parameter to control the number of samples for the Monte Carlo sampling of latent vectors.
```python
import torch
from eve_pytorch import EVESEQ_LEN = 1000
ALPHABET_SIZE = 21model = EVE(seq_len=SEQ_LEN, alphabet_size=ALPHABET_SIZE)
model.load_state_dict(torch.load('path/to/best/checkpoint.pt'))wt_seq = # One-hot encoded wildtype amino acid sequence.
mut_seq = # One-hot encoded mutated amino acid sequence.model.eval()
with torch.no_grad():
model.compute_evolutionary_index(wt_seq, mut_seq, num_samples=20_000)
```### Example
The example below shows how to compute the evolutionary indices of three variants in TP53 gene.
F109Q and V173Y are pathogenic variants, while R273C is a benign variant.
Note the difference of the evolutionary indices between the pathogenic variants and the benign variant.
```python
from Bio import SeqIOdef get_sequence_length(a2m_fp):
"""Get the sequence length of the first sequence in the a2m file.
a2m_fp: Path to a2m file.
"""
for record in SeqIO.parse(a2m_fp, "fasta"):
return len(record.seq)
a2i = {a:i for i, a in enumerate('ACDEFGHIKLMNPQRSTVWY-')}
def one_hot_encode_amino_acid(sequence):
return torch.eye(len(a2i))[[a2i[a] for a in sequence]].T
model = EVE(seq_len=100).cuda()msa = 'data/P53_HUMAN_b01.filtered.a2m'
ALPHABET_SIZE = 21print('Loading pretrained model.')
model = EVE(seq_len=get_sequence_length(msa), alphabet_size=ALPHABET_SIZE)
model.load_state_dict(torch.load('ckpts/TP53.best.pt'))
model.cuda()wt_seq = """LSPDDIEQWFTEDPGDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQG
SYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQ
SQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEV
GSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTE
EENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELN
EALELKDAQAGKEPGGSRAHSSHLKSKKG""".replace('\n', '')# F109Q
mut_seq_pathogenic1 = """LSPDDIEQWFTEDPGDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQG
SYGQRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQ
SQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEV
GSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTE
EENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELN
EALELKDAQAGKEPGGSRAHSSHLKSKKG""".replace('\n', '')
# V173Y
mut_seq_pathogenic2 = """LSPDDIEQWFTEDPGDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQG
SYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQ
SQHMTEVYRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEV
GSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTE
EENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELN
EALELKDAQAGKEPGGSRAHSSHLKSKKG""".replace('\n', '')# M66V
mut_seq_benign = """LSPDDIEQWFTEDPGDEAPRVPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQG
SYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQ
SQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEV
GSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTE
EENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELN
EALELKDAQAGKEPGGSRAHSSHLKSKKG""".replace('\n', '')wt_seq, mut_seq_benign, mut_seq_pathogenic1, mut_seq_pathogenic2 = map(
lambda x: one_hot_encode_amino_acid(x).cuda(),
[wt_seq, mut_seq_benign, mut_seq_pathogenic1, mut_seq_pathogenic2]
)model.eval()
with torch.no_grad():
print(model.compute_evolutionary_index(wt_seq, mut_seq_pathogenic1)) # 9.0448 (may vary slightly)
print(model.compute_evolutionary_index(wt_seq, mut_seq_pathogenic2)) # 11.9176 (may vary slightly)
print(model.compute_evolutionary_index(wt_seq, mut_seq_benign)) # -0.4336 (may vary slightly)```
## Citations
```bibtex
@article{frazer2021disease,
title={Disease variant prediction with deep generative models of evolutionary data},
author={Frazer, Jonathan and Notin, Pascal and Dias, Mafalda and Gomez, Aidan and Min, Joseph K and Brock, Kelly and Gal, Yarin and Marks, Debora S},
journal={Nature},
volume={599},
number={7883},
pages={91--95},
year={2021},
publisher={Nature Publishing Group UK London}
}
```