https://github.com/dohlee/eve-pytorch

Implementation of evolutionary model of variant effect (EVE), a deep generative model of evolutionary data, in PyTorch.
https://github.com/dohlee/eve-pytorch

bioinformatics biology computational-biology deep-learning evolution reproduction reproduction-code variant-effect-prediction variational-autoencoder

Last synced: 5 months ago
JSON representation

Implementation of evolutionary model of variant effect (EVE), a deep generative model of evolutionary data, in PyTorch.

Host: GitHub
URL: https://github.com/dohlee/eve-pytorch
Owner: dohlee
Created: 2023-02-16T08:37:53.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2023-03-04T16:13:54.000Z (over 2 years ago)
Last Synced: 2025-02-11T13:05:29.315Z (5 months ago)
Topics: bioinformatics, biology, computational-biology, deep-learning, evolution, reproduction, reproduction-code, variant-effect-prediction, variational-autoencoder
Language: Python
Homepage:
Size: 134 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        # eve-pytorch

![model](img/banner.png)

Implementation of evolutionary model of variant effect (EVE), a deep generative model of evolutionary data, in PyTorch. It's just an re-implementation of the official model for my own learning purpose, which is also implemented in PyTorch. The official implementation can be found [here](https://github.com/OATML-Markslab/EVE).

## Installation

```bash

$ pip install eve-pytorch

```

## Usage

```python

import torch

from eve_pytorch import EVE

SEQ_LEN = 1000

ALPHABET_SIZE = 21

model = EVE(seq_len=SEQ_LEN, alphabet_size=ALPHABET_SIZE)

# ... training ...

x = torch.randn(1, 4, 1000)

# If you want to get the reconstructed sequence only,

x_reconstructed = model(x, return_latent=False)

# or, if you want to get the latent variables

x_reconstructed, z_mu, z_log_var = model(x, return_latent=True)

```

## Training

```bash

$ python -m eve_pytorch.train \

  --msa data/msa.filtered.a2m  \ # Multiple sequence alignment.

  --output ckpts/best_checkpoint.pt \

  --use-wandb  # Optional, for logging

```

## Computing evolutionary index

The authors defined an **evolutionary index** of a protein variant as the relative fitness of mutated sequence $\mathbf{s}$ compared with that of a wildtype sequence $\mathbf{w}$. Since computing the exact log-likelihood is intractable, the authors approximated the log-likelihood ratio of the two sequences as the difference between the ELBO of the two sequences:

$$ELBO(\mathbf{w}) - ELBO(\mathbf{s})$$

In this reproduction, I implemented `EVE.compute_evolutionary_index()` method to compute the evolutionary index of a protein variant. The method takes two sequences as input, and returns the evolutionary index of the variant. Optionally, you can tweak `num_samples` parameter to control the number of samples for the Monte Carlo sampling of latent vectors.

```python

import torch

from eve_pytorch import EVE

SEQ_LEN = 1000

ALPHABET_SIZE = 21

model = EVE(seq_len=SEQ_LEN, alphabet_size=ALPHABET_SIZE)

model.load_state_dict(torch.load('path/to/best/checkpoint.pt'))

wt_seq = # One-hot encoded wildtype amino acid sequence.

mut_seq = # One-hot encoded mutated amino acid sequence.

model.eval()

with torch.no_grad():

  model.compute_evolutionary_index(wt_seq, mut_seq, num_samples=20_000)

```

### Example

The example below shows how to compute the evolutionary indices of three variants in TP53 gene.

F109Q and V173Y are pathogenic variants, while R273C is a benign variant.

Note the difference of the evolutionary indices between the pathogenic variants and the benign variant.

```python

from Bio import SeqIO

def get_sequence_length(a2m_fp):

    """Get the sequence length of the first sequence in the a2m file.

    a2m_fp: Path to a2m file.

    """

    for record in SeqIO.parse(a2m_fp, "fasta"):

        return len(record.seq)

    

a2i = {a:i for i, a in enumerate('ACDEFGHIKLMNPQRSTVWY-')}

def one_hot_encode_amino_acid(sequence):

    return torch.eye(len(a2i))[[a2i[a] for a in sequence]].T

    

model = EVE(seq_len=100).cuda()

msa = 'data/P53_HUMAN_b01.filtered.a2m'

ALPHABET_SIZE = 21

print('Loading pretrained model.')

model = EVE(seq_len=get_sequence_length(msa), alphabet_size=ALPHABET_SIZE)

model.load_state_dict(torch.load('ckpts/TP53.best.pt'))

model.cuda()

wt_seq = """LSPDDIEQWFTEDPGDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQG

SYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQ

SQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEV

GSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTE

EENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELN

EALELKDAQAGKEPGGSRAHSSHLKSKKG""".replace('\n', '')

# F109Q

mut_seq_pathogenic1 = """LSPDDIEQWFTEDPGDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQG

SYGQRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQ

SQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEV

GSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTE

EENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELN

EALELKDAQAGKEPGGSRAHSSHLKSKKG""".replace('\n', '')

    

# V173Y

mut_seq_pathogenic2 = """LSPDDIEQWFTEDPGDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQG

SYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQ

SQHMTEVYRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEV

GSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTE

EENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELN

EALELKDAQAGKEPGGSRAHSSHLKSKKG""".replace('\n', '')

# M66V

mut_seq_benign = """LSPDDIEQWFTEDPGDEAPRVPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQG

SYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQ

SQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEV

GSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTE

EENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELN

EALELKDAQAGKEPGGSRAHSSHLKSKKG""".replace('\n', '')

wt_seq, mut_seq_benign, mut_seq_pathogenic1, mut_seq_pathogenic2 = map(

    lambda x: one_hot_encode_amino_acid(x).cuda(),

    [wt_seq, mut_seq_benign, mut_seq_pathogenic1, mut_seq_pathogenic2]

)

model.eval()

with torch.no_grad():

    print(model.compute_evolutionary_index(wt_seq, mut_seq_pathogenic1))  # 9.0448 (may vary slightly)

    print(model.compute_evolutionary_index(wt_seq, mut_seq_pathogenic2))  # 11.9176 (may vary slightly)

    print(model.compute_evolutionary_index(wt_seq, mut_seq_benign))       # -0.4336 (may vary slightly)

```

## Citations

```bibtex

@article{frazer2021disease,

  title={Disease variant prediction with deep generative models of evolutionary data},

  author={Frazer, Jonathan and Notin, Pascal and Dias, Mafalda and Gomez, Aidan and Min, Joseph K and Brock, Kelly and Gal, Yarin and Marks, Debora S},

  journal={Nature},

  volume={599},

  number={7883},

  pages={91--95},

  year={2021},

  publisher={Nature Publishing Group UK London}

}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/dohlee/eve-pytorch

Awesome Lists containing this project

README