https://github.com/OpenProteinAI/PoET

Inference code for PoET: A generative model of protein families as sequences-of-sequences

deep-learning generative-model protein-engineering protein-language-model protein-sequences proteins

Host: GitHub
URL: https://github.com/OpenProteinAI/PoET
Owner: OpenProteinAI
License: mit
Created: 2023-10-28T01:30:26.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-04-24T21:15:33.000Z (12 months ago)
Last Synced: 2024-08-03T14:08:47.454Z (9 months ago)
Topics: deep-learning, generative-model, protein-engineering, protein-language-model, protein-sequences, proteins
Language: Python
Homepage:
Size: 772 KB
Stars: 44
Watchers: 3
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-protein-design-software - PoET - [paper](https://doi.org/10.48550/arXiv.2306.06156) (Sequence generation)

README

# PoET: A generative model of protein families as sequences-of-sequences

This repo contains inference code for ["PoET: A generative model of protein families as sequences-of-sequences"](https://arxiv.org/abs/2306.06156), a state-of-the-art protein language model for variant effect prediction and conditional sequence generation.

## Environment Setup

1. Have `mamba` (faster alternative to `conda`) installed ([Instructions](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html))
1. Have `conda-lock` installed in your base conda/mamba environment ([Instructions](https://github.com/conda/conda-lock#installation))
1. Run `make create_conda_env`. This will create a conda environment named `poet`.
1. Run `make download_model` to download the model (~400MB). The model will be located at `data/poet.ckpt`. Please note the [license](#License).

## Scoring variants

Use the script `scripts/score.py` to obtain fitness scores for a list of protein variants, given an MSA of homologs of the wild-type (WT) sequence.

1. Be on a machine with an NVIDIA GPU. The model cannot run on CPU only.
1. Activate the `poet` conda environment
1. Run the script, replacing the values in angle brackets with the appropriate paths.

```
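python scripts/score.py \
--msa_a3m_path <path to input MSA in a3m format> \
--variants_fasta_path <path to variants to score in fasta format> \
--output_npy_path <path to output scores as npy file>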
```

You can pass a lower value for the batch size (`--batch_size`) if you run out of VRAM. The script was tested on an A100 GPU with 40GB VRAM.
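If you call the script from Python, for example to retry with a smaller batch size after an out-of-memory error, the command line can be assembled as in the sketch below. It uses only the four flags named in this README. The helper names `build_score_command` and `run_scoring` are hypothetical and not part of the repository, and the sketch assumes it is run from the repository root inside the `poet` environment.

```python
import subprocess
import sys


def build_score_command(msa_a3m_path, variants_fasta_path, output_npy_path,
                        batch_size=None):
    """Build the argument list for scripts/score.py (hypothetical helper)."""
    command = [
        sys.executable, "scripts/score.py",
        "--msa_a3m_path", str(msa_a3m_path),
        "--variants_fasta_path", str(variants_fasta_path),
        "--output_npy_path", str(output_npy_path),
    ]
    # Only pass --batch_size when the caller overrides the script's default.
    if batch_size is not None:
        command += ["--batch_size", str(batch_size)]
    return command


def run_scoring(*args, **kwargs):
    """Run the scoring script and raise CalledProcessError if it fails."""
    subprocess.run(build_score_command(*args, **kwargs), check=True)
```

For instance, `run_scoring("my.a3m", "my_variants.fasta", "my_scores.npy", batch_size=4)` would run the script with a batch size of 4; the file names and the value 4 are placeholders, not recommendations.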

## Example

Run the scoring script without arguments (`python scripts/score.py`) to score variants in the `BLAT_ECOLX_Jacquier_2013` dataset from ProteinGym.

- the dataset is located at `data/BLAT_ECOLX_Jacquier_2013.csv`
- the variants to score, as a FASTA file, are located at `data/BLAT_ECOLX_Jacquier_2013_variants.fasta`
- the MSA of homologs of the WT sequence, generated using ColabFold MMseqs2 with the UniRef2202 database, is located at `data/BLAT_ECOLX_ColabFold_2202.a3m`
- the scores will be saved as a numpy array at `data/BLAT_ECOLX_Jacquier_2013_variants.npy`
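To score your own variants instead of the bundled example, you need a variants file in FASTA form. The sketch below shows one way to write it. Both helpers are hypothetical and not part of the repository. The substitution notation (`K2A`, meaning K at 1-based position 2 replaced by A) and the record names are assumptions, as this README does not say what the script expects in FASTA headers.

```python
def apply_substitution(wild_type, mutation):
    """Apply one substitution such as "K2A" (1-based position) to a sequence."""
    old, position, new = mutation[0], int(mutation[1:-1]), mutation[-1]
    if wild_type[position - 1] != old:
        raise ValueError(f"{mutation}: expected {old} at position {position}")
    return wild_type[: position - 1] + new + wild_type[position:]


def write_variants_fasta(variants, fasta_path):
    """Write (name, sequence) pairs as FASTA records, one sequence per line."""
    with open(fasta_path, "w") as handle:
        for name, sequence in variants:
            handle.write(f">{name}\n{sequence}\n")
```

Keeping the records in a known order matters, because the output is a plain array of scores with no variant names attached.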

The scores produced by the script should achieve a Spearman correlation of `>0.65` with the measured fitness (the `DMS_score` column in the dataset file).
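The check can be reproduced with a short script such as the sketch below. It assumes that the saved array is one-dimensional and that its entries are in the same order as the rows of the dataset CSV; neither is stated in this README. The function names are hypothetical. Spearman correlation is computed here as the Pearson correlation of average ranks, and `numpy` is imported only where the `.npy` file is read.

```python
import csv
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def read_dms_scores(csv_path):
    """Read the DMS_score column from the dataset CSV."""
    with open(csv_path, newline="") as handle:
        return [float(row["DMS_score"]) for row in csv.DictReader(handle)]


def evaluate(npy_path, csv_path):
    """Correlate saved scores with DMS_score (assumes matching row order)."""
    import numpy as np  # needed only to read the .npy output

    scores = np.load(npy_path).tolist()
    return spearman(scores, read_dms_scores(csv_path))
```

For the example above, this would be called as `evaluate("data/BLAT_ECOLX_Jacquier_2013_variants.npy", "data/BLAT_ECOLX_Jacquier_2013.csv")`. If SciPy is installed, `scipy.stats.spearmanr` can replace the hand-written `spearman` function.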

## Citation

You may cite the paper as:

```
@inproceedings{NEURIPS2023_f4366126,
author = {Truong Jr, Timothy and Bepler, Tristan},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {77379--77415},
publisher = {Curran Associates, Inc.},
title = {PoET: A generative model of protein families as sequences-of-sequences},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/f4366126eba252699b280e8f93c0ab2f-Paper-Conference.pdf},
volume = {36},
year = {2023}
}
```

## License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

The [PoET model weights](https://zenodo.org/records/10061322) (DOI: `10.5281/zenodo.10061322`) are available under the [CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/) license for academic use only. The license can also be found in the LICENSE file provided with the model weights. For commercial use, please reach out to us at [email protected] about licensing. Copyright (c) NE47 Bio, Inc. All Rights Reserved.
