https://github.com/OpenProteinAI/PoET
Inference code for PoET: A generative model of protein families as sequences-of-sequences
deep-learning generative-model protein-engineering protein-language-model protein-sequences proteins
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/OpenProteinAI/PoET
- Owner: OpenProteinAI
- License: mit
- Created: 2023-10-28T01:30:26.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-24T21:15:33.000Z (8 months ago)
- Last Synced: 2024-08-03T14:08:47.454Z (4 months ago)
- Topics: deep-learning, generative-model, protein-engineering, protein-language-model, protein-sequences, proteins
- Language: Python
- Homepage:
- Size: 772 KB
- Stars: 44
- Watchers: 3
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-protein-design-software - PoET - [paper](https://doi.org/10.48550/arXiv.2306.06156) (Sequence generation)
README
# PoET: A generative model of protein families as sequences-of-sequences
This repo contains inference code for ["PoET: A generative model of protein families as sequences-of-sequences"](https://arxiv.org/abs/2306.06156), a state-of-the-art protein language model for variant effect prediction and conditional sequence generation.
## Environment Setup
1. Have `mamba` (faster alternative to `conda`) installed ([Instructions](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html))
1. Have `conda-lock` installed in your base conda/mamba environment ([Instructions](https://github.com/conda/conda-lock#installation))
1. Run `make create_conda_env`. This will create a conda environment named `poet`.
1. Run `make download_model` to download the model (~400MB). The model will be located at `data/poet.ckpt`. Please note the [license](#License).
## Scoring variants
Use the script `scripts/score.py` to obtain fitness scores for a list of protein variants given an MSA of homologs of the WT sequence.
1. Be on a machine with an NVIDIA GPU. The model cannot run on CPU only.
1. Activate the `poet` conda environment
1. Run the script, replacing the values in angle brackets with the appropriate paths.
```
python scripts/score.py \
--msa_a3m_path <path to MSA .a3m file> \
--variants_fasta_path <path to variants FASTA file> \
--output_npy_path <path to output .npy file>
```
You can pass a lower value for the batch size (`--batch_size`) if you run out of VRAM. The script was tested on an A100 GPU with 40 GB of VRAM.
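
When several variant sets need scoring, the command above can be assembled from Python instead of typed by hand. The following is a minimal sketch, not part of the repo: `build_score_command` and `run_scoring` are hypothetical helper names, and only the four flags shown are taken from the command above. It assumes the `poet` environment is active (so `sys.executable` points at its interpreter) and that it is run from the repository root.

```python
import subprocess
import sys
from typing import Iterable, List, Optional, Tuple


def build_score_command(
    msa_a3m_path: str,
    variants_fasta_path: str,
    output_npy_path: str,
    batch_size: Optional[int] = None,
) -> List[str]:
    """Assemble the argument list for scripts/score.py (hypothetical helper)."""
    cmd = [
        sys.executable,  # assumes the `poet` conda environment is active
        "scripts/score.py",
        "--msa_a3m_path", str(msa_a3m_path),
        "--variants_fasta_path", str(variants_fasta_path),
        "--output_npy_path", str(output_npy_path),
    ]
    if batch_size is not None:
        # Only pass --batch_size when asked, so the script's own default
        # applies otherwise.
        cmd += ["--batch_size", str(batch_size)]
    return cmd


def run_scoring(
    jobs: Iterable[Tuple[str, str, str]],
    batch_size: Optional[int] = None,
) -> None:
    """Run score.py once per (msa, variants, output) triple, stopping on failure."""
    for msa, variants, out in jobs:
        subprocess.run(
            build_score_command(msa, variants, out, batch_size),
            check=True,  # raise CalledProcessError if the script exits non-zero
        )
```

Passing a smaller `batch_size` to `run_scoring` is one way to apply the VRAM advice to every job in a batch.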
## Example
Run the scoring script without arguments (`python scripts/score.py`) to score variants in the `BLAT_ECOLX_Jacquier_2013` dataset from ProteinGym.
- the dataset is located at `data/BLAT_ECOLX_Jacquier_2013.csv`
- the variants to score are provided as a FASTA file at `data/BLAT_ECOLX_Jacquier_2013_variants.fasta`
- the MSA of homologs of the WT sequence, generated using ColabFold MMseqs2 with the UniRef2202 database, is located at `data/BLAT_ECOLX_ColabFold_2202.a3m`
- the scores will be saved as a NumPy array at `data/BLAT_ECOLX_Jacquier_2013_variants.npy`

The scores produced by the script should reach a Spearman correlation of `>0.65` with the measured fitness (the `DMS_score` column in the dataset file).
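
The Spearman check described above can be sketched as follows. This is an illustration under stated assumptions, not code from the repo: it assumes the saved `.npy` file holds a 1-D array with one score per variant, and that the rows of the CSV are in the same order as the variants in the FASTA file. The function names `average_ranks`, `spearman`, and `load_example` are hypothetical. The correlation itself is computed with the standard library only; NumPy is imported only to read the score file.

```python
import csv
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def load_example(npy_path: str, csv_path: str) -> Tuple[List[float], List[float]]:
    """Load model scores and measured fitness (assumes matching row order)."""
    import numpy as np  # imported lazily; only needed to read the .npy file

    scores = np.load(npy_path).tolist()  # assumed to be a 1-D array
    with open(csv_path, newline="") as fh:
        measured = [float(row["DMS_score"]) for row in csv.DictReader(fh)]
    return scores, measured
```

If the assumptions hold, `spearman(*load_example("data/BLAT_ECOLX_Jacquier_2013_variants.npy", "data/BLAT_ECOLX_Jacquier_2013.csv"))` gives the number to compare against the `>0.65` threshold.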
## Citation
You may cite the paper as
```
@inproceedings{NEURIPS2023_f4366126,
author = {Truong Jr, Timothy and Bepler, Tristan},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {77379--77415},
publisher = {Curran Associates, Inc.},
title = {PoET: A generative model of protein families as sequences-of-sequences},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/f4366126eba252699b280e8f93c0ab2f-Paper-Conference.pdf},
volume = {36},
year = {2023}
}
```
## License
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.
The [PoET model weights](https://zenodo.org/records/10061322) (DOI: `10.5281/zenodo.10061322`) are available under the [CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/) license for academic use only. The license can also be found in the LICENSE file provided with the model weights. For commercial use, please reach out to us at [email protected] about licensing. Copyright (c) NE47 Bio, Inc. All Rights Reserved.