https://github.com/kklemon/protenc

Extract protein embeddings the easy way.
deep-learning drug-discovery protein-sequences

Host: GitHub
URL: https://github.com/kklemon/protenc
Owner: kklemon
License: mit
Created: 2022-11-28T10:49:05.000Z (almost 3 years ago)
Default Branch: master
Last Pushed: 2023-10-06T08:37:31.000Z (almost 2 years ago)
Last Synced: 2024-08-11T12:14:54.317Z (about 1 year ago)
Topics: deep-learning, drug-discovery, protein-sequences
Language: Python
Homepage:
Size: 117 KB
Stars: 6
Watchers: 1
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

ProtEnc: generate protein embeddings the easy way
=======

[ProtEnc](https://github.com/kklemon/ProtEnc) aims to simplify the extraction of protein embeddings from pre-trained models by providing simple APIs and bulk generation scripts for the ever-growing landscape of protein language models (pLMs). Currently supported models are:

* [ProtTrans](https://github.com/agemagician/ProtTrans) family

* [ESM](https://github.com/facebookresearch/esm)

* AlphaFold (coming soon™)

* [OmegaPLM](https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1) (coming soon™)

Usage
-----

### Installation

```bash
pip install protenc
```

### Python API

```python
import protenc

# List available models
print(protenc.list_models())

# Load encoder model
encoder = protenc.get_encoder('esm2_t30_150M_UR50D', device='cuda')

proteins = [
  'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
  'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]

for embed in encoder(proteins, return_format='numpy'):
  # Embeddings have shape [L, D] where L is the sequence length and D the embedding dimensionality.
  print(embed.shape)

  # Derive a single per-protein embedding vector of shape [D] by averaging along the sequence dimension
  protein_embed = embed.mean(0)
  print(protein_embed.shape)
```
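
To spell out what averaging along the sequence dimension computes, here is a minimal plain-Python sketch. `mean_pool` is a hypothetical helper that is not part of ProtEnc, and the tiny `[3, 2]` input is made-up data standing in for a real `[L, D]` embedding. With NumPy output, `embed.mean(0)` should give the same values.

```python
def mean_pool(embed):
    """Average an [L, D] embedding over its L rows, giving a length-D vector.

    Hypothetical helper for illustration; not part of ProtEnc.
    """
    rows = list(embed)
    if not rows:
        raise ValueError("cannot pool an empty embedding")
    # zip(*rows) regroups the values column by column,
    # i.e. one group per embedding dimension.
    return [sum(column) / len(rows) for column in zip(*rows)]


# Made-up per-residue embedding with L=3 residues and D=2 dimensions.
toy_embed = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

pooled = mean_pool(toy_embed)
print(pooled)  # [3.0, 4.0]
```

Because every protein is reduced to a vector of the same length `D`, pooled embeddings of proteins with different sequence lengths can be stacked into a single matrix.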

### Command-line interface

After installation, use the `protenc` shell command for bulk generation and export of protein embeddings.

```bash
protenc sequences.fasta embeddings.lmdb --model_name=<model_name>
```

By default, input and output formats are inferred from the file extensions.

Run

```bash
protenc --help
```

for a detailed usage description.

**Example**

Generate protein embeddings using the ESM2 650M model for sequences provided in a [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file and write embeddings to an [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database):

```bash
protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D
```

The generated embeddings are stored in an LMDB key-value store and can be read back with the `read_from_lmdb` utility function:

```python
from protenc.utils import read_from_lmdb

for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)
```

**Features**

Input formats:

* CSV

* JSON

* [FASTA](https://en.wikipedia.org/wiki/FASTA_format)
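
If the sequences live in Python rather than in a file, a FASTA input can be written with the standard library alone. The sketch below is only an illustration: `write_fasta` and `read_fasta` are hypothetical helpers that are not part of ProtEnc, the labels are made up, and the reader is there only to check the round trip.

```python
import tempfile
from pathlib import Path


def write_fasta(records, path):
    """Write (label, sequence) pairs to `path`, one sequence per line."""
    with open(path, "w") as handle:
        for label, sequence in records:
            handle.write(f">{label}\n{sequence}\n")


def read_fasta(path):
    """Yield (label, sequence) pairs from a FASTA file."""
    label, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if label is not None:
                    yield label, "".join(chunks)
                label, chunks = line[1:], []
            else:
                chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


# Made-up labels; the sequences are the two from the Python API example.
records = [
    ("protein_1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein_2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = Path(tmp_dir) / "proteins.fasta"
    write_fasta(records, fasta_path)
    round_trip = list(read_fasta(fasta_path))

for label, sequence in round_trip:
    print(label, len(sequence))
```

A file written this way can then be passed to the `protenc` command as the input argument.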

Output formats:

* [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database)

* [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) (coming soon)

General:

* Multi-GPU inference (`--data_parallel`)

* FP16 inference (`--amp`)
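
When driving bulk generation from a Python script, the command line can be assembled programmatically. The following is a minimal sketch: `build_protenc_command` is a hypothetical helper, and it assumes that `--data_parallel` and `--amp` are plain on/off switches. Check `protenc --help` for the exact syntax.

```python
import shlex


def build_protenc_command(input_path, output_path, model_name,
                          data_parallel=False, amp=False):
    """Assemble the argument list for a `protenc` call (hypothetical helper)."""
    command = [
        "protenc",
        str(input_path),
        str(output_path),
        f"--model_name={model_name}",
    ]
    # Assumption: both options are boolean switches that take no value.
    if data_parallel:
        command.append("--data_parallel")
    if amp:
        command.append("--amp")
    return command


command = build_protenc_command(
    "proteins.fasta",
    "embeddings.lmdb",
    "esm2_t33_650M_UR50D",
    data_parallel=True,
    amp=True,
)

# Show the command as it would be typed in a shell.
print(shlex.join(command))

# To actually run it (requires protenc to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.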

Development
-----------

Clone the repository:

```bash
git clone https://github.com/kklemon/protenc.git
```

Install dependencies via [Poetry](https://python-poetry.org/):

```bash
poetry install
```

Contribution
------------

Have a feature idea or found a bug? Would you like to see support for a new model? Feel free to [create an issue](https://github.com/kklemon/ProtEnc/issues/new).

Todo
----

- [ ] Support for more input formats

  - [X] CSV

  - [ ] Parquet

  - [X] FASTA

  - [X] JSON

- [ ] Support for more output formats

  - [X] LMDB

  - [ ] HDF5

  - [ ] DataFrame

  - [ ] Pickle

- [ ] Support for large models

  - [ ] Model offloading

  - [ ] Sharding

  - [ ] FlashAttention (via Kernl?)

- [ ] Support for more protein language models

  - [X] Whole ProtTrans family

  - [X] Whole ESM family

  - [ ] AlphaFold (?)

- [X] Implement all remaining TODOs in code

- [ ] Evaluation

- [ ] Demos

- [ ] Distributed inference

- [ ] Maybe support some sort of optimized inference such as quantization

  - This may be up to the model providers

- [ ] Improve documentation

- [ ] Support translation of gene sequences

- [ ] Add tests. We need tests!!!
