https://github.com/Merck/Sapiens
Sapiens is a human antibody language model based on BERT.
- Host: GitHub
- URL: https://github.com/Merck/Sapiens
- Owner: Merck
- License: mit
- Created: 2022-02-02T17:11:37.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-04-19T13:14:46.000Z (almost 2 years ago)
- Last Synced: 2025-01-09T02:11:30.802Z (23 days ago)
- Topics: antibody, bert, embeddings, language-model, sapiens
- Language: Jupyter Notebook
- Homepage:
- Size: 8.72 MB
- Stars: 50
- Watchers: 10
- Forks: 17
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- top-pharma50 - **Merck/Sapiens** - …`model`, `sapiens` | 40 stars | 13 forks | Jupyter Notebook | MIT License | last pushed 2023-04-19 13:14:46 | (Ranked by starred repositories)
README
# Sapiens: Human antibody language model
```
 ____              _
/ ___|  __ _ _ __ (_) ___ _ __  ___
\___ \ / _` | '_ \| |/ _ \ '_ \/ __|
 ___| | |_| | |_| | |  __/ | | \__ \
|____/ \__,_|  __/|_|\___|_| |_|___/
            |_|
```

Sapiens is a human antibody language model based on BERT.
Learn more about Sapiens, OASis and BioPhi in our publication:
> David Prihoda, Jad Maamary, Andrew Waight, Veronica Juan, Laurence Fayadat-Dilman, Daniel Svozil & Danny A. Bitton (2022)
> BioPhi: A platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning, mAbs, 14:1, DOI: https://doi.org/10.1080/19420862.2021.2020203

For more information about BioPhi, see the [BioPhi repository](https://github.com/Merck/BioPhi).
## Features
- Infilling missing residues in human antibody sequences
- Suggesting mutations (in frameworks as well as CDRs)
- Creating vector representations (embeddings) of residues or sequences

![Sapiens Antibody t-SNE Example](notebooks/Embedding_t-SNE.png)
## Usage
Install Sapiens using pip:
```bash
# Recommended: Create dedicated conda environment
conda create -n sapiens python=3.8
conda activate sapiens
# Install Sapiens
pip install sapiens
```

❗️ Python 3.7 or 3.8 is currently required due to a fairseq bug in Python 3.9 and above: https://github.com/pytorch/fairseq/issues/3535
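Because of this constraint, it can help to fail early when a script runs under an unsupported interpreter. The following is a minimal sketch, assuming the 3.7/3.8 requirement stated above still applies to the installed release. `is_supported_python` and `SUPPORTED_VERSIONS` are hypothetical names for illustration, not part of the Sapiens package.

```python
import sys

# Interpreter versions named in the note above. This is an assumption that
# may become outdated once the linked fairseq issue is resolved.
SUPPORTED_VERSIONS = {(3, 7), (3, 8)}


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version is one the README lists as supported."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS
```

A script could call `is_supported_python()` before `import sapiens` and print a warning or exit when it returns `False`.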
### Antibody sequence infilling
Positions marked with `*` or `X` will be infilled with the most likely human residues, given the rest of the sequence.
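Preparing such a masked input by hand is error-prone for long sequences. A minimal sketch of a masking helper follows; `mask_positions` is a hypothetical function for illustration, not part of the Sapiens API, and it assumes 0-based positions.

```python
def mask_positions(sequence, positions, mask_char="*"):
    """Return a copy of `sequence` with the given 0-based positions replaced by `mask_char`."""
    residues = list(sequence)
    for pos in positions:
        if not 0 <= pos < len(residues):
            raise IndexError(f"position {pos} is outside the sequence")
        residues[pos] = mask_char
    return "".join(residues)


# Mask the first two residues and the sixth residue of a short fragment.
masked = mask_positions("QVQLVQSG", [0, 1, 5])
print(masked)  # **QLV*SG
```

The resulting string can be passed to `sapiens.predict_masked`, as in the example that follows: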
```python
import sapiens

best = sapiens.predict_masked(
    '**QLV*SGVEVKKPGASVKVSCKASGYTFTNYYMYWVRQAPGQGLEWMGGINPSNGGTNFNEKFKNRVTLTTDSSTTTAYMELKSLQFDDTAVYYCARRDYRFDMGFDYWGQGTTVTVSS',
    'H'
)
print(best)
# QVQLVQSGVEVKKPGASVKVSCKASGYTFTNYYMYWVRQAPGQGLEWMGGINPSNGGTNFNEKFKNRVTLTTDSSTTTAYMELKSLQFDDTAVYYCARRDYRFDMGFDYWGQGTTVTVSS
```

### Suggesting mutations
Return residue scores for a given sequence:
```python
import sapiens

scores = sapiens.predict_scores(
    '**QLV*SGVEVKKPGASVKVSCKASGYTFTNYYMYWVRQAPGQGLEWMGGINPSNGGTNFNEKFKNRVTLTTDSSTTTAYMELKSLQFDDTAVYYCARRDYRFDMGFDYWGQGTTVTVSS',
    'H'
)
scores.head()
#           A         C         D         E ...
# 0  0.003272  0.004147  0.004011  0.004590 ...  <- based on masked input
# 1  0.012038  0.003854  0.006803  0.008174 ...  <- based on masked input
# 2  0.003384  0.003895  0.003726  0.004068 ...  <- based on Q input
# 3  0.004612  0.005325  0.004443  0.004641 ...  <- based on L input
# 4  0.005519  0.003664  0.003555  0.005269 ...  <- based on V input
#
# Scores are given both for residues that are masked and that are present.
# When inputting a non-human antibody sequence, the output scores can be used for humanization.
```

### Antibody sequence embedding
Get a vector representation of each position in a sequence:
```python
import sapiens

residue_embed = sapiens.predict_residue_embedding(
    'QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS',
    'H',
    layer=None
)
residue_embed.shape
# (layer, position in sequence, features)
# (5, 119, 128)
```

Get a single vector for each sequence:
```python
import sapiens

seq_embed = sapiens.predict_sequence_embedding(
    'QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS',
    'H',
    layer=None
)
seq_embed.shape
# (layer, features)
# (5, 128)
```

### Notebooks
Try out Sapiens in your browser using these example notebooks:

| Notebook | Description |
| --- | --- |
| 01_sapiens_antibody_infilling | Predict missing positions in an antibody sequence |
| 02_sapiens_antibody_embedding | Get vector representations and visualize them using t-SNE |
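
Besides t-SNE visualization, two sequence embeddings can also be compared directly. The following is a minimal sketch, assuming one layer of each `predict_sequence_embedding` result (for example `seq_embed[-1]`) has been converted to a plain list of floats. `cosine_similarity` is a hypothetical helper for illustration, not part of Sapiens.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real 128-dimensional embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

Values close to 1 indicate vectors pointing in nearly the same direction. Whether that corresponds to biological similarity depends on the layer and the sequences compared, so treat it as an exploratory measure.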
## Acknowledgements
Sapiens is based on antibody repertoires from the Observed Antibody Space:
> Kovaltsuk, A., Leem, J., Kelm, S., Snowden, J., Deane, C. M., & Krawczyk, K. (2018). Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. The Journal of Immunology, 201(8), 2502–2509. https://doi.org/10.4049/jimmunol.1800708