Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Ankh: Optimized Protein Language Model
https://github.com/agemagician/Ankh
- Host: GitHub
- URL: https://github.com/agemagician/Ankh
- Owner: agemagician
- License: other
- Created: 2022-10-27T18:35:27.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-26T18:20:35.000Z (6 months ago)
- Last Synced: 2024-03-22T13:42:29.494Z (3 months ago)
- Language: Python
- Size: 39 MB
- Stars: 181
- Watchers: 8
- Forks: 19
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Lists
- awesome-biochem-ai - Ankh (2023) - [paper](https://arxiv.org/abs/2301.06568) The 2023 protein language model by the main author of ProtTrans. Ankh, which employs a T5-like architecture, claims to be superior to both ProtTrans and ESM/ESM2 on standard protein property predictions. (Libraries on Molecule AI / 3D)
README
Ankh ☥: Optimized Protein Language Model Unlocks General-Purpose Modelling
[Ankh](https://arxiv.org/abs/2301.06568) is the first general-purpose protein language model trained on Google's **TPU-V4**. It surpasses state-of-the-art performance with dramatically fewer parameters, promoting accessibility to research innovation via attainable resources.
This repository will be updated regularly with **new pre-trained models for proteins** as part of supporting the **biotech** community in revolutionizing protein engineering using AI.
Table of Contents
=================
* [ Installation](#install)
* [ Models Availability](#models)
* [ Dataset Availability](#datasets)
* [ Usage](#usage)
* [ Original downstream Predictions](#results)
* [ Followup use-cases](#inaction)
* [ Comparisons to other tools](#comparison)
* [ Community and Contributions](#community)
* [ Have a question?](#question)
* [ Found a bug?](#bug)
* [ Requirements](#requirements)
* [ Sponsors](#sponsors)
* [ Team](#team)
* [ License](#license)
* [ Citation](#citation)

## Installation

```bash
python -m pip install ankh
```

## Models Availability

| Model | ankh | Hugging Face |
|------------------------------------|-----------------------------------|-----------------------------------------------------------|
| Ankh Large | `ankh.load_large_model()` |[Ankh Large](https://huggingface.co/ElnaggarLab/ankh-large)|
| Ankh Base | `ankh.load_base_model()` | [Ankh Base](https://huggingface.co/ElnaggarLab/ankh-base) |

## Datasets Availability

| Dataset | Hugging Face |
| ----------------------------- |---------------------------------------------------------------------------------------------------|
| Remote Homology | `load_dataset("proteinea/remote_homology")` |
| CASP12 | `load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CASP12.csv']})`|
| CASP14 | `load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CASP14.csv']})`|
| CB513 | `load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CB513.csv']})` |
| TS115 | `load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['TS115.csv']})` |
| DeepLoc | `load_dataset("proteinea/deeploc")` |
| Fluorescence | `load_dataset("proteinea/fluorescence")` |
| Solubility | `load_dataset("proteinea/solubility")` |
| Nearest Neighbor Search | `load_dataset("proteinea/nearest_neighbor_search")` |

## Usage

* Loading pre-trained models:
```python
import ankh

# To load the large model:
model, tokenizer = ankh.load_large_model()
model.eval()

# To load the base model:
model, tokenizer = ankh.load_base_model()
model.eval()
```

* Feature extraction using Ankh Large example:
```python
import torch
import ankh

model, tokenizer = ankh.load_large_model()
model.eval()

protein_sequences = ['MKALCLLLLPVLGLLVSSKTLCSMEEAINERIQEVAGSLIFRAISSIGLECQSVTSRGDLATCPRGFAVTGCTCGSACGSWDVRAETTCHCQCAGMDWTGARCCRVQPLEHHHHHH',
                     'GSHMSLFDFFKNKGSAATATDRLKLILAKERTLNLPYMEEMRKEIIAVIQKYTKSSDIHFKTLDSNQSVETIEVEIILPR']

# Split each sequence into a list of residues so that one amino acid maps
# to one token (required when is_split_into_words=True below).
protein_sequences = [list(seq) for seq in protein_sequences]
outputs = tokenizer.batch_encode_plus(protein_sequences,
add_special_tokens=True,
padding=True,
is_split_into_words=True,
return_tensors="pt")
with torch.no_grad():
embeddings = model(input_ids=outputs['input_ids'], attention_mask=outputs['attention_mask'])
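# The per-residue embeddings live in the model output's last_hidden_state,
# typically of shape (batch, seq_len, hidden_dim).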
```

* Loading downstream models example:
```python
# To use downstream model for binary classification:
binary_classification_model = ankh.ConvBertForBinaryClassification(input_dim=768,
nhead=4,
hidden_dim=384,
num_hidden_layers=1,
num_layers=1,
kernel_size=7,
dropout=0.2,
pooling='max')

# To use the downstream model for multiclass classification:
multiclass_classification_model = ankh.ConvBertForMultiClassClassification(num_tokens=2,
input_dim=768,
nhead=4,
hidden_dim=384,
num_hidden_layers=1,
num_layers=1,
kernel_size=7,
dropout=0.2)

# To use the downstream model for regression:
# training_labels_mean is an optional parameter used to initialize the output
# layer's bias. The label mean is the MSE-optimal constant predictor, so this
# starts training from a sensible baseline and speeds up convergence.
regression_model = ankh.ConvBertForRegression(input_dim=768,
nhead=4,
hidden_dim=384,
num_hidden_layers=1,
num_layers=1,
kernel_size=7,
dropout=0,
pooling='max',
training_labels_mean=0.38145)
```
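The downstream models above collapse per-residue embeddings into one fixed-size vector per protein via the `pooling` argument. As a minimal plain-Python sketch (a toy 4-dimensional embedding stands in for Ankh's real hidden sizes), `'max'` pooling takes the element-wise maximum across residues:

```python
# Toy illustration of 'max' pooling over per-residue embeddings:
# (seq_len, dim) -> (dim,), yielding one fixed-size vector per protein.

def max_pool(residue_embeddings):
    """Element-wise maximum across residues (a list of equal-length vectors)."""
    dim = len(residue_embeddings[0])
    return [max(vec[d] for vec in residue_embeddings) for d in range(dim)]

# Three residues, embedding dimension 4 (real Ankh dims are far larger).
residues = [
    [0.1, -0.5,  0.3, 0.0],
    [0.4,  0.2, -0.1, 0.7],
    [-0.2, 0.9,  0.0, 0.1],
]
print(max_pool(residues))  # -> [0.4, 0.9, 0.3, 0.7]
```

With `pooling='max'` the downstream head applies this reduction internally, so a variable-length protein always produces an `input_dim`-sized vector for classification or regression.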
## Original downstream Predictions
* Secondary Structure Prediction (Q3):
| Model | CASP12 | CASP14 | TS115 | CB513 |
|--------------------------|:----------------:|:-------------:|:-------------:|:------------:|
|**Ankh 2 Large** | 84.18% | 76.82% | 88.59% | 88.78% |
|Ankh Large | 83.59% | 77.48% | 88.22% | 88.48% |
|Ankh Base | 80.81% | 76.67% | 86.92% | 86.94% |
|ProtT5-XL-UniRef50 | 83.34% | 75.09% | 86.82% | 86.64% |
|ESM2-15B | 83.16% | 76.56% | 87.50% | 87.35% |
|ESM2-3B | 83.14% | 76.75% | 87.50% | 87.44% |
|ESM2-650M | 82.43% | 76.97% | 87.22% | 87.18% |
|ESM-1b | 79.45% | 75.39% | 85.02% | 84.31% |
* Secondary Structure Prediction (Q8):
| Model | CASP12 | CASP14 | TS115 | CB513 |
|--------------------------|:----------------:|:-------------:|:-------------:|:------------:|
|**Ankh 2 Large** | 72.90% | 62.84% | 79.88% | 79.01% |
|Ankh Large | 71.69% | 63.17% | 79.10% | 78.45% |
|Ankh Base | 68.85% | 62.33% | 77.08% | 75.83% |
|ProtT5-XL-UniRef50 | 70.47% | 59.71% | 76.91% | 74.81% |
|ESM2-15B | 71.17% | 61.81% | 77.67% | 75.88% |
|ESM2-3B | 71.69% | 61.52% | 77.62% | 75.95% |
|ESM2-650M | 70.50% | 62.10% | 77.68% | 75.89% |
|ESM-1b | 66.02% | 60.34% | 73.82% | 71.55% |
* Contact Prediction Long Precision Using Embeddings:
| Model | ProteinNet (L/1) | ProteinNet (L/5) | CASP14 (L/1) | CASP14 (L/5) |
|--------------------------|:----------------:|:----------------:|:-------------:|:------------:|
|Ankh 2 Large | In Progress | In Progress | In Progress | In Progress |
|**Ankh Large** | 48.93% | 73.49% | 16.01% | 29.91% |
|Ankh Base | 43.21% | 66.63% | 13.50% | 28.65% |
|ProtT5-XL-UniRef50 | 44.74% | 68.95% | 11.95% | 24.45% |
|ESM2-15B | 31.62% | 52.97% | 14.44% | 26.61% |
|ESM2-3B | 30.24% | 51.34% | 12.20% | 21.91% |
|ESM2-650M | 29.36% | 50.74% | 13.71% | 22.25% |
|ESM-1b | 29.25% | 50.69% | 10.18% | 18.08% |
* Contact Prediction Long Precision Using attention scores:
| Model | ProteinNet (L/1) | ProteinNet (L/5) | CASP14 (L/1) | CASP14 (L/5) |
|--------------------------|:----------------:|:----------------:|:-------------:|:------------:|
|Ankh 2 Large | In Progress | In Progress | In Progress | In Progress |
|**Ankh Large** | 31.44% | 55.58% | 11.05% | 20.74% |
|Ankh Base | 25.93% | 46.28% | 9.32% | 19.51% |
|ProtT5-XL-UniRef50 | 30.85% | 51.90% | 8.60% | 16.09% |
|ESM2-15B | 33.32% | 57.44% | 12.25% | 24.60% |
|ESM2-3B | 33.92% | 56.63% | 12.17% | 21.36% |
|ESM2-650M | 31.87% | 54.63% | 10.66% | 21.01% |
|ESM-1b | 25.30% | 42.03% | 7.77% | 15.77% |
* Localization (Q10):
| Model | DeepLoc Dataset |
|--------------------------|:----------------:|
|Ankh 2 Large | 82.57% |
|**Ankh Large** | 83.01% |
|Ankh Base | 81.38% |
|ProtT5-XL-UniRef50 | 82.95% |
|ESM2-15B | 81.22% |
|ESM2-3B | 81.22% |
|ESM2-650M | 82.08% |
|ESM-1b | 80.51% |
* Remote Homology:
| Model | SCOPe (Fold) |
|--------------------------|:----------------:|
|**Ankh 2 Large** | 62.09% |
|Ankh Large | 61.01% |
|Ankh Base | 61.14% |
|ProtT5-XL-UniRef50 | 59.38% |
|ESM2-15B | 54.48% |
|ESM2-3B | 59.24% |
|ESM2-650M | 51.36% |
|ESM-1b | 56.93% |
* Solubility:
| Model | Solubility |
|--------------------------|:----------------:|
|Ankh 2 Large | 75.86% |
|**Ankh Large** | 76.41% |
|Ankh Base | 76.36% |
|ProtT5-XL-UniRef50 | 76.26% |
|ESM2-15B | 60.52% |
|ESM2-3B | 74.91% |
|ESM2-650M | 74.56% |
|ESM-1b | 74.91% |
* Fluorescence (Spearman Correlation):
| Model | Fluorescence |
|--------------------------|:----------------:|
|Ankh 2 Large | 0.62 |
|**Ankh Large** | 0.62 |
|Ankh Base | 0.62 |
|ProtT5-XL-UniRef50 | 0.61 |
|ESM2-15B | 0.56 |
|ESM-1b | 0.48 |
|ESM2-650M | 0.48 |
|ESM2-3B | 0.46 |
* Nearest Neighbor Search using Global Pooling:
| Model | Lookup69K (C) | Lookup69K (A) | Lookup69K (T) | Lookup69K (H) |
|--------------------------|:----------------:|:----------------:|:----------------:|:----------------:|
|Ankh 2 Large | In Progress | In Progress | In Progress | In Progress |
|Ankh Large | 0.83 | 0.72 | 0.60 | 0.70 |
|**Ankh Base** | 0.85 | 0.77 | 0.63 | 0.72 |
|ProtT5-XL-UniRef50 | 0.83 | 0.69 | 0.57 | 0.73 |
|ESM2-15B | 0.78 | 0.63 | 0.52 | 0.67 |
|ESM2-3B | 0.79 | 0.65 | 0.53 | 0.64 |
|ESM2-650M | 0.72 | 0.56 | 0.40 | 0.53 |
|ESM-1b                    | 0.78             | 0.65             | 0.51             | 0.63             |

## Team

* Technical University of Munich:
| [Ahmed Elnaggar](https://github.com/agemagician) | Burkhard Rost |
|:------------------------------------------------:|:-------------------------:|
* Proteinea:
| [Hazem Essam](https://github.com/hazemessamm) | [Wafaa Ashraf](https://github.com/wafaaashraf) | [Walid Moustafa](https://github.com/wmustafaawad) | [Mohamed Elkerdawy](https://github.com/melkerdawy) |
|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
* Columbia University:
| [Charlotte Rochereau](https://github.com/crochereau) |
|:----------------------------------------------------:|

## Sponsors

| Google Cloud |
|:------------:|
## License
Ankh pretrained models are released under the terms of the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
## Community and Contributions

The Ankh project is an **open-source project** supported by various partner companies and research institutions. We are committed to **sharing all our pre-trained models and knowledge**. We would be more than happy if you could help us by sharing new pre-trained models, fixing bugs, proposing new features, improving our documentation, spreading the word, or supporting our project.
## Have a question?

We are happy to hear your questions on our issues page [Ankh](https://github.com/agemagician/Ankh/issues)! If you have a private question or want to cooperate with us, you can always **reach out to us directly** via [Hello](mailto:[email protected]?subject=[GitHub]Ankh).
## Found a bug?

Feel free to **file a new issue** with a respective title and description in the [Ankh](https://github.com/agemagician/Ankh/issues) repository. If you have already found a solution to your problem, **we would love to review your pull request**!
## ✏️ Citation
If you use this code or our pretrained models for your publication, please cite the original paper:
```
@article{elnaggar2023ankh,
title={Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling},
author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
journal={arXiv preprint arXiv:2301.06568},
year={2023}
}
```