Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dohlee/antiberty-pytorch
An unofficial re-implementation of AntiBERTy, an antibody-specific protein language model, in PyTorch.
- Host: GitHub
- URL: https://github.com/dohlee/antiberty-pytorch
- Owner: dohlee
- License: mit
- Created: 2023-03-23T08:58:48.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-03-21T14:25:13.000Z (8 months ago)
- Last Synced: 2024-09-23T20:36:35.427Z (about 2 months ago)
- Topics: antibody-sequence, antibody-sequences, bioinformatics, biology, protein, protein-language-model, protein-sequences
- Language: Jupyter Notebook
- Size: 228 KB
- Stars: 22
- Watchers: 3
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# antiberty-pytorch
[![Lightning](https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white)](https://github.com/Lightning-AI/lightning)

![antiberty_model](img/banner.png)
## Installation
```bash
$ pip install antiberty-pytorch
```

## Reproduction status
### Number of parameters
![numparams](img/antiberty_num_params.png)
This implementation of AntiBERTy has 25,759,769 parameters in total, which matches the ~26M parameters specified in the paper (see above).
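Counting parameters in PyTorch is a one-liner over `model.parameters()`. The sketch below applies it to a toy BERT-style encoder; the layer sizes here are illustrative stand-ins, not the actual AntiBERTy configuration:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy stand-in: an embedding, a small Transformer encoder stack, and a
# masked-token prediction head (sizes are hypothetical, for illustration).
toy = nn.Sequential(
    nn.Embedding(25, 512),  # 20 amino acids + a few special tokens
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=8,
    ),
    nn.Linear(512, 25),     # project back to the token vocabulary
)
print(f"{count_parameters(toy):,} parameters")
```

Running the same counter over this repository's model is how the 25,759,769 figure above can be verified.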
### Training with 1% of the entire OAS data
I've reproduced AntiBERTy training on a tiny subset (~1%) of the entire OAS data (`batch_size=16`, `mask_prob=0.15`) and observed a reasonable decrease in training loss, though validation loss was not tracked.
The training log can be found [here](https://api.wandb.ai/links/dohlee/qqzxgo1v).

![training_log](img/training.png)
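The `mask_prob=0.15` corruption step can be sketched generically in PyTorch. This is a minimal BERT-style masking routine (a sketch of the standard technique, not the repository's actual implementation; `mask_token_id` is a hypothetical special-token index):

```python
import torch

def mask_tokens(tokens: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    """Randomly replace a fraction of tokens with [MASK].

    Returns the corrupted input and the labels, where unmasked positions are
    set to -100 so nn.CrossEntropyLoss ignores them.
    """
    labels = tokens.clone()
    mask = torch.rand(tokens.shape) < mask_prob
    labels[~mask] = -100                 # loss is computed on masked positions only
    corrupted = tokens.clone()
    corrupted[mask] = mask_token_id
    return corrupted, labels

batch = torch.randint(0, 25, (16, 128))  # batch_size=16, sequence length 128
inputs, labels = mask_tokens(batch, mask_token_id=24)
```

The model is then trained to recover the original tokens at the masked positions, which is the loss curve shown in the training log above.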
## Observed Antibody Sequences (OAS) dataset preparation pipeline
I wrote a `snakemake` pipeline in the `data` directory to automate dataset preparation. It downloads metadata from [OAS](https://opig.stats.ox.ac.uk/webapps/oas/oas) and extracts the lists of sequences. The pipeline can be run as follows:
```bash
$ cd data
$ snakemake -s download.smk -j1
```

*NOTE: Only 3% of the entire OAS sequences (83M sequences, 31 GB) have been downloaded so far due to storage and computational costs.*
## Citation
```bibtex
@article{ruffolo2021deciphering,
  title   = {Deciphering antibody affinity maturation with language models and weakly supervised learning},
  author  = {Ruffolo, Jeffrey A and Gray, Jeffrey J and Sulam, Jeremias},
  journal = {arXiv},
  year    = {2021}
}
```