Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dohlee/antiberty-pytorch
An unofficial re-implementation of AntiBERTy, an antibody-specific protein language model, in PyTorch.
- Host: GitHub
- URL: https://github.com/dohlee/antiberty-pytorch
- Owner: dohlee
- License: mit
- Created: 2023-03-23T08:58:48.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-03-21T14:25:13.000Z (8 months ago)
- Last Synced: 2024-09-23T20:36:35.427Z (about 2 months ago)
- Topics: antibody-sequence, antibody-sequences, bioinformatics, biology, protein, protein-language-model, protein-sequences
- Language: Jupyter Notebook
- Size: 228 KB
- Stars: 22
- Watchers: 3
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# antiberty-pytorch
[![Lightning](https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white)](https://github.com/Lightning-AI/lightning)

![antiberty_model](img/banner.png)
## Installation
```bash
$ pip install antiberty-pytorch
```

## Reproduction status
### Number of parameters
![numparams](img/antiberty_num_params.png)
This implementation of AntiBERTy has 25,759,769 parameters in total, which matches the ~26M parameters specified in the paper (see above).
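Counting parameters in PyTorch is a one-liner over `model.parameters()`. The sketch below applies it to a toy BERT-style encoder; the layer sizes here are illustrative stand-ins, not the actual AntiBERTy configuration:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy stand-in: an embedding, a small Transformer encoder stack, and a
# masked-token prediction head (sizes are hypothetical, for illustration).
toy = nn.Sequential(
    nn.Embedding(25, 512),  # 20 amino acids + a few special tokens
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=8,
    ),
    nn.Linear(512, 25),     # project back to the token vocabulary
)
print(f"{count_parameters(toy):,} parameters")
```

Running the same counter over this repository's model is how the 25,759,769 figure above can be verified.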
### Training with 1% of the entire OAS data
I've reproduced AntiBERTy training on a tiny subset (~1%) of the entire OAS data (`batch_size=16`, `mask_prob=0.15`) and observed a reasonable decrease in training loss, though validation loss was not tracked.
The training log can be found [here](https://api.wandb.ai/links/dohlee/qqzxgo1v).

![training_log](img/training.png)
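The `mask_prob=0.15` corruption step can be sketched generically in PyTorch. This is a minimal BERT-style masking routine (a sketch of the standard technique, not the repository's actual implementation; `mask_token_id` is a hypothetical special-token index):

```python
import torch

def mask_tokens(tokens: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    """Randomly replace a fraction of tokens with [MASK].

    Returns the corrupted input and the labels, where unmasked positions are
    set to -100 so nn.CrossEntropyLoss ignores them.
    """
    labels = tokens.clone()
    mask = torch.rand(tokens.shape) < mask_prob
    labels[~mask] = -100                 # loss is computed on masked positions only
    corrupted = tokens.clone()
    corrupted[mask] = mask_token_id
    return corrupted, labels

batch = torch.randint(0, 25, (16, 128))  # batch_size=16, sequence length 128
inputs, labels = mask_tokens(batch, mask_token_id=24)
```

The model is then trained to recover the original tokens at the masked positions, which is the loss curve shown in the training log above.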
## Observed Antibody Sequences (OAS) dataset preparation pipeline
I wrote a `snakemake` pipeline in the `data` directory to automate dataset preparation. It downloads metadata from [OAS](https://opig.stats.ox.ac.uk/webapps/oas/oas) and extracts the lists of sequences. The pipeline can be run as follows:
```bash
$ cd data
$ snakemake -s download.smk -j1
```

*NOTE: Only 3% of the entire OAS sequences (83M sequences, 31 GB) have been downloaded so far due to storage and computational costs.*
## Citation
```bibtex
@article{ruffolo2021deciphering,
  title   = {Deciphering antibody affinity maturation with language models and weakly supervised learning},
  author  = {Ruffolo, Jeffrey A and Gray, Jeffrey J and Sulam, Jeremias},
  journal = {arXiv},
  year    = {2021}
}
```