https://github.com/songlab-cal/gpn

Genomic Pre-trained Network

dna genomics language-model variant-effect-prediction


Host: GitHub
URL: https://github.com/songlab-cal/gpn
Owner: songlab-cal
License: mit
Created: 2022-02-22T23:21:19.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2025-03-30T19:53:45.000Z (6 months ago)
Last Synced: 2025-03-30T20:28:57.546Z (6 months ago)
Topics: dna, genomics, language-model, variant-effect-prediction
Language: Jupyter Notebook
Homepage: https://doi.org/10.1073/pnas.2311219120
Size: 52.1 MB
Stars: 254
Watchers: 8
Forks: 39
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE


# GPN (Genomic Pre-trained Network)

[![hgt_genome_392c4_a47ce0](https://github.com/user-attachments/assets/282b6204-156b-4b6d-83ff-2f4a53a9bb2e)](https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis)

 

Code and resources from [GPN](https://doi.org/10.1073/pnas.2311219120) and related genomic language models.

## Table of contents

- [Installation](#installation)

- [Quick start](#quick-start)

- [Modeling frameworks](#modeling-frameworks)

- [Applications of the models](#applications-of-the-models)

- [GPN](#gpn)

- [GPN-MSA](#gpn-msa)

- [PhyloGPN](#phylogpn)

- [Citation](#citation)

## Installation

```bash
pip install git+https://github.com/songlab-cal/gpn.git
```

## Quick start

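```python
# Importing gpn.model makes the GPN model classes available to the transformers Auto* loaders.
import gpn.model
from transformers import AutoModelForMaskedLM, AutoModel

# Load one of the following checkpoints, depending on the framework you need.

# GPN (single sequence)
model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-brassicales")

# or GPN-MSA
model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-msa-sapiens")

# or PhyloGPN (its model code is fetched from the Hub, hence trust_remote_code)
model = AutoModel.from_pretrained("songlab/PhyloGPN", trust_remote_code=True)
```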

## Modeling frameworks

| Model | Paper | Notes |
| --------- | --- | ----------- |
| GPN | [Benegas et al. 2023](https://doi.org/10.1073/pnas.2311219120) | Requires unaligned genomes |
| GPN-MSA | [Benegas et al. 2025](https://www.nature.com/articles/s41587-024-02511-w) | Requires aligned genomes for both training and inference |
| PhyloGPN | [Albors et al. 2025] | Uses an alignment during training, but does not require it for inference or fine-tuning |

## Applications of the models

| Paper | Model | Dataset | Code | Resources on HuggingFace 🤗 |
| -- | --- | ------- | ---- | -------------- |
| [Benegas et al. 2023](https://doi.org/10.1073/pnas.2311219120) | GPN | Arabidopsis and other Brassicales | [analysis/gpn_arabidopsis](analysis/gpn_arabidopsis) | [Model, dataset, intermediate results](https://huggingface.co/collections/songlab/gpn-653191edcb0270ed05ad2c3e) |
| [Benegas et al. 2025](https://www.nature.com/articles/s41587-024-02511-w) | GPN-MSA | Human and other vertebrates | [analysis/gpn-msa_human](analysis/gpn-msa_human) | [Model, dataset, benchmarks, predictions](https://huggingface.co/collections/songlab/gpn-msa-65319280c93c85e11c803887) |
| [Benegas et al. 2025b](https://www.biorxiv.org/content/10.1101/2025.02.11.637758v1) | GPN | Animal promoters | [analysis/gpn_animal_promoter](analysis/gpn_animal_promoter) | [Model, dataset, benchmarks](https://huggingface.co/collections/songlab/traitgym-6796d4fbb825d5b94e65d30f) |

 

## GPN

GPN is also referred to as GPN-SS (single sequence).

### Examples

* Play with the model: `examples/ss/basic_example.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/songlab-cal/gpn/blob/main/examples/ss/basic_example.ipynb)

### Training on your own data

1. [Snakemake workflow to create a dataset](workflow/make_dataset)

    - Can automatically download data from NCBI given a list of accessions, or use your own fasta files.

2. Training

    - Will automatically detect all available GPUs.

    - Track metrics on [Weights & Biases](https://wandb.ai/)

    - Implemented encoders: `convnet` (default), `roformer` (Transformer), `bytenet`

    - Specify config overrides: e.g. `--config_overrides encoder=bytenet,num_hidden_layers=30`

    - How many steps you can train before overfitting depends on the size and diversity of your dataset

    - Example:

```bash
# One process is launched per GPU listed in CUDA_VISIBLE_DEVICES (e.g. 0,1,2,3), so the variable must be set.
WANDB_PROJECT=your_project torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_mlm --do_train --do_eval \
    --report_to wandb --prediction_loss_only True --remove_unused_columns False \
    --dataset_name results/dataset --tokenizer_name gonzalobenegas/tokenizer-dna-mlm \
    --soft_masked_loss_weight_train 0.1 --soft_masked_loss_weight_evaluation 0.0 \
    --weight_decay 0.01 --optim adamw_torch \
    --dataloader_num_workers 16 --seed 42 \
    --save_strategy steps --save_steps 10000 --evaluation_strategy steps \
    --eval_steps 10000 --logging_steps 10000 --max_steps 120000 --warmup_steps 1000 \
    --learning_rate 1e-3 --lr_scheduler_type constant_with_warmup \
    --run_name your_run --output_dir your_output_dir --model_type GPN \
    --per_device_train_batch_size 512 --per_device_eval_batch_size 512 --gradient_accumulation_steps 1 --total_batch_size 2048 \
    --torch_compile \
    --ddp_find_unused_parameters False \
    --bf16 --bf16_full_eval
```
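The `--nproc_per_node` expression in the training command counts the comma-separated entries in `CUDA_VISIBLE_DEVICES`. The sketch below mirrors that count in Python and works through the batch-size arithmetic, assuming the usual data-parallel relation (effective batch = per-device batch × number of processes × gradient accumulation steps). The helper names are hypothetical and not part of `gpn`, and how `gpn.ss.run_mlm` itself interprets `--total_batch_size` is an assumption here.

```python
import os


def count_visible_devices(env=None):
    """Count comma-separated entries in CUDA_VISIBLE_DEVICES, like awk's NF."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    # An unset or empty variable gives 0, as awk reports NF=0 for an empty line.
    return len(value.split(",")) if value else 0


def accumulation_steps(total_batch_size, per_device_batch_size, n_devices):
    """Accumulation steps needed to reach total_batch_size (assumed relation)."""
    per_step = per_device_batch_size * n_devices
    if n_devices < 1 or total_batch_size % per_step != 0:
        raise ValueError("total_batch_size must be a multiple of per_device * n_devices")
    return total_batch_size // per_step


# With 4 GPUs, 512 sequences per device already add up to 2048 per optimizer step.
n_devices = count_visible_devices({"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
print(n_devices, accumulation_steps(2048, 512, n_devices))
```

With only two GPUs, the same target of 2048 would need two accumulation steps under this relation.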

3. Extract embeddings

    - Input file requires `chrom`, `start`, `end`

    - Example:

```bash
torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.get_embeddings windows.parquet genome.fa.gz 100 your_output_dir \
    results.parquet --per_device_batch_size 4000 --is_file --dataloader_num_workers 16
```
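The embeddings input needs `chrom`, `start` and `end` columns. Below is a minimal sketch of tiling chromosomes into fixed-size windows. It assumes 0-based, half-open coordinates and that the `100` in the embeddings command is the window size; neither is confirmed here. `make_windows` and the chromosome lengths are hypothetical.

```python
def make_windows(chrom_sizes, window_size, step=None):
    """Tile each chromosome into windows of window_size, dropping a short tail."""
    step = window_size if step is None else step
    rows = []
    for chrom, size in chrom_sizes.items():
        # The last start is chosen so that every window fits inside the chromosome.
        for start in range(0, size - window_size + 1, step):
            rows.append({"chrom": chrom, "start": start, "end": start + window_size})
    return rows


# Toy chromosome lengths, for illustration only.
windows = make_windows({"1": 250, "2": 120}, window_size=100)
for row in windows:
    print(row)
```

The rows can then be written out with, for example, `pandas.DataFrame(windows).to_parquet("windows.parquet")`, which requires pandas and a parquet engine such as pyarrow.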

4. Variant effect prediction

    - Input file requires `chrom`, `pos`, `ref`, `alt`

    - Example:

```bash
torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_vep variants.parquet genome.fa.gz 512 your_output_dir results.parquet \
    --per_device_batch_size 4000 --is_file --dataloader_num_workers 16
```
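The variant input needs `chrom`, `pos`, `ref` and `alt` columns. Below is a small sketch of sanity checks one might run before scoring. It assumes single-nucleotide variants, 1-based positions, and a reference held in memory as a dict of sequences; all of these are illustrative assumptions. `check_variant` is a hypothetical helper, not part of `gpn`.

```python
NUCLEOTIDES = set("ACGT")


def check_variant(variant, genome):
    """Return None if the variant looks valid, else a short description of the problem."""
    chrom, pos, ref, alt = (variant[key] for key in ("chrom", "pos", "ref", "alt"))
    if ref not in NUCLEOTIDES or alt not in NUCLEOTIDES or ref == alt:
        return "not a single-nucleotide substitution"
    seq = genome.get(chrom)
    if seq is None or not 1 <= pos <= len(seq):
        return "position outside the reference"
    # Soft-masked (lowercase) reference bases are compared case-insensitively.
    if seq[pos - 1].upper() != ref:
        return "ref does not match the reference base"
    return None


# Toy reference and variants, for illustration only.
genome = {"1": "ACGTacgt"}
variants = [
    {"chrom": "1", "pos": 2, "ref": "C", "alt": "T"},
    {"chrom": "1", "pos": 3, "ref": "A", "alt": "G"},
]
for variant in variants:
    print(variant, check_variant(variant, genome))
```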

## GPN-MSA

### Examples

* Play with the model: `examples/msa/basic_example.ipynb`

* Variant effect prediction: `examples/msa/vep.ipynb`

* Training (human): `examples/msa/training.ipynb`

### Training on other species (e.g. other vertebrates, plants)

* See https://github.com/songlab-cal/gpn/issues/28, https://github.com/songlab-cal/gpn/discussions/40, https://github.com/songlab-cal/gpn/issues/44

* Another source for plant alignments: https://plantregmap.gao-lab.org/download.php#alignment-conservation

## PhyloGPN

PhyloGPN is a convolutional neural network that takes encoded DNA sequences as input and outputs rate matrix parameters for [Felsenstein's 1981 model](https://en.wikipedia.org/wiki/Models_of_DNA_evolution#F81_model_(Felsenstein_1981)) (the F81 model, for short). It was trained to maximize the likelihood of columns in the [Zoonomia alignment](https://cglgenomics.ucsc.edu/november-2023-nature-zoonomia-with-expanded-primates-alignment/) given a phylogenetic tree. The stationary distribution of the substitution process described by the F81 model indicates the relative viability of each allele at any given locus. As a result, PhyloGPN is formally a (single-sequence) genomic language model. It can be used for transfer learning and zero-shot SNV deleteriousness prediction. It is especially useful for sequences that are not directly in the human reference genome.
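To make the F81 connection concrete: under F81 with stationary distribution `pi`, the probability of observing allele `j` after time `t` when starting from allele `i` is `exp(-beta*t) * [i == j] + (1 - exp(-beta*t)) * pi[j]`, where `beta = 1 / (1 - sum(pi[k]**2))` if the rate matrix is scaled to one expected substitution per unit time. The sketch below evaluates this for a made-up `pi` (not PhyloGPN output) and computes a log-ratio of stationary probabilities as one plausible zero-shot score for an SNV. Whether PhyloGPN scores variants exactly this way is an assumption, and the function names are hypothetical.

```python
import math

ALLELES = "ACGT"


def f81_transition_matrix(pi, t):
    """F81 transition probabilities P[i][j](t) for stationary distribution pi."""
    # Scale so that the expected number of substitutions per unit time is 1.
    beta = 1.0 / (1.0 - sum(p * p for p in pi))
    decay = math.exp(-beta * t)
    n = len(pi)
    return [[decay * (i == j) + (1.0 - decay) * pi[j] for j in range(n)] for i in range(n)]


def log_ratio_score(pi, ref, alt):
    """log(pi[alt] / pi[ref]); negative when alt is less favored than ref (assumed score)."""
    return math.log(pi[ALLELES.index(alt)]) - math.log(pi[ALLELES.index(ref)])


# Hypothetical stationary distribution at one locus, strongly favoring A.
pi = [0.7, 0.1, 0.1, 0.1]
P = f81_transition_matrix(pi, t=0.5)
score = log_ratio_score(pi, ref="A", alt="C")
print([round(sum(row), 6) for row in P], round(score, 3))
```

As `t` grows, every row of the matrix approaches `pi`, which is why the stationary distribution can be read as the long-run preference among alleles at a locus.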

## Citation

[GPN](https://doi.org/10.1073/pnas.2311219120):

```bibtex
@article{benegas2023dna,
  title={DNA language models are powerful predictors of genome-wide variant effects},
  author={Benegas, Gonzalo and Batra, Sanjit Singh and Song, Yun S},
  journal={Proceedings of the National Academy of Sciences},
  volume={120},
  number={44},
  pages={e2311219120},
  year={2023},
  publisher={National Acad Sciences}
}
```

[GPN-MSA](https://www.nature.com/articles/s41587-024-02511-w):

```bibtex
@article{benegas2025dna,
  title={A DNA language model based on multispecies alignment predicts the effects of genome-wide variants},
  author={Benegas, Gonzalo and Albors, Carlos and Aw, Alan J and Ye, Chengzhong and Song, Yun S},
  journal={Nature Biotechnology},
  pages={1--6},
  year={2025},
  publisher={Nature Publishing Group US New York}
}
```
