# DNA-xLSTM
This repository provides the code necessary to reproduce the experiments presented in the paper [Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences](https://arxiv.org/abs/2411.04165). The code is organized across the following repositories:
- [DNA-xLSTM](https://github.com/ml-jku/DNA-xLSTM/) (current repository)
- [Prot-xLSTM](https://github.com/ml-jku/Prot-xLSTM/)
- [Chem-xLSTM](https://github.com/ml-jku/Chem-xLSTM/)

## DNA-xLSTM
### Installation
To get started, create a conda environment containing the required dependencies.
```bash
conda env create -f xlstm_dna_env.yml
```

Activate the environment.
```bash
conda activate dna_xlstm
```

Create the following directory to store saved models:
```bash
mkdir outputs
```
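Optionally, verify the environment before training. A minimal sanity check, assuming the environment ships PyTorch (which the training code requires):

```python
# Optional sanity check for the freshly created environment.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```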
### Data Preparation
(Data downloading instructions are adapted from the [HyenaDNA repo](https://github.com/HazyResearch/hyena-dna?tab=readme-ov-file#pretraining-on-human-reference-genome).)
First, download the Human Reference Genome data.
It consists of two files: one with all the sequences (the `.fasta` file) and one with the intervals we use (the `.bed` file). The file structure should look like:
```
data
|-- hg38/
|-- hg38.ml.fa
|-- human-sequences.bed
```

Download the FASTA (`.fa`) file containing the entire human genome into `./data/hg38`.
The whole genome comprises ~24 chromosomes merged into this single file; each chromosome is stored as one continuous sequence.
Then download the `.bed` file with sequence intervals. Each record contains a chromosome name, start, end, and split, which allows the corresponding sequence to be retrieved from the FASTA file.
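For intuition, once both files are in place (download commands below), an interval can be resolved to its sequence roughly as follows. This is an illustrative sketch assuming `pandas` and `pyfaidx`, which this README does not explicitly pin:

```python
import pandas as pd
from pyfaidx import Fasta

# Interval table described above: chromosome, start, end, split
# (column layout assumed from the description; adjust if the file differs).
bed = pd.read_csv(
    "data/hg38/human-sequences.bed",
    sep="\t",
    header=None,
    names=["chrom", "start", "end", "split"],
)

# Index the genome FASTA so regions can be sliced by coordinate.
genome = Fasta("data/hg38/hg38.ml.fa")

# Resolve the first training interval to its DNA sequence.
row = bed[bed["split"] == "train"].iloc[0]
sequence = str(genome[row["chrom"]][row["start"]:row["end"]])
print(row["chrom"], row["start"], row["end"], len(sequence))
```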
```bash
mkdir -p data/hg38/
curl https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/data/hg38/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
curl https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/data/hg38/human-sequences.bed > data/hg38/human-sequences.bed
gunzip data/hg38/hg38.ml.fa.gz # unzip the fasta file
```

### Pre-Training on the Human Genome
To pre-train a model on the human reference genome, move to the `scripts_pretrain` directory, then adapt and run the provided shell scripts, which detail model and training arguments for all supported models: xLSTM, Mamba, Caduceus, Transformer++ (Llama), and Hyena.
```
scripts_pretrain
|-- run_pretrain_xlstm.sh
|-- run_pretrain_mamba.sh
|-- run_pretrain_caduceus.sh
|-- run_pretrain_hyena.sh
|-- run_pretrain_llama.sh
```

```bash
cd scripts_pretrain
sh run_pretrain_xlstm.sh
```

Alternatively, you can launch pre-training directly from the command line. The following command trains a small bidirectional mLSTM with reverse-complement augmentation. For more details on the xLSTM arguments, see `scripts_pretrain/run_pretrain_xlstm.sh`.
```bash
python train.py \
experiment=hg38/hg38 \
callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
dataset.max_length=1024 \
dataset.batch_size=1024 \
dataset.mlm=true \
dataset.mlm_probability=0.15 \
dataset.rc_aug=true \
model=xlstm \
model.config.d_model=128 \
model.config.n_layer=4 \
model.config.max_length=1024 \
model.config.s_lstm_at=[] \
model.config.m_qkv_proj_blocksize=4 \
model.config.m_num_heads=4 \
model.config.m_proj_factor=2.0 \
model.config.m_backend="chunkwise" \
model.config.m_chunk_size=1024 \
model.config.m_backend_bidirectional=false \
model.config.m_position_embeddings=false \
model.config.bidirectional=true \
model.config.bidirectional_alternating=false \
model.config.rcps=false \
optimizer.lr="8e-3" \
train.global_batch_size=8 \
trainer.max_steps=10000 \
trainer.precision=bf16 \
+trainer.val_check_interval=10000 \
wandb=null
```
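For intuition, the reverse-complement augmentation enabled by `dataset.rc_aug=true` can be sketched in a few lines. This illustrates the general technique, not the repository's actual data pipeline:

```python
import random

# Watson-Crick complements; "N" (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def rc_augment(seq: str, p: float = 0.5) -> str:
    """With probability p, replace a sequence by its reverse complement."""
    return reverse_complement(seq) if random.random() < p else seq

print(rc_augment("ACGTTTAC"))
```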
### xLSTM Model Weights
Pre-trained xLSTM model weights can be downloaded from [here](https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/checkpoints). Create a `checkpoints` directory in the repository root and store the downloaded weights there. For downstream fine-tuning, the following directory structure is expected:
```
checkpoints
|-- context_1k
|-- context_32k
```
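To sanity-check downloaded weights, a checkpoint can be inspected with plain PyTorch. A minimal sketch, assuming the weights are standard PyTorch/Lightning checkpoint files; the file name `last.ckpt` below is a placeholder for whatever the download provides:

```python
import torch

# Placeholder path: substitute the actual checkpoint file you downloaded.
# weights_only=False is needed for Lightning-style checkpoints on newer torch.
state = torch.load(
    "checkpoints/context_1k/last.ckpt",
    map_location="cpu",
    weights_only=False,
)

# Lightning-style checkpoints nest model weights under "state_dict";
# fall back to the object itself for a bare state dict.
weights = state.get("state_dict", state)
for name, tensor in list(weights.items())[:5]:
    print(name, tuple(tensor.shape))
```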
### Downstream Fine-Tuning
We support two downstream task collections for fine-tuning pre-trained models: **Genomic Benchmarks** and **Nucleotide Transformer** datasets.
[Genomic Benchmarks](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks), introduced in [Grešová et al. (2023)](https://bmcgenomdata.biomedcentral.com/articles/10.1186/s12863-023-01123-8), is a set of 8 classification tasks. The Nucleotide Transformer tasks comprise 18 classification datasets originally used in [Dalla-Torre et al. (2023)](https://www.biorxiv.org/content/10.1101/2023.01.11.523679v1). The full task collection is hosted on Hugging Face: [InstaDeepAI/nucleotide_transformer_downstream_tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks).
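As an illustration, the Nucleotide Transformer tasks can be pulled directly from Hugging Face with the `datasets` library. A sketch; the task name `"enhancers"` and the column names are assumptions to check against the dataset card:

```python
from datasets import load_dataset

# Load one Nucleotide Transformer task; "enhancers" is one example task name
# (consult the dataset card for the full list). trust_remote_code may be
# required for script-based datasets.
ds = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks",
    "enhancers",
    split="train",
    trust_remote_code=True,
)

# Each record pairs a DNA sequence with its class label.
print(ds[0]["sequence"][:50], ds[0]["label"])
```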
Scripts to fine-tune pre-trained models are provided in `scripts_downstream` and can be adapted to perform hyperparameter sweeps.
```
scripts_downstream
|-- run_genomics.sh
|-- run_nucleotide.sh
```

```bash
cd scripts_downstream
```

For Genomic Benchmarks:
```bash
sh run_genomics.sh
```

For Nucleotide Transformer tasks:
```bash
sh run_nucleotide.sh
```

### Acknowledgements
This repository is adapted from the [Caduceus](https://github.com/kuleshov-group/caduceus) repository and leverages much of the training, data loading, and logging infrastructure defined there. Caduceus itself is derived from the [HyenaDNA](https://github.com/HazyResearch/hyena-dna) codebase, which was originally built from the [S4](https://github.com/state-spaces/s4) and [Safari](https://github.com/HazyResearch/safari) repositories.
### Citation
```bibtex
@article{schmidinger2024bio-xlstm,
  title={{Bio-xLSTM}: Generative modeling, representation and in-context learning of biological and chemical sequences},
  author={Niklas Schmidinger and Lisa Schneckenreiter and Philipp Seidl and Johannes Schimunek and Pieter-Jan Hoedt and Johannes Brandstetter and Andreas Mayr and Sohvi Luukkonen and Sepp Hochreiter and Günter Klambauer},
  journal={arXiv preprint arXiv:2411.04165},
  year={2024},
  url={https://arxiv.org/abs/2411.04165}
}
```