# DNA-xLSTM
This repository provides the code necessary to reproduce the experiments presented in the paper [Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences](https://arxiv.org/abs/2411.04165). The code is organized across the following repositories:
- [DNA-xLSTM](https://github.com/ml-jku/DNA-xLSTM/) (current repository)
- [Prot-xLSTM](https://github.com/ml-jku/Prot-xLSTM/)
- [Chem-xLSTM](https://github.com/ml-jku/Chem-xLSTM/)

## DNA-xLSTM
### Installation
To get started, create a conda environment containing the required dependencies.
```bash
conda env create -f xlstm_dna_env.yml
```

Activate the environment.
```bash
conda activate dna_xlstm
```

Create the following directory to store saved models:
```bash
mkdir outputs
```
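Optionally, verify the environment before training. A minimal sanity check, assuming the environment ships PyTorch (which the training code requires):

```python
# Optional sanity check for the freshly created environment.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```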
### Data Preparation
(Data downloading instructions are adapted from the [HyenaDNA repo](https://github.com/HazyResearch/hyena-dna?tab=readme-ov-file#pretraining-on-human-reference-genome).)
First, download the Human Reference Genome data.
It consists of two files: one with all the sequences (the `.fasta` file) and one with the intervals we use (the `.bed` file). The file structure should look like:
```
data
|-- hg38/
|-- hg38.ml.fa
|-- human-sequences.bed
```

Download the FASTA (`.fa`) file containing the entire human genome into `./data/hg38`.
The whole genome comprises ~24 chromosomes merged into this single file; each chromosome is stored as one continuous sequence.
Then download the `.bed` file with sequence intervals. Each record contains a chromosome name, start, end, and split, which allows the corresponding sequence to be retrieved from the FASTA file.
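For intuition, once both files are in place (download commands below), an interval can be resolved to its sequence roughly as follows. This is an illustrative sketch assuming `pandas` and `pyfaidx`, which this README does not explicitly pin:

```python
import pandas as pd
from pyfaidx import Fasta

# Interval table described above: chromosome, start, end, split
# (column layout assumed from the description; adjust if the file differs).
bed = pd.read_csv(
    "data/hg38/human-sequences.bed",
    sep="\t",
    header=None,
    names=["chrom", "start", "end", "split"],
)

# Index the genome FASTA so regions can be sliced by coordinate.
genome = Fasta("data/hg38/hg38.ml.fa")

# Resolve the first training interval to its DNA sequence.
row = bed[bed["split"] == "train"].iloc[0]
sequence = str(genome[row["chrom"]][row["start"]:row["end"]])
print(row["chrom"], row["start"], row["end"], len(sequence))
```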
```bash
mkdir -p data/hg38/
curl https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/data/hg38/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
curl https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/data/hg38/human-sequences.bed > data/hg38/human-sequences.bed
gunzip data/hg38/hg38.ml.fa.gz # unzip the fasta file
```

### Pre-Training on the Human Genome
To pre-train a model on the human reference genome, move to the `scripts_pretrain` directory, then adapt and run the provided shell scripts, which detail model and training arguments for all supported models: xLSTM, Mamba, Caduceus, Transformer++ (Llama), and Hyena.
```
scripts_pretrain
|-- run_pretrain_xlstm.sh
|-- run_pretrain_mamba.sh
|-- run_pretrain_caduceus.sh
|-- run_pretrain_hyena.sh
|-- run_pretrain_llama.sh
```

```bash
cd scripts_pretrain
sh run_pretrain_xlstm.sh
```

Alternatively, you can launch pre-training directly from the command line. The following command trains a small bidirectional mLSTM with reverse-complement augmentation. For more details on the xLSTM arguments, see `scripts_pretrain/run_pretrain_xlstm.sh`.
```bash
python train.py \
experiment=hg38/hg38 \
callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
dataset.max_length=1024 \
dataset.batch_size=1024 \
dataset.mlm=true \
dataset.mlm_probability=0.15 \
dataset.rc_aug=true \
model=xlstm \
model.config.d_model=128 \
model.config.n_layer=4 \
model.config.max_length=1024 \
model.config.s_lstm_at=[] \
model.config.m_qkv_proj_blocksize=4 \
model.config.m_num_heads=4 \
model.config.m_proj_factor=2.0 \
model.config.m_backend="chunkwise" \
model.config.m_chunk_size=1024 \
model.config.m_backend_bidirectional=false \
model.config.m_position_embeddings=false \
model.config.bidirectional=true \
model.config.bidirectional_alternating=false \
model.config.rcps=false \
optimizer.lr="8e-3" \
train.global_batch_size=8 \
trainer.max_steps=10000 \
trainer.precision=bf16 \
+trainer.val_check_interval=10000 \
wandb=null
```
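For intuition, the reverse-complement augmentation enabled by `dataset.rc_aug=true` can be sketched in a few lines. This illustrates the general technique, not the repository's actual data pipeline:

```python
import random

# Watson-Crick complements; "N" (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def rc_augment(seq: str, p: float = 0.5) -> str:
    """With probability p, replace a sequence by its reverse complement."""
    return reverse_complement(seq) if random.random() < p else seq

print(rc_augment("ACGTTTAC"))
```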
### xLSTM Model Weights
Pre-trained xLSTM model weights can be downloaded from [here](https://ml.jku.at/research/Bio-xLSTM/downloads/DNA-xLSTM/checkpoints). Create a `checkpoints` directory in the repository root and store the downloaded weights there. For downstream fine-tuning, the following directory structure is expected:
```
checkpoints
|-- context_1k
|-- context_32k
```
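To sanity-check downloaded weights, a checkpoint can be inspected with plain PyTorch. A minimal sketch, assuming the weights are standard PyTorch/Lightning checkpoint files; the file name `last.ckpt` below is a placeholder for whatever the download provides:

```python
import torch

# Placeholder path: substitute the actual checkpoint file you downloaded.
# weights_only=False is needed for Lightning-style checkpoints on newer torch.
state = torch.load(
    "checkpoints/context_1k/last.ckpt",
    map_location="cpu",
    weights_only=False,
)

# Lightning-style checkpoints nest model weights under "state_dict";
# fall back to the object itself for a bare state dict.
weights = state.get("state_dict", state)
for name, tensor in list(weights.items())[:5]:
    print(name, tuple(tensor.shape))
```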
### Downstream Fine-Tuning
We support two downstream task collections for fine-tuning pre-trained models: **Genomic Benchmarks** and **Nucleotide Transformer** datasets.
[Genomic Benchmarks](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks), introduced in [Grešová et al. (2023)](https://bmcgenomdata.biomedcentral.com/articles/10.1186/s12863-023-01123-8), is a set of 8 classification tasks. The Nucleotide Transformer tasks comprise 18 classification datasets originally used in [Dalla-Torre et al. (2023)](https://www.biorxiv.org/content/10.1101/2023.01.11.523679v1). The full task collection is hosted on Hugging Face: [InstaDeepAI/nucleotide_transformer_downstream_tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks).
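As an illustration, the Nucleotide Transformer tasks can be pulled directly from Hugging Face with the `datasets` library. A sketch; the task name `"enhancers"` and the column names are assumptions to check against the dataset card:

```python
from datasets import load_dataset

# Load one Nucleotide Transformer task; "enhancers" is one example task name
# (consult the dataset card for the full list). trust_remote_code may be
# required for script-based datasets.
ds = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks",
    "enhancers",
    split="train",
    trust_remote_code=True,
)

# Each record pairs a DNA sequence with its class label.
print(ds[0]["sequence"][:50], ds[0]["label"])
```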
Scripts to fine-tune pre-trained models are provided in `scripts_downstream` and can be adapted to perform hyperparameter sweeps.
```
scripts_downstream
|-- run_genomics.sh
|-- run_nucleotide.sh
```

```bash
cd scripts_downstream
```

For Genomic Benchmarks:
```bash
sh run_genomics.sh
```

For Nucleotide Transformer tasks:
```bash
sh run_nucleotide.sh
```

### Acknowledgements
This repository is adapted from the [Caduceus](https://github.com/kuleshov-group/caduceus) repository and leverages much of the training, data loading, and logging infrastructure defined there. Caduceus itself is derived from the [HyenaDNA](https://github.com/HazyResearch/hyena-dna) codebase, which was originally built from the [S4](https://github.com/state-spaces/s4) and [Safari](https://github.com/HazyResearch/safari) repositories.
### Citation
```bibtex
@article{schmidinger2024bio-xlstm,
  title={{Bio-xLSTM}: Generative modeling, representation and in-context learning of biological and chemical sequences},
  author={Niklas Schmidinger and Lisa Schneckenreiter and Philipp Seidl and Johannes Schimunek and Pieter-Jan Hoedt and Johannes Brandstetter and Andreas Mayr and Sohvi Luukkonen and Sepp Hochreiter and Günter Klambauer},
  journal={arXiv preprint arXiv:2411.04165},
  year={2024},
  url={https://arxiv.org/abs/2411.04165}
}
```