{"id":31730897,"url":"https://github.com/MAGICS-LAB/DNABERT_2","last_synced_at":"2025-10-09T07:45:29.386Z","repository":{"id":179262122,"uuid":"657481315","full_name":"MAGICS-LAB/DNABERT_2","owner":"MAGICS-LAB","description":"[ICLR 2024] DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome","archived":false,"fork":false,"pushed_at":"2025-08-14T19:54:59.000Z","size":875,"stargazers_count":404,"open_issues_count":48,"forks_count":86,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-08-14T21:32:31.132Z","etag":null,"topics":["covid","dataset","dna","dna-processing","dna-training","genome","genome-analysis","language-model","promoter","promoter-analysis","promoters","splice","splice-site","transcription-factor-binding","transcription-factor-binding-site","transcription-factors"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MAGICS-LAB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-06-23T06:48:01.000Z","updated_at":"2025-08-14T19:55:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"a1404e22-dd27-4215-b11a-de69cd81b795","html_url":"https://github.com/MAGICS-LAB/DNABERT_2","commit_stats":null,"previous_names":["zhihan1996/dnabert_2","magics-lab/dnabert_2"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MAGICS-LAB/DNABERT_2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MAGICS-LAB%2FDNABERT_2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MAGICS-LAB%2FDNABERT_2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MAGICS-LAB%2FDNABERT_2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MAGICS-LAB%2FDNABERT_2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MAGICS-LAB","download_url":"https://codeload.github.com/MAGICS-LAB/DNABERT_2/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MAGICS-LAB%2FDNABERT_2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279000975,"owners_count":26082974,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["covid","dataset","dna","dna-processing","dna-training","genome","genome-analysis","language-model","promoter","promoter-analysis","promoters","splice","splice-site"
,"transcription-factor-binding","transcription-factor-binding-site","transcription-factors"],"created_at":"2025-10-09T07:45:22.704Z","updated_at":"2025-10-09T07:45:29.377Z","avatar_url":"https://github.com/MAGICS-LAB.png","language":"Shell","readme":"# DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome\n\nThe repo contains: \n\n1. The official implementation of [DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome](https://arxiv.org/abs/2306.15006)\n2. Genome Understanding Evaluation (GUE): a comprehensize benchmark containing 28 datasets for multi-species genome understanding benchmark.\n\n\n\n## Contents\n\n- [1. Introduction](#1-introduction)\n- [2. Model and Data](#2-model-and-data)\n- [3. Setup Environment](#3-setup-environment)\n- [4. Quick Start](#4-quick-start)\n- [5. Pre-Training](#5-pre-training)\n- [6. Finetune](#6-finetune)\n- [7. Citation](#7-citation)\n\n\n\n## Update (2024/02/14)\n\nWe publish DNABERT-S,  a foundation model based on DNABERT-2 specifically designed for generating DNA embedding that naturally clusters and segregates genome of different species in the embedding space. Please check it out [here](https://github.com/Zhihan1996/DNABERT_S) if you are interested.\n\n\n\n## 1. Introduction\n\nDNABERT-2 is a foundation model trained on large-scale multi-species genome that achieves the state-of-the-art performance on $28$ tasks of the GUE benchmark. It replaces k-mer tokenization with BPE, positional embedding with Attention with Linear Bias (ALiBi), and incorporate other techniques to improve the efficiency and effectiveness of DNABERT.\n\n\n\n## 2. Model and Data\n\nThe pre-trained models is available at Huggingface as `zhihan1996/DNABERT-2-117M`. [Link to HuggingFace ModelHub](https://huggingface.co/zhihan1996/DNABERT-2-117M). [Link For Direct Downloads]().\n\n\n\n### 2.1 GUE: Genome Understanding Evaluation\n\nGUE is a comprehensive benchmark for genome understanding consising of $28$ distinct datasets across $7$ tasks and $4$ species. GUE can be download [here]([https://drive.google.com/file/d/1GRtbzTe3UXYF1oW27ASNhYX3SZ16D7N2/view?usp=sharing](https://drive.google.com/file/d/1uOrwlf07qGQuruXqGXWMpPn8avBoW7T-/view?usp=sharing)). Statistics and model performances on GUE is shown as follows:\n\n\n\n![GUE](figures/GUE.png)\n\n\n\n![Performance](figures/Performance.png)\n\n\n\n## 3. Setup environment\n\n    # create and activate virtual python environment\n    conda create -n dna python=3.8\n    conda activate dna\n    \n    # (optional if you would like to use flash attention)\n    # install triton from source\n    git clone https://github.com/openai/triton.git;\n    cd triton/python;\n    pip install cmake; # build-time dependency\n    pip install -e .\n    \n    # install required packages\n    python3 -m pip install -r requirements.txt\n\n\n\n\n\n## 4. 
## 4. Quick Start

Our model is easy to use with the [transformers](https://github.com/huggingface/transformers) package.

To load the model from Hugging Face (transformers version 4.28):
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
```

To load the model from Hugging Face (transformers version > 4.28):
```python
from transformers import AutoModel
from transformers.models.bert.configuration_bert import BertConfig

config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M")
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True, config=config)
```

To calculate the embedding of a DNA sequence:
```python
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
hidden_states = model(inputs)[0]  # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expect to be 768

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expect to be 768
```


## 5. Pre-Training

We used and slightly modified the MosaicBERT implementation for DNABERT-2: https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert . You should be able to replicate the model training by following its instructions.

Alternatively, you can use the run_mlm.py script at https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling and import the `BertModelForMaskedLM` class from https://huggingface.co/zhihan1996/DNABERT-2-117M/blob/main/bert_layers.py. It should produce a very similar model.

The training data is available [here](https://drive.google.com/file/d/1dSXJfwGpDSJ59ry9KAp8SugQLK35V83f/view?usp=sharing).


## 6. Finetune

### 6.1 Evaluate models on GUE

Please first download the GUE dataset from [here](https://drive.google.com/file/d/1uOrwlf07qGQuruXqGXWMpPn8avBoW7T-/view?usp=sharing). Then run the scripts to evaluate on all the tasks.

The current scripts are set up to use `DataParallel` for training on 4 GPUs. If you have a different number of GPUs, please change `per_device_train_batch_size` and `gradient_accumulation_steps` accordingly to keep the global batch size at 32 and replicate the results in the paper. If you would like to perform distributed multi-GPU training (e.g., with `DistributedDataParallel`), simply change `python` to `torchrun --nproc_per_node ${n_gpu}`.

```bash
export DATA_PATH=/path/to/GUE  # (e.g., /home/user)
cd finetune

# Evaluate DNABERT-2 on GUE
sh scripts/run_dnabert2.sh $DATA_PATH

# Evaluate DNABERT (e.g., DNABERT with 3-mer) on GUE
# 3 for 3-mer, 4 for 4-mer, 5 for 5-mer, 6 for 6-mer
sh scripts/run_dnabert1.sh $DATA_PATH 3

# Evaluate Nucleotide Transformers on GUE
# 0 for 500m-1000g, 1 for 500m-human-ref, 2 for 2.5b-1000g, 3 for 2.5b-multi-species
sh scripts/run_nt.sh $DATA_PATH 0
```

### 6.2 Fine-tune DNABERT-2 on your own datasets

Here we provide an example of fine-tuning DNABERT-2 on your own datasets.


#### 6.2.1 Format your dataset

First, please generate 3 `csv` files from your dataset: `train.csv`, `dev.csv`, and `test.csv`. During training, the model is trained on `train.csv` and evaluated on `dev.csv`. After training is finished, the checkpoint with the smallest loss on `dev.csv` is loaded and evaluated on `test.csv`. If you do not have a validation set, please just make `dev.csv` and `test.csv` the same.

Please see the `sample_data` folder for a sample of the data format. Each file should be in the same format, with a header row named `sequence,label`. Each following row should contain a DNA sequence and a numerical label separated by a `,` (e.g., `ACGTCAGTCAGCGTACGT,1`).
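If your data starts out as a single labeled table, producing the three files can be scripted. The following is a minimal sketch, assuming a hypothetical input file `my_sequences.csv` with `sequence` and `label` columns and an 80/10/10 split; adapt it to your own data layout.

```python
# Sketch: split one labeled table into train.csv / dev.csv / test.csv
# in the `sequence,label` format expected by finetune/train.py.
# The input file name, column names, and the 80/10/10 split are illustrative assumptions.
import pandas as pd

df = pd.read_csv("my_sequences.csv")       # hypothetical input with columns: sequence, label
df = df.sample(frac=1.0, random_state=42)  # shuffle before splitting

n = len(df)
train_end, dev_end = int(0.8 * n), int(0.9 * n)

df.iloc[:train_end][["sequence", "label"]].to_csv("train.csv", index=False)
df.iloc[train_end:dev_end][["sequence", "label"]].to_csv("dev.csv", index=False)
df.iloc[dev_end:][["sequence", "label"]].to_csv("test.csv", index=False)
```
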
Then, you can fine-tune DNABERT-2 on your own dataset with the following code:

```bash
cd finetune

export DATA_PATH=/path/to/data/folder  # e.g., ./sample_data
export MAX_LENGTH=100  # Please set this to roughly 0.25 * your sequence length,
                       # e.g., 250 if your DNA sequences have 1000 nucleotide bases.
                       # This is because the tokenizer reduces the sequence length by about 5 times.
export LR=3e-5

# Training with DataParallel
python train.py \
    --model_name_or_path zhihan1996/DNABERT-2-117M \
    --data_path ${DATA_PATH} \
    --kmer -1 \
    --run_name DNABERT2_${DATA_PATH} \
    --model_max_length ${MAX_LENGTH} \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate ${LR} \
    --num_train_epochs 5 \
    --fp16 \
    --save_steps 200 \
    --output_dir output/dnabert2 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --warmup_steps 50 \
    --logging_steps 100 \
    --overwrite_output_dir True \
    --log_level info \
    --find_unused_parameters False

# Training with DistributedDataParallel (more efficient)
export num_gpu=4  # please change the value based on your setup

torchrun --nproc_per_node=${num_gpu} train.py \
    --model_name_or_path zhihan1996/DNABERT-2-117M \
    --data_path ${DATA_PATH} \
    --kmer -1 \
    --run_name DNABERT2_${DATA_PATH} \
    --model_max_length ${MAX_LENGTH} \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate ${LR} \
    --num_train_epochs 5 \
    --fp16 \
    --save_steps 200 \
    --output_dir output/dnabert2 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --warmup_steps 50 \
    --logging_steps 100 \
    --overwrite_output_dir True \
    --log_level info \
    --find_unused_parameters False
```
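Once training completes, checkpoints are written under the `--output_dir` above and can be used for prediction. The sketch below is a minimal example, assuming the saved checkpoint loads through the standard `AutoModelForSequenceClassification` class with `trust_remote_code=True`; the checkpoint path and example sequence are illustrative, so please verify them for your own run.

```python
# Sketch: run a prediction with a fine-tuned checkpoint.
# Assumes the checkpoint saved by train.py loads via AutoModelForSequenceClassification
# with trust_remote_code=True; the checkpoint path and example sequence are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "output/dnabert2"  # path passed to --output_dir above
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, trust_remote_code=True)
model.eval()

dna = "ACGTCAGTCAGCGTACGT"
inputs = tokenizer(dna, return_tensors="pt")

with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

probs = torch.softmax(outputs.logits, dim=-1)
print("predicted label:", int(torch.argmax(probs, dim=-1)))
```
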
## 7. Citation

If you have any questions regarding our paper or code, please feel free to open an issue or email Zhihan Zhou (zhihanzhou2020@u.northwestern.edu).

If you use DNABERT-2 in your work, please kindly cite our papers:

**DNABERT-2**

```
@misc{zhou2023dnabert2,
      title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
      author={Zhihan Zhou and Yanrong Ji and Weijian Li and Pratik Dutta and Ramana Davuluri and Han Liu},
      year={2023},
      eprint={2306.15006},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN}
}
```

**DNABERT**

```
@article{ji2021dnabert,
    author = {Ji, Yanrong and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V},
    title = "{DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome}",
    journal = {Bioinformatics},
    volume = {37},
    number = {15},
    pages = {2112-2120},
    year = {2021},
    month = {02},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btab083},
    url = {https://doi.org/10.1093/bioinformatics/btab083},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/37/15/2112/50578892/btab083.pdf},
}
```