{"id":14286918,"url":"https://github.com/MAGICS-LAB/DNABERT_S","last_synced_at":"2025-08-15T07:31:33.090Z","repository":{"id":222591508,"uuid":"757836528","full_name":"MAGICS-LAB/DNABERT_S","owner":"MAGICS-LAB","description":"DNABERT_S: Learning Species-Aware DNA Embedding with Genome Foundation Models","archived":false,"fork":false,"pushed_at":"2024-05-13T20:30:13.000Z","size":2853,"stargazers_count":54,"open_issues_count":0,"forks_count":12,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-08-24T17:26:32.931Z","etag":null,"topics":["dna","dna-embedding","embedding"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MAGICS-LAB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-15T04:47:44.000Z","updated_at":"2024-08-16T07:21:24.000Z","dependencies_parsed_at":"2024-03-06T07:26:15.554Z","dependency_job_id":"81356af1-90af-4bcd-b896-8e9009bee5d5","html_url":"https://github.com/MAGICS-LAB/DNABERT_S","commit_stats":null,"previous_names":["zhihan1996/dnabert_s","magics-lab/dnabert_s"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MAGICS-LAB%2FDNABERT_S","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MAGICS-LAB%2FDNABERT_S/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MAGICS-LAB%2FDNABERT_S/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MAGICS-LAB%2FDNABERT_S/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MAGICS-LAB","download_url":"https://codeload.github.com/MAGICS-LAB/DNABERT_S/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229899536,"owners_count":18141525,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dna","dna-embedding","embedding"],"created_at":"2024-08-23T17:01:07.560Z","updated_at":"2025-08-15T07:31:33.074Z","avatar_url":"https://github.com/MAGICS-LAB.png","language":"Python","readme":"# [DNABERT_S: Learning Species-Aware DNA Embedding with Genome Foundation Models](https://arxiv.org/abs/2402.08777)\n\n\n\nThis Repo is the official implementatation of [DNABERT_S: Learning Species-Aware DNA Embedding with Genome Foundation Models](https://arxiv.org/abs/2402.08777).\n\n\n\n## Contents\n\n- [1. Introduction](#1-introduction)\n- [2. Model and Data](#2-model-and-data)\n- [3. Setup Environment](#3-setup-environment)\n- [4. Quick Start](#4-quick-start)\n- [5. Training](#5-training)\n- [6. Evaluation](#6-evaluation)\n- [7. Citation](#7-citation)\n\n\n\n## 1. 
## 1. Introduction

![embedding_visualization](./figures/embedding_visualization.png)

DNABERT-S is a foundation model based on [DNABERT-2](https://github.com/Zhihan1996/DNABERT_2) specifically designed for generating DNA embeddings that naturally cluster and segregate the genomes of different species in the embedding space. This can greatly benefit a wide range of genome applications, including species classification/identification, metagenomics binning, and understanding evolutionary relationships.

Results on species clustering:

![clustering_results](./figures/clustering_results.png)

## 2. Model and Data

### 2.1 Model

The pre-trained model is available on Hugging Face as `zhihan1996/DNABERT-S`.

To download the model from the command line:

```
# command line
gdown 1ejNOMXdycorDzphLT6jnfGIPUxi6fO0g # pip install gdown
unzip DNABERT-S.zip  # unzip the model
```

### 2.2 Data

The training data of DNABERT-S is available at

```
gdown 1p59ch_MO-9DXh3LUIvorllPJGLEAwsUp # pip install gdown
unzip dnabert-s_train.zip  # unzip the data
```

The evaluation data is available at

```
gdown 1I44T2alXrtXPZrhkuca6QP3tFHxDW98c # pip install gdown
unzip dnabert-s_eval.zip  # unzip the data
```

## 3. Setup Environment

```
conda create -n DNABERT_S python=3.9
conda activate DNABERT_S
```

```
pip install -r requirements.txt
pip uninstall triton # triton can lead to errors on GPUs other than the A100
```

## 4. Quick Start

Our model is easy to use with the [transformers](https://github.com/huggingface/transformers) package.

To load the model from Hugging Face:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-S", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-S", trust_remote_code=True)
```

To calculate the embedding of a DNA sequence:

```python
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
hidden_states = model(inputs)[0]  # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expect torch.Size([768])
```
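Building on the snippet above, here is a minimal sketch of embedding several sequences at once. It assumes the tokenizer pads like a standard Hugging Face tokenizer and that the model's forward accepts an `attention_mask`, like a standard BERT encoder; the mask-aware mean pooling is our own illustration, not part of the released codebase.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-S", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-S", trust_remote_code=True)

sequences = [
    "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC",
    "TTGGCAACGATCTAGCTAGCATCGATCAGCTACGATCG",
]

# pad to the longest sequence so the batch forms one tensor
batch = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    hidden_states = model(batch["input_ids"], attention_mask=batch["attention_mask"])[0]

# mean-pool over real tokens only, so padding does not dilute the embedding
mask = batch["attention_mask"].unsqueeze(-1).float()  # [batch, seq_len, 1]
embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # expect torch.Size([2, 768])
```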
## 5. Training

Our codebase expects pairs of DNA sequences for pre-training. The training data should be a csv file with two columns and no header, where each row contains one pair of DNA sequences that you want the model to generate similar embeddings for. See data/debug_train.csv for an example, and the sketch at the end of this section for how to build such a file.

Important arguments:

- resdir: directory to save model parameters
- datapath: directory of the data
- train_dataname: the name of the training data file (e.g., "a.csv")
- val_dataname: the name of the validation data file (e.g., "a.csv")
- max_length: set it to 0.2 * DNA_length (e.g., 200 for 1000-bp DNA)
- train_batch_size: batch size for training data; change it to fit your GPU RAM
- con_method: contrastive learning method, one of "same_species", "dropout", "double_strand", "mutate"
- mix: whether to use the i-Mix method
- mix_layer_num: which layer to perform i-Mix at; a value of -1 means manifold i-Mix
- curriculum: whether to use curriculum learning
- Other arguments can also be adjusted.

For our curriculum contrastive learning method, you can use:

```
cd pretrain
export PATH_TO_DATA_DICT=/path/to/data
export TRAIN_FILE=debug_train.csv # use this for debugging; for real training, please use train_2m.csv

python main.py \
    --resdir ./results/ \
    --datapath ${PATH_TO_DATA_DICT} \
    --train_dataname ${TRAIN_FILE} \
    --val_dataname val_48k.csv \
    --seed 1 \
    --logging_step 10000 \
    --logging_num 12 \
    --max_length 2000 \
    --train_batch_size 48 \
    --val_batch_size 360 \
    --lr 3e-06 \
    --lr_scale 100 \
    --epochs 3 \
    --feat_dim 128 \
    --temperature 0.05 \
    --con_method same_species \
    --mix \
    --mix_alpha 1.0 \
    --mix_layer_num -1 \
    --curriculum
```

This training script expects 8 A100 80GB GPUs. If you are using other types of devices, please change train_batch_size and max_length accordingly.

After model training, you will find the trained model at ./pretrain/results/$file_name. The file_name is automatically set based on the hyperparameters, and the code regularly saves checkpoints. It should look something like ./results/contrastive.HardNeg.epoch2.debug_train.csv.lr3e-06.lrscale100.bs48.maxlength200.tmp0.05.decay1.seed1.turn1/100

The best model after validation is saved in ./pretrain/results/$file_name/best/

Scripts for other experiments are all in ./pretrain/results
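As referenced above, the training file is just a headerless, two-column csv of sequence pairs. Here is a minimal sketch of writing one with Python's csv module; the sequences and file name are illustrative only.

```python
import csv

# each row is one pair of sequences that should receive similar embeddings,
# e.g., two fragments drawn from the same genome (illustrative data)
pairs = [
    ("ACGTAGCATCGGATCTATCTATCGACACTTGG", "TTATCGATCTACGAGCATCTCGTTAGC"),
    ("GGCTAGCTAGCATCGATCAGCTACGATCGATC", "CATCGGATCTATCTATCGACACTTGGTTATCG"),
]

# two columns, no header, matching the format of data/debug_train.csv
with open("my_train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(pairs)
```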
## 6. Evaluation

### 6.1 Prepare Model

```
cd evaluate
```

#### 6.1.1 Test pre-trained DNABERT-S

```
gdown 1ejNOMXdycorDzphLT6jnfGIPUxi6fO0g
unzip DNABERT-S.zip
export MODEL_DIR=/path/to/DNABERT-S # (e.g., /root/Downloads/DNABERT-S)
```

#### 6.1.2 Test your own model trained with our codebase

Copy the necessary files to the folder where the model is saved. This works around a bug in the Hugging Face Transformers package: sometimes model files such as `bert_layer.py` are not automatically saved to the model directory together with the model weights, so we copy them manually.

```
export MODEL_DIR=/path/to/the/trained/model # (e.g., /root/ICML2024/train/pretrain/results/epoch3.debug_train.csv.lr3e-06.lrscale100.bs24.maxlength2000.tmp0.05.seed1.con_methodsame_species.mixTrue.mix_layer_num-1.curriculumTrue/0)

cp model_codes/* ${MODEL_DIR}
```

### 6.2 Clustering and Classification

```
export DATA_DIR=/path/to/the/unzipped/folders

# evaluate the trained model
python eval_clustering_classification.py --test_model_dir ${MODEL_DIR} --data_dir ${DATA_DIR} --model_list "test"

# evaluate baselines (e.g., TNF and DNABERT-2)
python eval_clustering_classification.py --data_dir ${DATA_DIR} --model_list "tnf, dnabert2"
```

### 6.3 Metagenomics Binning

```
export DATA_DIR=/path/to/the/unzipped/folders
export MODEL_DIR=/path/to/the/trained/model

# evaluate the trained model
python eval_binning.py --test_model_dir ${MODEL_DIR} --data_dir ${DATA_DIR} --model_list "test"

# evaluate baselines (e.g., TNF and DNABERT-2)
python eval_binning.py --data_dir ${DATA_DIR} --model_list "tnf, dnabert2"
```

## 7. Citation

If you have any questions regarding our paper or code, please feel free to open an issue or email Zhihan Zhou (zhihanzhou2020@u.northwestern.edu).

If you use DNABERT-S in your work, please consider citing our papers:

**DNABERT-S**

```
@misc{zhou2024dnaberts,
      title={DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models}, 
      author={Zhihan Zhou and Winmin Wu and Harrison Ho and Jiayi Wang and Lizhen Shi and Ramana V Davuluri and Zhong Wang and Han Liu},
      year={2024},
      eprint={2402.08777},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN}
}
```

**DNABERT-2**

```
@misc{zhou2023dnabert2,
      title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome}, 
      author={Zhihan Zhou and Yanrong Ji and Weijian Li and Pratik Dutta and Ramana Davuluri and Han Liu},
      year={2023},
      eprint={2306.15006},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN}
}
```

**DNABERT**

```
@article{ji2021dnabert,
    author = {Ji, Yanrong and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V},
    title = "{DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome}",
    journal = {Bioinformatics},
    volume = {37},
    number = {15},
    pages = {2112-2120},
    year = {2021},
    month = {02},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btab083},
    url = {https://doi.org/10.1093/bioinformatics/btab083}
}
```