{"id":22668270,"url":"https://github.com/lanl/epbd-bert","last_synced_at":"2025-04-12T11:04:04.325Z","repository":{"id":238581022,"uuid":"795642956","full_name":"lanl/EPBD-BERT","owner":"lanl","description":"Transcription factor binding site prediction for novel DNA sequence data aiding in mutation identification and drug discovery","archived":false,"fork":false,"pushed_at":"2024-08-12T18:30:24.000Z","size":4376,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-08-12T21:39:38.740Z","etag":null,"topics":["cross-attention","dnabert-model","epbd","multi-modal","transformers-bert"],"latest_commit_sha":null,"homepage":"https://lanl.github.io/EPBD-BERT/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lanl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-03T18:02:57.000Z","updated_at":"2024-08-12T18:30:27.000Z","dependencies_parsed_at":"2024-05-06T20:47:15.610Z","dependency_job_id":"ab8d408a-9de1-44dd-8a79-bf9853e28559","html_url":"https://github.com/lanl/EPBD-BERT","commit_stats":null,"previous_names":["lanl/epbd-bert"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanl%2FEPBD-BERT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanl%2FEPBD-BERT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanl%2FEPBD-BERT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/lanl%2FEPBD-BERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lanl","download_url":"https://codeload.github.com/lanl/EPBD-BERT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228911888,"owners_count":17990774,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cross-attention","dnabert-model","epbd","multi-modal","transformers-bert"],"created_at":"2024-12-09T15:14:31.073Z","updated_at":"2024-12-09T15:14:31.704Z","avatar_url":"https://github.com/lanl.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Welcome to EPBD-BERT's Documentation!\n\nThis repository corresponds to the article titled **\"Advancing Transcription Factor Binding Site Prediction Using DNA Breathing Dynamics and Sequence Transformers via Cross Attention\"**.\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11130474.svg)](https://doi.org/10.5281/zenodo.11130474)\n(https://www.biorxiv.org/content/10.1101/2024.01.16.575935v2)\n\n![EPBDxBERT Framework](plots/EPBD_Arch.jpg)\n*Figure 1: Overview of the proposed EPBDxBERT framework*\n\nUnderstanding the impact of genomic variants on transcription factor binding and gene regulation remains a key area of\nresearch, with implications for unraveling the complex mechanisms underlying various functional effects. This software\nframework delves into the role of DNA's biophysical properties, including thermodynamic stability, shape, and flexibility in\ntranscription factor (TF) binding. 
In this library, we have developed a multi-modal deep learning model integrating these\nproperties with DNA sequence data. Trained on in-vivo ChIP-Seq (chromatin immunoprecipitation sequencing) data\ninvolving 690 TF-DNA binding events in the human genome, our model significantly improves prediction performance in over\n660 binding events, with up to a 9.6% increase in the AUROC metric compared to the baseline model that uses no DNA\nbiophysical properties explicitly. Further, we expanded our analysis to an in-vitro high-throughput Systematic Evolution\nof Ligands by Exponential enrichment (SELEX) dataset, comparing our model with\nestablished frameworks. The inclusion of EPBD features consistently improved TF binding predictions across different cell\nlines in these datasets. Notably, for complex ChIP-Seq datasets, integrating DNABERT2 with a cross-attention mechanism\nprovided greater predictive capabilities and insights into the mechanisms of disease-related non-coding variants found in\ngenome-wide association studies. 
This work highlights the importance of DNA biophysical characteristics in TF binding\nand the effectiveness of multi-modal deep learning models in gene regulation studies.\n\n## Resources\n\n- [Paper](https://www.biorxiv.org/content/10.1101/2024.01.16.575935v2.abstract)\n- [Code](https://github.com/lanl/EPBD-BERT)\n- [Documentation](https://lanl.github.io/EPBD-BERT/)\n- [Analysis Notebooks](https://github.com/lanl/EPBD-BERT/tree/main/analysis)\n\n## Installation\n```bash\n# Installation of virtual environment\ngit clone https://github.com/lanl/EPBD-BERT.git\ncd EPBD-BERT\nconda create -c conda-forge -p .venvs/epbd_bert_condavenv_test1 python=3.11 -y\nconda activate .venvs/epbd_bert_condavenv_test1\npython setup.py install\n\nconda install -c conda-forge scikit-learn scipy -y\npip uninstall triton # We did not use triton because of its dependency on the underlying hardware\n\n# To deactivate and remove the venv (use the remove command that matches how it was created)\nconda deactivate\nconda remove -p .venvs/epbd_bert_condavenv_test1 --all -y  # environment created with -p, as above\nconda remove --name epbd_bert_condavenv_test1 --all -y  # environment created with --name instead\n\n```\n\u003c!--- \nconda create -c conda-forge -p .venvs/epbd_bert_condavenv_test1 python=3.11 -y\nconda activate .venvs/epbd_bert_condavenv_test1\n\nor\n\nconda create -c conda-forge --name epbd_bert_condavenv_test1 python=3.11 -y\nconda activate epbd_bert_condavenv_test1\n\n# The other libraries to analyze the DNA breathing dynamics can be installed using the following command:\nconda install -c conda-forge scikit-learn scipy pandas matplotlib seaborn jupyterlab -y\n---\u003e\n\n\n## Data Preprocessing Steps\nThe 'data_preprocessing' directory holds all the data generation steps and is divided into modules for data generation and bug tracking. We used the '[bedtools](https://bedtools.readthedocs.io/en/latest/)' software for genome operations. Follow the [bedtools installation guide](https://bedtools.readthedocs.io/en/latest/content/installation.html). 
We also provide a minimal script that downloads the pre-compiled binary of the software into the *bedtools* directory:\n\n```bash\nbash setup_bedtools.sh\nexport PATH=$PATH:$(pwd)/bedtools\n```\n\n| Step  | Scripts |\n| :--- | :--- |\n| Download human genome assembly (GRCh37/hg19) and [uniform TFBS](https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=2215774794_SHfvFO0XVRMcn6xaqOTugAa1Faf1\u0026c=chr1\u0026g=wgEncodeAwgTfbsUniform)  | ```0_download_data.py```  |\n| Preprocess TFBS narrowpeak files and human genome | ```1_preprocess_narrowPeaks_and_humanGenome.sh``` |\n| Overlapping computation for label association | ```2.1_compute_overlappings_job.sh```\u003cbr /\u003e ```2.2_compute_overlappings.sh```\u003cbr /\u003e ```3_postprocess.sh``` |\n| Label association | ```5.1_extract_bins_containingOtherThanACGT.ipynb```\u003cbr /\u003e ```5.2_compute_peaks_with_labels_clean.sh```|\n| Data preprocessing for DNA breathing dynamics generation and DNABERT2 | ```6.1_create_data_for_pydnaepbd.ipynb```\u003cbr /\u003e  ```6.2_create_data_for_dnabert2.ipynb``` |\n| Train/validation/test split| ```7_create_train_val_test_set.ipynb``` |\n| Associating numeric values for each label | ```8_create_labels_dict.ipynb``` |\n| Further processing on negative regions | ```9.1_generic_neg_regions.sh```\u003cbr /\u003e ```9.2_neg_regions_otherThanACGT.ipynb```\u003cbr /\u003e ```9.3_clean_generic_neg_regions.sh```\u003cbr /\u003e ```9.4_clean_generic_neg_seqs.ipynb``` |\n\n\n## Preprocessed dataset loading\nThe preprocessed dataset can be downloaded from here (will be provided).\n\n| Dataset Module  | Usage |\n| :--- | :--- |\n| ```epbd_bert.datasets.sequence_dataset```  | Loads the sequence-only dataset  |\n| ```epbd_bert.datasets.sequence_epbd_dataset``` | Loads sequence and EPBD (flat) features |\n| ```epbd_bert.datasets.sequence_epbd_multimodal_dataset``` | Loads sequence and EPBD (matrix) features |\n\nNote: There are some other dataset modules. 
Each module provides example running instructions at the bottom.\n\n\n## Training and testing the developed models\n\n\n| Model Module | Usage |\n| :--- | :--- |\n| DNABERT2-finetuned | |\n| ```epbd_bert.dnabert2_classifier.train_lightning``` | Train DNABERT2 using train/validation split |\n| ```epbd_bert.dnabert2_classifier.test``` | Test finetuned DNABERT2 on test split |\n| VanillaEPBD-DNABERT2-coordflip | |\n| ```epbd_bert.dnabert2_epbd.train_lightning``` | Train VanillaEPBD-DNABERT2 using train/validation split |\n| ```epbd_bert.dnabert2_epbd.test``` | Test VanillaEPBD-DNABERT2 on test split |\n| EPBDxDNABERT-2 | |\n| ```epbd_bert.dnabert2_epbd_crossattn.train_lightning``` | Train EPBDxDNABERT-2 using train/validation split |\n| ```epbd_bert.dnabert2_epbd_crossattn.test``` | Test EPBDxDNABERT-2 on test split  |\n\nNote: Details of each model, along with the ablation studies, can be found in the [Paper](https://www.biorxiv.org/content/10.1101/2024.01.16.575935v2.abstract). To run a training or testing module, invoke it with `python -m`, for example: ```python -m epbd_bert.dnabert2_classifier.test```.\n\n\n## Authors\n\n* [Anowarul Kabir](mailto:akabir4@gmu.edu) - Computer Science, George Mason University\n* [Manish Bhattarai](mailto:ceodspspectrum@lanl.gov) - Theoretical Division, Los Alamos National Laboratory\n* [Kim Rasmussen](mailto:kor@lanl.gov) - Theoretical Division, Los Alamos National Laboratory\n* [Amarda Shehu](mailto:ashehu@gmu.edu) - Computer Science, George Mason University\n* [Anny Usheva](mailto:Anny\\_Usheva@brown.edu) - Surgery, Rhode Island Hospital and Brown University\n* [Alan Bishop](mailto:arb@lanl.gov) - Theoretical Division, Los Alamos National Laboratory\n* [Boian S. 
Alexandrov](mailto:boian@lanl.gov) - Theoretical Division, Los Alamos National Laboratory\n\n## How to cite EPBD-BERT?\n```latex\n@article{kabir2024advancing,\n  title={Advancing Transcription Factor Binding Site Prediction Using DNA Breathing Dynamics and Sequence Transformers via Cross Attention},\n  author={Kabir, Anowarul and Bhattarai, Manish and Rasmussen, Kim {\\O} and Shehu, Amarda and Bishop, Alan R and Alexandrov, Boian and Usheva, Anny},\n  journal={bioRxiv},\n  pages={2024--01},\n  year={2024},\n  publisher={Cold Spring Harbor Laboratory}\n}\n```\n\n## Acknowledgements\n\nLos Alamos National Lab (LANL), T-1\n\n## Copyright notice\n\n© 2024. Triad National Security, LLC. All rights reserved.\nThis program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare 
derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.\n\nLANL O#4717\n\n## License\n\nThis program is Open-Source under the BSD-3 License.\n \n * Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:\n \n* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.\n \n* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.\n \nNeither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flanl%2Fepbd-bert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flanl%2Fepbd-bert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flanl%2Fepbd-bert/lists"}