https://github.com/lanl/pydna_epbd

pyDNA-EPBD: A Python-based Implementation of the Extended Peyrard-Bishop-Dauxois Model for DNA Breathing Dynamics Simulation
Host: GitHub
URL: https://github.com/lanl/pydna_epbd
Owner: lanl
License: bsd-3-clause
Created: 2023-07-14T18:20:24.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-11-24T03:06:15.000Z (11 months ago)
Last Synced: 2024-11-24T04:17:58.695Z (11 months ago)
Topics: dna, dna-breathing, flipping-characteristics, genomics, mcmc, metropolis-hastings
Language: Jupyter Notebook
Homepage: https://lanl.github.io/pyDNA_EPBD/
Size: 43.7 MB
Stars: 4
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE

.. pyDNA-EPBD documentation master file, created by
   sphinx-quickstart on Mon Jul 31 12:21:40 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to pyDNA-EPBD's documentation!
======================================

This repository accompanies the article **pyDNA-EPBD: A Python-based Implementation of the Extended Peyrard-Bishop-Dauxois Model for DNA Breathing Dynamics Simulation**.

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.8222805.svg
   :target: https://doi.org/10.5281/zenodo.8222805

.. figure:: plots/mcmc_algorithm.png
    :width: 50%
    :align: center

    Figure 1: Overview of the pyDNA-EPBD implementation.

The dynamic behavior of DNA sequences, including local transient openings or *breathing* and *flipping*, is crucial in a wide range of biological processes and genomic disorders. However, accurate modeling and simulation of these phenomena, particularly for homogeneous and periodic DNA sequences, have remained a challenge due to the complex interplay of factors such as hydrogen bonding, electrostatic interactions, and base stacking.

To address this, we have developed **pyDNA-EPBD**, a Python-based software tool that implements the extended Peyrard–Bishop–Dauxois (EPBD) model. The extension adds a sequence-dependent stacking term, enabling a more precise description of DNA melting behavior for homogeneous and periodic sequences. Using a Markov Chain Monte Carlo (MCMC) approach, pyDNA-EPBD simulates DNA dynamics and generates data on DNA breathing characteristics such as bubbles, coordinates, and flipping.

Resources
========================================

* `Paper `_

* `Code `_

* `Documentation `_

* `Analysis Notebooks `_

* `Utility of ML models `_

* `Example Runs `_ 

Installation
========================================

.. code-block:: shell

      git clone https://github.com/lanl/pyDNA_EPBD.git
      cd pyDNA_EPBD
      conda create -c conda-forge --name pydnaepbd_pypy_conda pypy -y
      conda activate pydnaepbd_pypy_conda
      python setup.py install

      # Run your first pyDNA-EPBD simulation.
      # This generates the P5 wild-type and mutant breathing dynamics in the "outputs" directory.
      python -m pydna_epbd.run --config_filepath examples/p5/configs.txt

      # The other libraries used to analyze the DNA breathing dynamics can be installed with:
      conda install -c conda-forge scikit-learn scipy pandas matplotlib seaborn jupyterlab -y

      # To deactivate and remove the conda environment
      conda deactivate
      conda remove --name pydnaepbd_pypy_conda --all -y

Prerequisites
========================================

To run the simulation:

   * argparse>=1.4.0

   * joblib>=1.3.0

   * numpy>=1.25.1

To analyze the DNA breathing dynamics (BD):

   * scikit-learn>=1.3.0

   * scipy>=1.11.1

   * pandas>=2.0.3

   * matplotlib>=3.7.2

   * seaborn>=0.12.2

Configuration file structure
========================================================

The simulation requires the path to a configuration file. The structure of a configuration file is as follows:

.. list-table::
   :widths: 20 10 70
   :header-rows: 1

   * - Keys
     - Options
     - Comments
   * - IsFirstColumnId
     - Yes/No
     - Whether the first column in the sequence file is the sequence ID.
   * - SaveFull
     - Yes/No
     - Whether to save the full simulation outputs. `No` is space efficient.
   * - SaveRuntime
     - Yes/No
     - Whether to save the runtime for each DNA sequence.
   * - SequencesDir
     - examples/p5/p5_seqs/
     - Directory that contains the sequence file(s).
   * - OutputsDir
     - outputs/
     - Directory where pyDNA-EPBD saves outputs.
   * - Flanks
     - None
     - The flanks (a 'GC'-like sequence) are prepended and appended to every input DNA sequence. 'None' adds no flanks.
   * - Temperature
     - 310
     - The simulation temperature in Kelvin.
   * - PreheatingSteps
     - 50000
     - The number of preheating steps.
   * - PostPreheatingSteps
     - 80000
     - The number of post-preheating steps. Usually, the monitors record observations during the post-preheating steps.
   * - ComputingNodes
     - 1
     - Number of computing nodes available to run the simulation. This parameter is only used when running the simulation with a SLURM script.
   * - BubbleMonitor
     - On/Off
     - Whether to record DNA bubble information.
   * - CoordinateMonitor
     - On/Off
     - Whether to record coordinate information.
   * - FlippingMonitorVerbose
     - On/Off
     - Whether to record flipping information for five different thresholds.
   * - FlippingMonitor
     - On/Off
     - Whether to record flipping information for one threshold.
   * - EnergyMonitor
     - On/Off
     - Whether to record energy information.
   * - MeltingAndFractionMonitor
     - On/Off
     - Whether to record melting and fraction information for one threshold.
   * - MeltingAndFractionManyMonitor
     - On/Off
     - Whether to record melting and fraction information for 20 thresholds at 100 evenly spaced time steps.
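
To make the roles of *Temperature*, *PreheatingSteps*, and *PostPreheatingSteps* concrete, here is a minimal sketch of a Metropolis sampler with a preheating phase. It is a toy under stated assumptions, not the pyDNA-EPBD implementation: it samples a single displacement in a one-dimensional Morse potential, and the function names, the potential parameters (``depth``, ``width``), and the proposal ``step_size`` are hypothetical values chosen only for illustration.

.. code-block:: python

      import math
      import random

      K_B = 8.617333262e-5  # Boltzmann constant in eV/K


      def morse(y, depth=0.05, width=4.0):
          """Toy Morse potential; depth and width are illustrative, not EPBD parameters."""
          return depth * (math.exp(-width * y) - 1.0) ** 2


      def metropolis(temperature, preheating_steps, post_preheating_steps,
                     step_size=0.1, seed=0):
          """Return the displacements recorded after the preheating phase."""
          rng = random.Random(seed)
          beta = 1.0 / (K_B * temperature)
          y, energy = 0.0, morse(0.0)
          recorded = []
          for step in range(preheating_steps + post_preheating_steps):
              # Symmetric proposal: perturb the current displacement uniformly.
              y_new = y + rng.uniform(-step_size, step_size)
              e_new = morse(y_new)
              delta = e_new - energy
              # Metropolis rule: always accept downhill moves, accept uphill
              # moves with probability exp(-beta * delta).
              if delta <= 0 or rng.random() < math.exp(-beta * delta):
                  y, energy = y_new, e_new
              # Observations are kept only once preheating is over.
              if step >= preheating_steps:
                  recorded.append(y)
          return recorded


      # Small step counts keep the toy fast; the table above lists 50000 and 80000.
      samples = metropolis(temperature=310, preheating_steps=500,
                           post_preheating_steps=800)
      print(len(samples), sum(samples) / len(samples))

Only the post-preheating states are returned, which mirrors the note above that monitors usually record observations during the post-preheating steps.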

Example Configurations and P5 DNA sequences
==============================================

The `example simulation run `_ uses the following configuration file (`examples/p5/configs.txt `_):

.. code-block:: console

      IsFirstColumnId = Yes

      SaveFull = No

      SaveRuntime = No

      SequencesDir = examples/p5/p5_seqs/

      OutputsDir = outputs/

      Flanks = None

      Temperature = 310

      Iterations = 100

      PreheatingSteps = 50000

      PostPreheatingSteps = 80000

      ComputingNodes = 1

      BubbleMonitor = On

      CoordinateMonitor = On

      FlippingMonitorVerbose = On

      FlippingMonitor = Off

      EnergyMonitor = Off

      MeltingAndFractionMonitor = Off

      MeltingAndFractionManyMonitor = Off
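
As an illustration of the ``Key = Value`` layout shown above, the following sketch parses such text into a dictionary. It is not the package's own reader (the package provides ``read_configurations``, used later in this README); ``parse_config_text`` and ``as_bool`` are hypothetical helper names, and treating ``Yes``/``On`` as true is an assumption based on the options listed in the table.

.. code-block:: python

      def parse_config_text(text):
          """Parse 'Key = Value' lines into a dict of strings (illustrative only)."""
          config = {}
          for raw_line in text.splitlines():
              line = raw_line.strip()
              if not line:
                  continue  # skip blank lines
              key, sep, value = line.partition("=")
              if not sep:
                  raise ValueError(f"expected 'Key = Value', got: {line!r}")
              config[key.strip()] = value.strip()
          return config


      def as_bool(value):
          """Interpret the Yes/No and On/Off options as booleans (assumed mapping)."""
          return value.strip().lower() in {"yes", "on"}


      example = """
      IsFirstColumnId = Yes
      SequencesDir = examples/p5/p5_seqs/
      Temperature = 310
      PreheatingSteps = 50000
      BubbleMonitor = On
      FlippingMonitor = Off
      """

      config = parse_config_text(example)
      print(config["SequencesDir"], float(config["Temperature"]))
      print(as_bool(config["BubbleMonitor"]), as_bool(config["FlippingMonitor"]))

Values are kept as strings and converted at the point of use, so a malformed number is reported where it is needed instead of while the file is being read.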

The input P5 DNA sequences (`examples/p5/p5_seqs/p5_wt_mt.txt `_) are:

.. code-block:: console

      P5_wt GCGCGTGGCCATTTAGGGTATATATGGCCGAGTGAGCGAGCAGGATCTCCATTTTGACCGCGAAATTTGAACGGCGC

      P5_mt GCGCGTGGCCATTTAGGGTATATATGGCCGAGTGAGCGAGCAGGATCTCCGCTTTGACCGCGAAATTTGAACGGCGC
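
The two P5 sequences differ at only a few positions. The sketch below reads records in the layout shown above and lists the differing positions. It assumes one whitespace-separated record per line with the ID in the first column (the *IsFirstColumnId = Yes* case); ``parse_sequence_lines`` and ``point_differences`` are hypothetical helpers, not part of pyDNA-EPBD.

.. code-block:: python

      def parse_sequence_lines(lines, is_first_column_id=True):
          """Return (sequence_id, sequence) pairs from whitespace-separated lines."""
          records = []
          for index, line in enumerate(lines):
              fields = line.split()
              if not fields:
                  continue  # skip blank lines
              if is_first_column_id:
                  seq_id, seq = fields[0], fields[1]
              else:
                  # No ID column: fall back to a generated identifier.
                  seq_id, seq = f"seq_{index}", fields[0]
              records.append((seq_id, seq.upper()))
          return records


      def point_differences(seq_a, seq_b):
          """List (0-based position, base_a, base_b) where two equal-length sequences differ."""
          if len(seq_a) != len(seq_b):
              raise ValueError("sequences must have the same length")
          return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


      lines = [
          "P5_wt GCGCGTGGCCATTTAGGGTATATATGGCCGAGTGAGCGAGCAGGATCTCCATTTTGACCGCGAAATTTGAACGGCGC",
          "P5_mt GCGCGTGGCCATTTAGGGTATATATGGCCGAGTGAGCGAGCAGGATCTCCGCTTTGACCGCGAAATTTGAACGGCGC",
      ]
      records = dict(parse_sequence_lines(lines))
      diffs = point_differences(records["P5_wt"], records["P5_mt"])
      print(len(records["P5_wt"]), diffs)

Positions are 0-based; with flanks enabled in the configuration, the indices within the simulated sequence would shift by the flank length.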

Example Usage
========================================

`Here `_ we provide the full documentation of the modules and packages. 

This section describes three straightforward options for running the MCMC simulation directly on DNA sequences.

**Option 1 - Using python script:**

This option uses a single computing node.

.. code-block:: console

      python -m pydna_epbd.run --config_filepath examples/p5/configs.txt

**Option 2 - Using multiple computing nodes (SLURM):**

To use multiple nodes, we suggest defining the *--array* variable in a SLURM script:

.. code-block:: console

      #SBATCH --array=0-5 # e.g., if six nodes are available

Then, the *ComputingNodes* variable in the configuration file should be set to the total number of nodes to use. For the above case:

.. code-block:: console

      ComputingNodes = 6

The input DNA sequences are then divided into six chunks that run independently on six computing nodes.

An example SLURM script for P5 is given `here `_.
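
The split into chunks can be sketched as follows. This uses the same ceiling-based chunk size as the example script in Option 3, with one addition that is an assumption of this sketch: a job index beyond the last chunk receives an empty list. That case can arise when the number of sequences does not fill every node. ``split_into_chunks`` and ``chunk_for_job`` are hypothetical names.

.. code-block:: python

      import math


      def split_into_chunks(items, n_nodes):
          """Split items into consecutive chunks of size ceil(len(items) / n_nodes)."""
          if not items:
              return []
          chunk_size = math.ceil(len(items) / n_nodes)
          return [items[i : i + chunk_size] for i in range(0, len(items), chunk_size)]


      def chunk_for_job(items, n_nodes, job_idx):
          """Return the chunk for one array task, or an empty list if none is left."""
          chunks = split_into_chunks(items, n_nodes)
          return chunks[job_idx] if job_idx < len(chunks) else []


      sequences = [f"seq_{i}" for i in range(10)]  # placeholder sequence IDs
      for job_idx in range(6):
          print(job_idx, chunk_for_job(sequences, n_nodes=6, job_idx=job_idx))

With ten sequences and six nodes the chunk size is two, so only five chunks exist and the sixth array task has nothing to run.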

**Option 3 - Defining your own Python script:**

You can also write your own Python script to run the simulation. An example script is given below:

.. code-block:: python

      

      import os

      import math

      from pydna_epbd.input_reader import read_configurations

      from pydna_epbd.simulation.simulation_steps import run_sequences

      if __name__ == "__main__":

          """This runs the simulation."""

          job_idx = 0

          # array job

          if "SLURM_ARRAY_TASK_ID" in os.environ:

              job_idx = int(os.environ["SLURM_ARRAY_TASK_ID"])

          input_configs = read_configurations("examples/p5/configs.txt")

          # dividing the input sequences to the nodes based on job-idx

          chunk_size = math.ceil(len(input_configs.sequences) / input_configs.n_nodes)

          sequence_chunks = [

              input_configs.sequences[x : x + chunk_size]

              for x in range(0, len(input_configs.sequences), chunk_size)

          ]

          sequences = sequence_chunks[job_idx]

          print(f"job_idx:{job_idx}, n_seqs:{len(sequences)}")

          run_sequences(sequences, input_configs)

The above options will generate outputs in the *outputs* directory. The average coordinate and flipping profiles are plotted below.

.. |a| image:: plots/p5_wtmt_avg_coord.png

.. |b| image:: plots/p5_wtmt_avg_flip_1.414213562373096.png

.. list-table::
   :widths: 50 50
   :header-rows: 1

   * - Figure 2: Average coordinates.
     - Figure 3: Average flipping.
   * - |a|
     - |b|

Running the simulation on other datasets
========================================

.. code-block:: console

    unzip data/pydna_epbd_data.zip -d data
    python -m pydna_epbd.run --config_filepath examples/86_seqs/configs.txt
    python -m pydna_epbd.run --config_filepath examples/gcpbm/configs.txt
    python -m pydna_epbd.run --config_filepath examples/p5/configs.txt
    python -m pydna_epbd.run --config_filepath examples/qfactor/configs.txt
    python -m pydna_epbd.run --config_filepath examples/selex/configs.txt

Results
=======================

Additional results are shown below for quick reference.

.. figure:: plots/Bubbles.png
   :width: 60%
   :align: center

   Figure 4: Overview of the bubble tensor for the P5 wild type and mutant at different thresholds.

.. |P5_flips| image:: plots/P5_flips.png
   :width: 45%

.. |P5_qfactors| image:: plots/P5_qfactors.png
   :width: 45%

|P5_flips| |P5_qfactors|

Figure 5: P5 Q-factor analysis.

.. figure:: plots/svr_rbf_perf_comparison_selex.png
   :width: 55%
   :align: center

   Figure 6: Utility of breathing characteristics for TF binding specificity on the SELEX data.

.. figure:: plots/88seqs_seqlen_vs_runtime.png
   :width: 45%
   :align: center

   Figure 7: Scalability analysis.

Acknowledgments
========================================

Los Alamos National Lab (LANL), T-1

Copyright Notice
========================================

© (or copyright) 2023. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

License
========================================

This program is open source under the BSD-3 License.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Authors
========================================

- `Anowarul Kabir `_: Computer Science, George Mason University
- `Manish Bhattarai `_: Theoretical Division, Los Alamos National Laboratory
- `Kim Rasmussen `_: Theoretical Division, Los Alamos National Laboratory
- `Amarda Shehu `_: Computer Science, George Mason University
- `Anny Usheva `_: Surgery, Rhode Island Hospital and Brown University
- `Alan Bishop `_: Theoretical Division, Los Alamos National Laboratory
- `Boian S. Alexandrov `_: Theoretical Division, Los Alamos National Laboratory

How to Cite pyDNA-EPBD?
========================================

.. code-block:: console

      @software{pyDNA_EPBD,

      author       = {Kabir, Anowarul and 

                        Bhattarai, Manish and

                        Rasmussen, Kim and 

                        Shehu, Amarda and 

                        Usheva, Anny and 

                        Bishop, Alan and 

                        Alexandrov, Boian},

      title        = {pyDNA-EPBD: A Python-based Implementation of the Extended Peyrard-Bishop-Dauxois Model for DNA Breathing Dynamics Simulation},

      month        = Aug,

      year         = 2023,

      publisher    = {Zenodo},

      version      = {v1.0.0},

      doi          = {10.5281/zenodo.8222805},

      url          = {https://doi.org/10.5281/zenodo.8222805}

      }

How to Cite gcPBM and HT-SELEX datasets?
========================================

.. code-block:: console

   @article{htselex-and-gcpbm-data,

     title = {Expanding the repertoire of DNA shape features for genome-scale studies of transcription factor binding},

     volume = {45},

     ISSN = {1362-4962},

     url = {http://dx.doi.org/10.1093/nar/gkx1145},

     DOI = {10.1093/nar/gkx1145},

     number = {22},

     journal = {Nucleic Acids Research},

     publisher = {Oxford University Press (OUP)},

     author = {Li,  Jinsen and Sagendorf,  Jared M. and Chiu,  Tsu-Pei and Pasi,  Marco and Perez,  Alberto and Rohs,  Remo},

     year = {2017},

     month = nov,

     pages = {12877–12887}

   }