https://github.com/kris96tian/rna_app

RNA Secondary Structure Predictor

Last synced: 5 months ago

Host: GitHub
URL: https://github.com/kris96tian/rna_app
Owner: kris96tian
Created: 2025-01-19T00:23:01.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-01-19T01:00:31.000Z (9 months ago)
Last Synced: 2025-01-19T01:21:41.253Z (9 months ago)
Language: Python
Homepage: https://rna-secondary.streamlit.app/
Size: 0 Bytes
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


README

# RNA Secondary Structure Predictor

## App

[FLASK (faster) RNA Secondary Structure App](https://kr1571an.pythonanywhere.com/)

[Streamlit RNA Secondary Structure App](https://rna-secondary.streamlit.app/?embed_options=dark_theme)

![image](https://github.com/user-attachments/assets/c512742a-50bc-4a82-9cb0-15f91413680b)

An interactive tool that predicts and visualizes RNA secondary structures using dynamic programming. Given an input RNA sequence, it predicts the base pairs and draws the resulting structure.

The algorithm I implemented in this code is related to **Nussinov’s algorithm**.

---

### **Objective**

Predict an optimal secondary structure for an RNA sequence by **maximizing the number of base pairs** under specific constraints.

---

### **Definitions**

1. **RNA Sequence**:

   - A string of nucleotides: adenine (**A**), uracil (**U**), guanine (**G**), and cytosine (**C**).

   - Valid base-pair interactions:

     - Canonical: A-U, U-A, G-C, C-G

     - Wobble: G-U, U-G

2. **Dot-Bracket Notation**:

   - The predicted structure is represented as a string of `(`, `)`, and `.`:

     - `.`: Unpaired base.

     - `(`, `)`: Paired bases, indicating a bond.

3. **Dynamic Programming Approach**:

   - DP table `dp[i][j]` where:

     - `i` and `j` represent indices of the RNA sequence.

     - `dp[i][j]` stores the maximum number of base pairs in the subsequence from `i` to `j`.
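
The pairing rules and the dot-bracket notation above can be sketched in a few lines of Python. This is an illustrative sketch, not the app's actual code: the names `VALID_PAIRS`, `can_pair`, and `parse_dot_bracket` are hypothetical, and indices are assumed to be 0-based.

```python
# Canonical (A-U, G-C) and wobble (G-U) pairs, in both orientations.
VALID_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def can_pair(a, b):
    """Return True if nucleotides a and b form a canonical or wobble pair."""
    return (a.upper(), b.upper()) in VALID_PAIRS


def parse_dot_bracket(structure):
    """Convert a dot-bracket string into a sorted list of 0-based (i, j) pairs."""
    open_positions = []  # stack of indices of '(' not yet closed
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(index)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {index}")
            pairs.append((open_positions.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {index}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


print(can_pair("G", "U"))           # wobble pair
print(parse_dot_bracket("((..))"))  # nested pairs
```

A stack is enough for parsing because dot-bracket strings contain only nested, non-crossing pairs.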

---

### **Steps**

1. **DP Table initialization**:

   - A 2D matrix `dp` of size \(n \times n\) (where \(n\) is the sequence length) is initialized to 0.

   - A separate `traceback` matrix is used to store the decisions for reconstructing the structure later.

2. **DP Table iterative filling**:

   - Start with short subsequences and expand to larger ones.

   - For each pair of indices `(i, j)` (where \(j > i\)):

     - **Case 1: Unpaired**: The nucleotide at position `j` is left unpaired:

       \[

       dp[i][j] = dp[i][j-1]

       \]

     - **Case 2: Paired**: If `sequence[i]` can pair with `sequence[j]`:

       - Add 1 (for the new pair) to the solution of the inner subsequence \((i+1, j-1)\):

         \[

         dp[i][j] = \max(dp[i][j], 1 + dp[i+1][j-1])

         \]

     - **Case 3: Bifurcation**: Split the subsequence into two parts:

       - Combine solutions from two non-overlapping subsequences \((i, k)\) and \((k+1, j)\):

         \[

         dp[i][j] = \max(dp[i][j], dp[i][k] + dp[k+1][j]) \quad \text{for } k \in (i + \text{MIN\_LOOP}, j - \text{MIN\_LOOP})

         \]

   **Constraints**:

   - Min-loop-size: Ensures there are at least `MIN_LOOP` unpaired bases between paired bases \(i\) and \(j\).

   - Valid base pairing: Only canonical (A-U, G-C) and wobble (G-U) pairs are allowed.

3. **Traceback**:

   - Reconstructs an optimal structure from the filled tables.

   - Starting from the full sequence `(0, n-1)`, decisions stored in `traceback[i][j]` determine:

     - Whether `i` and `j` are paired.

     - If the subsequence bifurcates into two smaller problems.

4. **Output**:

   - **Dot-bracket notation** for the secondary structure.

   - A list of base pairs (i.e., indices of paired bases).
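
The steps above can be sketched as a single Python function. This is a minimal sketch in the textbook Nussinov form, not necessarily the code running in the app. The function name `nussinov`, the default `min_loop=1`, and the encoding of `traceback` entries are assumptions made for illustration. It also tries every split point `k` with `i <= k < j`, a wider range than the one stated above, so that structures leaving the first base of a span unpaired are found as well.

```python
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}  # canonical + wobble


def nussinov(sequence, min_loop=1):
    """Return (dot_bracket, pairs) maximizing the number of base pairs.

    min_loop is the minimum number of bases enclosed by any pair
    (hypothetical default; the app's actual value is not stated).
    """
    seq = sequence.upper()
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    # traceback[i][j] is None (span too short), "U" (j unpaired),
    # "P" (i pairs with j), or an int k (split into [i..k] and [k+1..j]).
    traceback = [[None] * n for _ in range(n)]

    # Fill from short spans to long ones; spans shorter than
    # min_loop + 2 cannot contain a pair and stay at 0.
    for length in range(min_loop + 2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best, choice = dp[i][j - 1], "U"          # Case 1: j unpaired
            if seq[i] + seq[j] in VALID_PAIRS:        # Case 2: i pairs with j
                candidate = dp[i + 1][j - 1] + 1
                if candidate > best:
                    best, choice = candidate, "P"
            for k in range(i, j):                     # Case 3: bifurcation
                candidate = dp[i][k] + dp[k + 1][j]
                if candidate > best:
                    best, choice = candidate, k
            dp[i][j], traceback[i][j] = best, choice

    # Iterative traceback from the full span (0, n - 1).
    pairs = []
    pending = [(0, n - 1)]
    while pending:
        i, j = pending.pop()
        if i >= j or traceback[i][j] is None:
            continue
        choice = traceback[i][j]
        if choice == "U":
            pending.append((i, j - 1))
        elif choice == "P":
            pairs.append((i, j))
            pending.append((i + 1, j - 1))
        else:
            pending.append((i, choice))
            pending.append((choice + 1, j))

    structure = ["."] * n
    for i, j in pairs:
        structure[i], structure[j] = "(", ")"
    return "".join(structure), sorted(pairs)


print(nussinov("GGGAAACCC"))  # → ('(((...)))', [(0, 8), (1, 7), (2, 6)])
```

When several structures tie for the maximum number of pairs, which one is returned depends on the order in which the cases are tried, so a different implementation may report a different but equally good structure.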

---

### **Mathematical Recurrence**

The algorithm can be described using the recurrence relation:

\[
dp[i][j] = \max
\begin{cases}
dp[i][j-1] & \text{(Unpaired)} \\
dp[i+1][j-1] + 1 & \text{(Paired, if can\_pair(sequence[i], sequence[j]))} \\
\max_{k = i + \text{MIN\_LOOP}}^{\,j - \text{MIN\_LOOP}} \left( dp[i][k] + dp[k+1][j] \right) & \text{(Bifurcation)}
\end{cases}
\]
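
A recurrence like this can also be written as a memoized recursive function, which is handy for cross-checking a table-based implementation on short sequences. The sketch below returns only the score. The name `max_pairs` and the default `min_loop=1` are hypothetical, and the split point is assumed to range over all `i <= k < j` (the textbook Nussinov form) rather than the narrower range written above.

```python
from functools import lru_cache

VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}  # canonical + wobble


def max_pairs(sequence, min_loop=1):
    """Maximum number of base pairs, computed top-down with memoization."""
    seq = sequence.upper()

    @lru_cache(maxsize=None)
    def best(i, j):
        # A span with fewer than min_loop bases between i and j holds no pair.
        if j - i <= min_loop:
            return 0
        score = best(i, j - 1)                          # j unpaired
        if seq[i] + seq[j] in VALID_PAIRS:              # i pairs with j
            score = max(score, best(i + 1, j - 1) + 1)
        for k in range(i, j):                           # bifurcation
            score = max(score, best(i, k) + best(k + 1, j))
        return score

    return best(0, len(seq) - 1) if seq else 0


print(max_pairs("GGGAAACCC"))  # → 3
```

The recursion depth grows with the sequence length, so this form suits short test sequences; the bottom-up table avoids that limit.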

---

### **Time Complexity**

- **Table Filling**:

  - The outer loop iterates over subsequence lengths \(L\) from `MIN_LOOP + 2` to \(n\), and the inner loop iterates over start indices \(i\), with \(j = i + L - 1\).

  - For each cell `(i, j)`, there is an additional loop over possible bifurcation points \(k\).

  - **Complexity**: \(O(n^3)\).

- **Traceback**:

  - Linear in the number of base pairs.

  - **Complexity**: \(O(n)\).

Overall time complexity is \(O(n^3)\).
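
The cubic growth can be checked empirically by counting the candidate split points visited during table filling. This is a rough sketch under stated assumptions: `count_inner_steps` is a hypothetical helper, `min_loop=1` is an assumed default, and every `k` with `i <= k < j` is counted as one unit of work.

```python
def count_inner_steps(n, min_loop=1):
    """Count split points k visited while filling an n x n table."""
    steps = 0
    for length in range(min_loop + 2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            steps += j - i  # split points k = i, ..., j - 1
    return steps


small = count_inner_steps(100)
large = count_inner_steps(200)
print(small, large, large / small)
```

Doubling \(n\) multiplies the count by roughly 8, as expected for an \(O(n^3)\) algorithm.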

---

### **Biological Significance**

This is a simplified model of RNA secondary structure prediction:

- It **maximizes base pairing** without considering thermodynamic stability or pseudoknots.

- Extensions like Zuker’s algorithm can incorporate free-energy minimization for more accurate predictions.

---

### **Example**

For the sequence `AUGCGUA`:

1. `dp[i][j]` will compute the maximum number of pairs.

2. Traceback reconstructs the structure:

   - Dot-bracket: `.((..))`

   - Base pairs (0-based indices): [(1, 6), (2, 5)]
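
A quick way to sanity-check an example like this is to render the pair list as dot-bracket and confirm that every pair obeys the pairing rules. The sketch below assumes 0-based indices; `pairs_to_dot_bracket` and `pairs_are_valid` are hypothetical helper names, not functions from the app.

```python
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}  # canonical + wobble


def pairs_to_dot_bracket(length, pairs):
    """Render 0-based (i, j) pairs as a dot-bracket string of the given length."""
    structure = ["."] * length
    for i, j in pairs:
        if not 0 <= i < j < length:
            raise ValueError(f"invalid pair ({i}, {j}) for length {length}")
        structure[i], structure[j] = "(", ")"
    return "".join(structure)


def pairs_are_valid(sequence, pairs, min_loop=1):
    """Check pairing rules and the minimum loop size for every pair."""
    seq = sequence.upper()
    return all(
        seq[i] + seq[j] in VALID_PAIRS and j - i - 1 >= min_loop
        for i, j in pairs
    )


sequence = "AUGCGUA"
pairs = [(1, 6), (2, 5)]
print(pairs_to_dot_bracket(len(sequence), pairs))
print(pairs_are_valid(sequence, pairs))
```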
