https://github.com/kris96tian/rna_app
RNA Secondary Structure Predictor
- Host: GitHub
- URL: https://github.com/kris96tian/rna_app
- Owner: kris96tian
- Created: 2025-01-19T00:23:01.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-01-19T01:00:31.000Z (9 months ago)
- Last Synced: 2025-01-19T01:21:41.253Z (9 months ago)
- Language: Python
- Homepage: https://rna-secondary.streamlit.app/
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# RNA Secondary Structure Predictor
## App
- [Flask (faster) RNA Secondary Structure App](https://kr1571an.pythonanywhere.com/)
- [Streamlit RNA Secondary Structure App](https://rna-secondary.streamlit.app/?embed_options=dark_theme)

An interactive tool for predicting and visualizing RNA secondary structures using dynamic programming. Given an input RNA sequence, it predicts the base pairs and renders a visual representation of the resulting structure.

The algorithm I implemented in this code is closely related to **Nussinov’s algorithm**.

---
### **Objective**
Predict an optimal secondary structure for an RNA sequence by **maximizing the number of base pairs**, subject to valid base-pairing rules and a minimum loop size.

---
### **Definitions**
1. **RNA Sequence**:
   - A string of nucleotides: adenine (**A**), uracil (**U**), guanine (**G**), and cytosine (**C**).
   - Valid base-pair interactions:
     - Canonical: A-U, U-A, G-C, C-G
     - Wobble: G-U, U-G
2. **Dot-Bracket Notation**:
   - The predicted structure is represented as a string of `(`, `)`, and `.`:
     - `.`: Unpaired base.
     - `(`, `)`: Paired bases, indicating a bond.
3. **Dynamic Programming Approach**:
   - DP table `dp[i][j]` where:
     - `i` and `j` represent indices of the RNA sequence.
     - `dp[i][j]` stores the maximum number of base pairs in the subsequence from `i` to `j`.

---
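The pairing rules and the dot-bracket notation defined above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: `can_pair` borrows the name used in the recurrence later in this README, while `VALID_PAIRS` and `to_dot_bracket` are hypothetical names, and indices are assumed to be 0-based.

```python
# Canonical and wobble pairs, stored as ordered tuples (assumed representation).
VALID_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def can_pair(a, b):
    """Return True if nucleotides a and b form a canonical or wobble pair."""
    return (a.upper(), b.upper()) in VALID_PAIRS


def to_dot_bracket(length, pairs):
    """Render 0-based (i, j) pairs with i < j as a dot-bracket string."""
    structure = ["."] * length
    for i, j in pairs:
        structure[i] = "("
        structure[j] = ")"
    return "".join(structure)


print(can_pair("G", "U"))                   # True
print(to_dot_bracket(7, [(1, 6), (2, 5)]))  # .((..))
```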
### **Steps**
1. **DP Table initialization**:
   - A 2D matrix `dp` of size \(n \times n\) (where \(n\) is the sequence length) is initialized to 0.
   - A separate `traceback` matrix stores the decision taken at each cell so the structure can be reconstructed later.
2. **DP Table iterative filling**:
   - Start with short subsequences and expand to longer ones.
   - For each pair of indices `(i, j)` (where \(j > i\)):
     - **Case 1: Unpaired**: The nucleotide at position `j` is left unpaired:
       \[
       dp[i][j] = dp[i][j-1]
       \]
     - **Case 2: Paired**: If `sequence[i]` can pair with `sequence[j]`, add 1 (for the new pair) to the solution of the inner subsequence \((i+1, j-1)\):
       \[
       dp[i][j] = \max(dp[i][j], 1 + dp[i+1][j-1])
       \]
     - **Case 3: Bifurcation**: Split the subsequence into two parts and combine the solutions of the two non-overlapping subsequences \((i, k)\) and \((k+1, j)\):
       \[
       dp[i][j] = \max(dp[i][j], dp[i][k] + dp[k+1][j]) \quad \text{for } k \in (i + \text{MIN\_LOOP}, j - \text{MIN\_LOOP})
       \]

   **Constraints**:
   - Min-loop-size: Ensures there are at least `MIN_LOOP` unpaired bases between paired bases \(i\) and \(j\).
   - Valid base pairing: Only the canonical and wobble pairs listed under Definitions are allowed.
3. **Traceback**:
   - Reconstructs the optimal solution.
   - Starting from the full sequence `(0, n-1)`, the decisions stored in `traceback[i][j]` determine:
     - Whether `i` and `j` are paired.
     - Whether the subsequence bifurcates into two smaller problems.
4. **Output**:
   - **Dot-bracket notation** for the secondary structure.
   - A list of base pairs (i.e., indices of paired bases).

---
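The four steps above can be sketched as one self-contained function. This is an illustrative sketch under stated assumptions, not the repository's implementation: the function name `nussinov`, the `"U"`/`"P"`/integer encoding of the `traceback` matrix, and the default `min_loop=1` are all hypothetical. It also lets the split point `k` range over every position from `i` to `j-1` (the textbook choice) instead of the narrower window given in Case 3.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(sequence, min_loop=1):
    """Return (dot_bracket, pairs) for one structure with the most base pairs."""
    n = len(sequence)
    dp = [[0] * n for _ in range(n)]
    # traceback[i][j] is None (no decision), "U" (j unpaired), "P" (i pairs
    # with j), or an integer k marking a split into (i, k) and (k + 1, j).
    traceback = [[None] * n for _ in range(n)]

    # Fill the table from short subsequences to long ones. Starting at
    # min_loop + 2 guarantees at least min_loop bases between i and j.
    for length in range(min_loop + 2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best, move = dp[i][j - 1], "U"            # Case 1: j unpaired
            if (sequence[i], sequence[j]) in PAIRS:   # Case 2: i pairs with j
                candidate = 1 + dp[i + 1][j - 1]
                if candidate > best:
                    best, move = candidate, "P"
            for k in range(i, j):                     # Case 3: bifurcation
                candidate = dp[i][k] + dp[k + 1][j]
                if candidate > best:
                    best, move = candidate, k
            dp[i][j], traceback[i][j] = best, move

    # Traceback: replay the stored decisions, starting from (0, n - 1).
    pairs = []
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j or traceback[i][j] is None:
            continue
        move = traceback[i][j]
        if move == "U":
            stack.append((i, j - 1))
        elif move == "P":
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            stack.append((i, move))
            stack.append((move + 1, j))

    # Output: dot-bracket string plus the sorted list of 0-based pairs.
    structure = ["."] * n
    for i, j in pairs:
        structure[i], structure[j] = "(", ")"
    return "".join(structure), sorted(pairs)


print(nussinov("GGGAAACCC", min_loop=3))
# ('(((...)))', [(0, 8), (1, 7), (2, 6)])
```

When several structures tie for the maximum, which one is returned depends on the order in which the cases are tried, so other implementations may report a different but equally sized set of pairs.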
### **Mathematical Recurrence**
The algorithm can be described using the recurrence relation:
\[
dp[i][j] = \max
\begin{cases}
dp[i][j-1] & \text{(Unpaired)} \\
dp[i+1][j-1] + 1 & \text{(Paired, if } \texttt{can\_pair(sequence[i], sequence[j])}\text{)} \\
\max_{k = i + \text{MIN\_LOOP}}^{j - \text{MIN\_LOOP}} \left( dp[i][k] + dp[k+1][j] \right) & \text{(Bifurcation)}
\end{cases}
\]

---
### **Time Complexity**
- **Table Filling**:
  - The outer loop iterates over subsequence lengths \(L\) from `MIN_LOOP + 2` to \(n\), and the inner loops iterate over \(i\) and \(j\).
  - For each cell `(i, j)`, there is an additional loop over possible bifurcation points \(k\).
  - **Complexity**: \(O(n^3)\).
- **Traceback**:
  - Linear in the number of base pairs.
  - **Complexity**: \(O(n)\).

Overall time complexity is \(O(n^3)\).
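The cubic growth can be checked empirically by counting how many split points the bifurcation loop visits. The sketch below is only an illustration: `count_bifurcation_steps` is a hypothetical helper, and it assumes that `k` ranges over every position from `i` to `j-1`, which gives an upper bound on the work done with a narrower window.

```python
def count_bifurcation_steps(n, min_loop=0):
    """Count the split points k visited while filling an n x n table."""
    steps = 0
    for length in range(min_loop + 2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            steps += j - i  # one step per split point k in i..j-1
    return steps


for n in (50, 100, 200):
    print(n, count_bifurcation_steps(n))
```

With `min_loop=0` the count equals \((n-1)\,n\,(n+1)/6\), so doubling \(n\) multiplies the work by roughly 8, as expected for \(O(n^3)\).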
---
### **Biological Significance**
This is a simplified model for RNA secondary structure prediction:
- It **maximizes base pairing** without considering thermodynamic stability or pseudoknots.
- More elaborate approaches such as Zuker’s algorithm incorporate free-energy minimization for more accurate predictions.

---
### **Example**
For the sequence `AUGCGUA`:
1. `dp[i][j]` holds the maximum number of pairs for each subsequence.
2. Traceback reconstructs the structure:
   - Dot-bracket: `.((..))`
   - Base pairs (0-based indices): [(1, 6), (2, 5)]
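
To cross-check an example like this one, a dot-bracket string can be parsed back into index pairs with a stack. The helper below is a hypothetical sketch (the name `parse_dot_bracket` is not from the repository) and assumes 0-based indices. For instance, the pairs (1, 6) and (2, 5) on a sequence of length 7 correspond to the string `.((..))`.

```python
def parse_dot_bracket(structure):
    """Return the sorted 0-based (i, j) pairs encoded by a dot-bracket string."""
    stack = []  # indices of '(' that are still waiting for a partner
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at index {index}")
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at index {index}")
    if stack:
        raise ValueError(f"unmatched '(' at index {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket(".((..))"))  # [(1, 6), (2, 5)]
```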