https://github.com/BioJulia/BioMarkovChains.jl

Representing biological sequences as Markov chains

bioinformatics biology dna julia julialang markov-chain


Host: GitHub
URL: https://github.com/BioJulia/BioMarkovChains.jl
Owner: BioJulia
License: mit
Created: 2023-07-11T15:11:46.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-12-09T05:52:17.000Z (10 months ago)
Last Synced: 2024-12-11T11:28:40.038Z (10 months ago)
Topics: bioinformatics, biology, dna, julia, julialang, markov-chain
Language: Julia
Homepage: https://biojulia.dev/BioMarkovChains.jl/dev/
Size: 3.57 MB
Stars: 8
Watchers: 1
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md
- Citation: CITATION.cff


[![Documentation](https://img.shields.io/badge/documentation-online-blue.svg?logo=Julia&logoColor=white)](https://biojulia.dev/BioMarkovChains.jl/dev/)
[![Latest Release](https://img.shields.io/github/release/BioJulia/BioMarkovChains.jl.svg)](https://github.com/BioJulia/BioMarkovChains.jl/releases/latest)
[![DOI](https://zenodo.org/badge/665161607.svg)](https://zenodo.org/badge/latestdoi/665161607)

[![CI Workflow](https://github.com/BioJulia/BioMarkovChains.jl/actions/workflows/CI.yml/badge.svg)](https://github.com/BioJulia/BioMarkovChains.jl/actions/workflows/CI.yml)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/BioJulia/BioMarkovChains.jl/blob/main/LICENSE)
[![Work in Progress](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)
[![Downloads](https://img.shields.io/badge/dynamic/json?url=http%3A%2F%2Fjuliapkgstats.com%2Fapi%2Fv1%2Fmonthly_downloads%2FBioMarkovChains&query=total_requests&suffix=%2Fmonth&label=Downloads)](http://juliapkgstats.com/pkg/BioMarkovChains)
[![Aqua QA](https://raw.githubusercontent.com/JuliaTesting/Aqua.jl/master/badge.svg)](https://github.com/JuliaTesting/Aqua.jl)
[![JET](https://img.shields.io/badge/%E2%9C%88%EF%B8%8F%20tested%20with%20-%20JET.jl%20-%20red)](https://github.com/aviatesk/JET.jl)

***

# BioMarkovChains

> A Julia package to represent biological sequences as Markov chains

## Installation

BioMarkovChains is a Julia Language package. To install BioMarkovChains, open Julia's interactive session (known as the REPL), press the `]` key to enter package mode, and then type the following command:

```julia
pkg> add BioMarkovChains
```

## Creating Markov chains out of DNA sequences

An important step before developing gene-finding algorithms is having a Markov chain representation of the DNA. To do so, we implemented the `BioMarkovChain` method, which captures the initial and transition probabilities of a DNA sequence (`LongSequence`) and creates a dedicated object storing the relevant information of a DNA Markov chain. Here is an example:

Let's find one ORF in a random `LongDNA`:

```julia
using BioSequences, BioMarkovChains, GeneFinder

# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"

orfseq = findorfs(seq)[3] |> sequence

21nt DNA Sequence:
ATGCGTCGAATGGCACGGTGA
```

If we translate it, we get a 7aa sequence:

```julia
translate(orfseq)

7aa Amino Acid Sequence:
MRRMAR*
```

Now, supposing I want to see how transitions occur in this ORF sequence, I can use the `BioMarkovChain` method and tune it to a 2nd-order Markov chain:

```julia
BioMarkovChain(orfseq, 2)

BioMarkovChain of DNA alphabet and order 1:
- Transition Probability Matrix -> Matrix{Float64}(4 × 4):
0.25 0.25 0.0 0.5
0.25 0.0 0.75 0.0
0.25 0.25 0.25 0.25
0.0 0.25 0.75 0.0
- Initial Probabilities -> Vector{Float64}(4 × 1):
0.2 0.2 0.4 0.2

```
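
To make the printed numbers concrete, here is a minimal plain-Julia sketch that counts adjacent nucleotide pairs in the ORF and row-normalizes the counts. It does not call BioMarkovChains at all: the helper names `transition_counts`, `transition_matrix`, and `initial_probabilities` are hypothetical, and deriving the initial probabilities from the row totals of the count matrix is an assumption that happens to reproduce the values printed above, not a statement about the package internals.

```julia
# Illustrative sketch only: plain Julia, no BioMarkovChains internals.

# Count how often nucleotide i is immediately followed by nucleotide j
# (rows and columns ordered A, C, G, T).
function transition_counts(seq::AbstractString)
    index = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)
    symbols = collect(seq)
    counts = zeros(Float64, 4, 4)
    for k in 1:length(symbols)-1
        counts[index[symbols[k]], index[symbols[k+1]]] += 1
    end
    return counts
end

# Row-normalize the counts; rows with no observations stay all zeros.
function transition_matrix(counts::AbstractMatrix)
    totals = sum(counts, dims=2)
    return counts ./ max.(totals, 1)
end

# Assumption: initial probabilities are the share of transitions leaving each state.
initial_probabilities(counts::AbstractMatrix) = vec(sum(counts, dims=2)) ./ sum(counts)

orf_counts = transition_counts("ATGCGTCGAATGGCACGGTGA")
P = transition_matrix(orf_counts)
p0 = initial_probabilities(orf_counts)
```

With the 21 nt ORF above, `P` has the same four rows as the printed transition matrix and `p0` is `[0.2, 0.2, 0.4, 0.2]`. Both are first-order quantities, since each row conditions on a single preceding nucleotide.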

But I can also have a `BioMarkovChain` instance of the amino acid sequence:

```julia
BioMarkovChain(translate(orfseq), 2)

BioMarkovChain of AminoAcid alphabet and order 1:
- Transition Probability Matrix -> Matrix{Float64}(20 × 20):
0.0 1.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.333 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.5 0.5 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0
- Initial Probabilities -> Vector{Float64}(20 × 1):
0.167 0.5 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0

```

This is useful for later creating HMMs and calculating the probability of a sequence under a given model. For instance, the *E. coli* CDS and No-CDS transition models (Markov chains) are already implemented:

```julia
ECOLICDS

BioMarkovChain of DNA alphabet and order 1:
- Transition Probability Matrix -> Matrix{Float64}(4 × 4):
0.31 0.224 0.199 0.268
0.251 0.215 0.313 0.221
0.236 0.308 0.249 0.207
0.178 0.217 0.338 0.267
- Initial Probabilities -> Vector{Float64}(4 × 1):
0.245 0.243 0.273 0.239
```

What is then the probability of the previous DNA sequence given this model?

```julia
markovprobability(orfseq, model=ECOLICDS, logscale=true)

-39.71754773536592
```
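
As a rough cross-check of what this number means: a first-order chain scores a sequence as the initial probability of its first symbol multiplied by the transition probability of every adjacent pair. The sketch below hard-codes the rounded `ECOLICDS` values printed above and sums base-2 logarithms. The function name `log2_chain_probability` is hypothetical, and the base-2 choice is an assumption inferred from the fact that it lands close to the value shown, not something taken from the package source.

```julia
# Illustrative sketch only: rounded ECOLICDS values copied from the printed model
# (rows and columns ordered A, C, G, T).
ecoli_cds_tpm = [0.31  0.224 0.199 0.268
                 0.251 0.215 0.313 0.221
                 0.236 0.308 0.249 0.207
                 0.178 0.217 0.338 0.267]
ecoli_cds_init = [0.245, 0.243, 0.273, 0.239]

# Sum of log2(initial probability of the first symbol) and
# log2(transition probability) for every adjacent pair.
function log2_chain_probability(seq::AbstractString, tpm, init)
    index = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)
    states = [index[c] for c in seq]
    total = log2(init[states[1]])
    for k in 2:length(states)
        total += log2(tpm[states[k-1], states[k]])
    end
    return total
end

# Assumption: base-2 logarithm; with these rounded inputs the result is about -39.72.
orf_score = log2_chain_probability("ATGCGTCGAATGGCACGGTGA", ecoli_cds_tpm, ecoli_cds_init)
```

Working in log space avoids numerical underflow when multiplying many small probabilities, which matters once sequences are much longer than this 21 nt ORF.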

This is of course not very informative on its own, but we can later use different criteria to classify new ORFs. For a more detailed explanation, see the [docs](https://BioJulia.dev/BioMarkovChains.jl/dev/).
