https://github.com/soedinglab/bammmotif2
Bayesian Markov Model motif discovery tool version 2 - An expectation maximization algorithm for the de novo discovery of enriched motifs as modelled by higher-order Markov models.
- Host: GitHub
- URL: https://github.com/soedinglab/bammmotif2
- Owner: soedinglab
- License: gpl-3.0
- Created: 2016-08-09T13:37:52.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2021-02-01T12:03:16.000Z (over 4 years ago)
- Last Synced: 2025-04-05T00:51:17.761Z (7 months ago)
- Topics: bioinformatics, chip-seq, motif-analysis, motif-discovery, ngs-analysis
- Language: C++
- Homepage: https://bammmotif.mpibpc.mpg.de/
- Size: 9.23 MB
- Stars: 13
- Watchers: 9
- Forks: 5
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
# BaMM!motif - v2
**Ba**yesian **M**arkov **M**odel **motif** discovery software (version 2).
(C) Johannes Soeding, Wanwan Ge, Anja Kiesel, Matthias Siebert
## Requirements
To compile from source, you need:

* [GCC](https://gcc.gnu.org/) compiler 4.7 or later (we suggest GCC-5.x)
* [CMake](http://cmake.org/) 2.8.11 or later

C++ packages

* [Boost](http://www.boost.org/)

To plot BaMM logos you need R and several R packages

* [R](https://cran.r-project.org/) 2.14.1 or later
* install.packages( "zoo" )
* install.packages( "argparse" )
* install.packages( "fdrtool" )
* install.packages( "LSD" )
* install.packages( "grid" )
* install.packages( "gdata" )

## Installation
### Clone it from GIT

    git clone https://github.com/soedinglab/BaMMmotif2.git BaMMmotif
    cd BaMMmotif

### How to compile BaMM!motif?
#### Linux

    mkdir build
    cd build
    cmake -DCMAKE_INSTALL_PREFIX=${HOME}/opt/BaMM ..
    make
    make install

Adjust `${HOME}/opt/BaMM` if you want to change the directory for installation.

#### OS X
OS X ships clang instead of gcc. We recommend using [Homebrew](http://brew.sh/) to install gcc.

Once Homebrew is installed, all required dependencies can be installed using the `brew` command:

    brew tap homebrew/versions
    brew tap homebrew/science
    brew install gcc5 cmake R

#### Compilation
    export CXX=g++-5
    export CC=gcc-5
    export LDFLAGS="-static-libgcc -static-libstdc++"

    mkdir build
    cd build
    cmake -DCMAKE_INSTALL_PREFIX=${HOME}/opt/BaMM ..
    make
    make install

#### Environment setup
Add this line to your `$HOME/.bashrc` (or `.zshrc`...) to add BaMMmotif to your PATH:

    export PATH=${PATH}:${HOME}/opt/BaMM/bin

Update your environment:

    source $HOME/.bashrc

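To confirm that the new `PATH` entry is picked up, the lookup can be scripted. The following is a minimal Python sketch using the standard-library function `shutil.which`. The helper name `find_tool` is made up for this illustration, and the tool names are simply the executables this README mentions in `${HOME}/opt/BaMM/bin`.

```python
import shutil


def find_tool(name):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    # Executables named in this README; adjust the list to your installation.
    for tool in ("BaMMmotif", "FDR", "BaMMScan"):
        path = find_tool(tool)
        print(f"{tool}: {path if path else 'not found on PATH'}")
```

If a tool is reported as not found, check that the `export PATH=...` line was added to the shell start-up file that your shell actually reads.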
## How to use BaMM!motif from the command line?
### SYNOPSIS

    BaMMmotif DIRPATH FILEPATH [OPTIONS]

### DESCRIPTION

Bayesian Markov Model motif discovery software.

    DIRPATH
        Output directory for the results.

    FILEPATH
        FASTA file with positive sequences of equal length.

### OPTIONS
Sequence options

    --alphabet
        STANDARD. For alphabet type ACGT, default setting;
        METHYLC. For alphabet type ACGTM;
        HYDROXYMETHYLC. For alphabet type ACGTH;
        EXTENDED. For alphabet type ACGTMH.

    --ss
        Search motifs only on a single strand (positive sequences).
        This option is not recommended for analyzing ChIP-seq data.
        By default, BaMM searches motifs on both strands.

    --negSeqSet
        FASTA file with negative/background sequences used to learn the
        (homogeneous) background BaMM. If not specified, the background BaMM
        is learned from the positive sequences.

Options to initialize BaMM(s) from file

    --bindingSiteFile
        File with binding sites of equal length (one per line).

    --PWMFile
        File that contains position weight matrices (PWMs).

    --BaMMFile
        File that contains a model in bamm file format.

    --maxPWM
        Number of models to be learned by BaMM!motif, specific for PWMs.

Options for the (inhomogeneous) motif BaMMs

    -k|--order
        Model order. The default is 2.

    -a|--alpha [...]
        Order-specific prior strength. The default is 1.0 (for k = 0) and
        beta x gamma^k (for k > 0). If -a is given, the options -b and -g
        are ignored.

    -b|--beta
        Calculate order-specific alphas according to beta x gamma^k (for
        k > 0). The default is 7.0.

    -g|--gamma
        Calculate order-specific alphas according to beta x gamma^k (for
        k > 0). The default is 3.0.

    --extend {1,2}
        Extend BaMMs by adding uniformly initialized positions to the left
        and/or right of initial BaMMs. Invoking e.g. with --extend 0 2 adds
        two positions to the right of initial BaMMs. Invoking with --extend 2
        adds two positions to both sides of initial BaMMs. By default, BaMMs
        are not extended.

    -q
        Prior probability for a positive sequence to contain a motif. The
        default is 0.9.

    -s, --sOrder
        The k-mer order used for sampling the pseudo/negative set. The default is 2.

Options for the (homogeneous) background BaMM

    -K
        Order. The default is 2.

    -A|--Alpha
        Prior strength. The default is 10.0.

    --bgModelFile
        Read in background model from a bamm-formatted file.

EM options

    --EM
        Triggers Expectation Maximization (EM) algorithm.

Gibbs sampling options

    --CGS
        Triggers Collapsed Gibbs Sampling (CGS) algorithm.

    --maxCGSIterations
        Limit the number of CGS iterations.
        It should be larger than 5 and defaults to 100.

Options for model evaluation

    --FDR
        Triggers False-Discovery-Rate (FDR) estimation.

    -m|--mFold
        Number of negative sequences as a multiple of positive sequences.
        The default is 10.

    -n, --cvFold
        Fold number for cross-validation.
        The default is 5, which means the training set is four times the
        size of the test set.

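Several of the numeric options in this section reduce to simple arithmetic. The following Python sketch works through three of them as a literal reading of the option descriptions: the order-specific prior strengths (1.0 for k = 0, beta x gamma^k for k > 0), the size of the negative set implied by `--mFold`, and the fold sizes implied by `--cvFold`. The function names are made up for this illustration and are not part of BaMM!motif. How BaMM!motif distributes leftover sequences across folds is not stated here, so the even split below is an assumption.

```python
def order_specific_alphas(order, beta=7.0, gamma=3.0):
    """Prior strengths for orders 0..order.

    alpha_0 = 1.0 and alpha_k = beta * gamma**k for k > 0,
    using the documented defaults beta = 7.0 and gamma = 3.0.
    """
    return [1.0] + [beta * gamma ** k for k in range(1, order + 1)]


def negative_set_size(n_positive, m_fold=10):
    """Number of negative sequences as a multiple of the positive ones."""
    return m_fold * n_positive


def cv_fold_sizes(n_sequences, cv_fold=5):
    """Sizes of the cross-validation folds.

    Assumption: leftover sequences are spread over the first folds.
    """
    base, extra = divmod(n_sequences, cv_fold)
    return [base + 1 if i < extra else base for i in range(cv_fold)]


if __name__ == "__main__":
    print(order_specific_alphas(2))   # [1.0, 21.0, 63.0]
    print(negative_set_size(1000))    # 10000
    print(cv_fold_sizes(1003))        # [201, 201, 201, 200, 200]
```

With the default `--cvFold 5`, each test fold holds roughly one fifth of the sequences and the remaining four fifths form the training set.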
Output options

    --saveBaMMs
        Write optimized BaMM(s) to disk.

    --saveInitBaMMs
        Write initialized BaMM(s) to disk.

    --verbose
        Verbose terminal printouts.

    -h, --help
        Print this help.

## Downstream analysis
### Evaluate the performance of BaMMs
For evaluating the optimized BaMM models, a file with extension `.stats` is required. It can be generated either by running `BaMMmotif` with the `--FDR` flag or by running the `FDR` program independently.

Either

    ${HOME}/opt/BaMM/bin/BaMMmotif [OUTPUT_DIR] [FASTAFILE] [MOTIF_FILE] [options] --FDR

or

    ${HOME}/opt/BaMM/bin/FDR [OUTPUT_DIR] [FASTAFILE] [MOTIF_FILE]

The R script `evaluateBaMM.R` is provided in the installation directory `${HOME}/opt/BaMM/bin` to calculate the performance score AUSFC and optionally plot the precision-recall curve, partial ROC, and sensitivity-FDR curve. You can run it like this:

    ${HOME}/opt/BaMM/bin/evaluateBaMM.R [INPUT_DIR] [PREFIX_OF_STATS_FILE] [options]

The options are:

* `--SFC 1` for plotting the sensitivity-false discovery rate curve.
* `--ROC5 1` for plotting the partial ROC with the first 5% of TPR.
* `--PRC 1` for plotting the precision-recall curve.

You will get the following plots:



The performance scores such as AUSFC, pAUC and AUPRC are written to the `.bmscore` file.
### How to plot BaMM logos?

The R script `plotBaMMLogo.R` is provided in the installation directory `${HOME}/opt/BaMM/bin` to plot the BaMM logo from a BaMM flat file.
It requires output files with extension `.ihbcp`, `.ihbp`, `.hbcp` or `.hbp` from BaMMmotif as input.
The logo order is an integer between 0 and 2.

    plotBaMMLogo.R [INPUT_DIR] [PREFIX_OF_OCCURRENCE_FILE] [LOGO_ORDER]

You will get the following plots:



### Motif distribution analysis
For visualizing the distribution of motifs in the sequence set, you need a `.occurrence` file, generated either by executing `BaMMmotif` with the `--scoreSeqset` flag or by executing `BaMMScan`.

Either

    ${HOME}/opt/BaMM/bin/BaMMmotif [OUTPUT_DIR] [FASTAFILE] [MOTIF_FILE] [options] --scoreSeqset

or

    ${HOME}/opt/BaMM/bin/BaMMScan [OUTPUT_DIR] [FASTAFILE] [MOTIF_FILE]

After obtaining a `.occurrence` file, you can run the R script `plotMotifDistribution.R` provided in the installation directory `${HOME}/opt/BaMM/bin` to visualise the motif distribution:

    ${HOME}/opt/BaMM/bin/plotMotifDistribution.R [INPUT_DIR] [PREFIX_OF_OCCURRENCE_FILE] [option]

The option is:

* `--ss 1` for plotting the motif distribution on a single strand only. Otherwise, the motif distribution on both strands is visualized.

You will get one of the following plots:


Note that this analysis currently only works for sequence sets in which all sequences have the same length.
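Because of this restriction, it can be worth checking the sequence lengths of a FASTA file before running the analysis. The following Python sketch counts how many sequences occur at each length; a single entry in the result means all sequences have equal length. The function names are made up for this illustration, and the parser only assumes the plain FASTA convention of `>` header lines followed by one or more sequence lines.

```python
from collections import Counter


def read_fasta_lengths(lines):
    """Yield (header, sequence_length) for each FASTA record in `lines`."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)  # sequences may span several lines
    if header is not None:
        yield header, length


def length_histogram(lines):
    """Map each sequence length to the number of sequences of that length."""
    return Counter(length for _, length in read_fasta_lengths(lines))


if __name__ == "__main__":
    example = [">seq1", "ACGT", "ACGT", ">seq2", "ACGTAC", "GT", ">seq3", "ACG"]
    histogram = length_histogram(example)
    print(dict(histogram))                      # {8: 2, 3: 1}
    print("equal length:", len(histogram) == 1)  # equal length: False
```

To check a real file, pass an open file handle instead of the example list, e.g. `length_histogram(open("positives.fasta"))`, where the file name is a placeholder.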
## BaMM flat file format
BaMM!motif generates two files for each inhomogeneous BaMM:
1. file with extension `.ihbp` contains probabilities of BaMM model;
2. file with extension `.ihbcp` contains conditional probabilities of BaMM model.
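Both files can be loaded with a few lines of code. The following Python sketch assumes the layout this section describes: blank lines separate positions, and each position has one line per order from 0 to *k*, so that the line for order *j* holds 4^(*j*+1) values for the ACGT alphabet. It further assumes whitespace-separated numbers and no header or comment lines, which this README does not state. The function names are made up for this illustration.

```python
def parse_bamm_flat_file(lines):
    """Parse the lines of a BaMM flat file into a list of positions.

    Each position is a list of rows; row j holds the values for order j.
    """
    positions, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:  # a blank line closes the current position
            if current:
                positions.append(current)
                current = []
            continue
        current.append([float(x) for x in fields])
    if current:  # the last position may not be followed by a blank line
        positions.append(current)
    return positions


def row_lengths_ok(positions, alphabet_size=4):
    """Check that the row for order j has alphabet_size**(j + 1) values."""
    return all(
        len(row) == alphabet_size ** (order + 1)
        for position in positions
        for order, row in enumerate(position)
    )


if __name__ == "__main__":
    # Toy order-1 model with two positions and uniform probabilities.
    order0 = " ".join(["0.25"] * 4)
    order1 = " ".join(["0.0625"] * 16)
    toy = [order0, order1, "", order0, order1]
    model = parse_bamm_flat_file(toy)
    print(len(model), "positions, order", len(model[0]) - 1)
    print("row lengths consistent:", row_lengths_ok(model))
```

For a real file, pass an open file handle as `lines`. The model order is then the number of rows per position minus one, and the motif length is the number of positions.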
The format is the same for these two files. While blank lines separate BaMM positions, lines 1 to *k*+1 of each BaMM position contain the (conditional) probabilities for order 0 to order *k*. For instance, the format for a BaMM of order 2 and length *W* is as follows:
Filename extension: `.ihbp`

    P1(A) P1(C) P1(G) P1(T)
    P1(AA) P1(AC) P1(AG) P1(AT) P1(CA) P1(CC) P1(CG) ... P1(TT)
    P1(AAA) P1(AAC) P1(AAG) P1(AAT) P1(ACA) P1(ACC) P1(ACG) ... P1(TTT)

    P2(A) P2(C) P2(G) P2(T)
    P2(AA) P2(AC) P2(AG) P2(AT) P2(CA) P2(CC) P2(CG) ... P2(TT)
    P2(AAA) P2(AAC) P2(AAG) P2(AAT) P2(ACA) P2(ACC) P2(ACG) ... P2(TTT)

    ...

    PW(A) PW(C) PW(G) PW(T)
    PW(AA) PW(AC) PW(AG) PW(AT) PW(CA) PW(CC) PW(CG) ... PW(TT)
    PW(AAA) PW(AAC) PW(AAG) PW(AAT) PW(ACA) PW(ACC) PW(ACG) ... PW(TTT)

Filename extension: `.ihbcp`

    P1(A) P1(C) P1(G) P1(T)
    P1(A|A) P1(C|A) P1(G|A) P1(T|A) P1(A|C) P1(C|C) P1(G|C) ... P1(T|T)
    P1(A|AA) P1(C|AA) P1(G|AA) P1(T|AA) P1(A|AC) P1(C|AC) P1(G|AC) ... P1(T|TT)

    P2(A) P2(C) P2(G) P2(T)
    P2(A|A) P2(C|A) P2(G|A) P2(T|A) P2(A|C) P2(C|C) P2(G|C) ... P2(T|T)
    P2(A|AA) P2(C|AA) P2(G|AA) P2(T|AA) P2(A|AC) P2(C|AC) P2(G|AC) ... P2(T|TT)

    ...

    PW(A) PW(C) PW(G) PW(T)
    PW(A|A) PW(C|A) PW(G|A) PW(T|A) PW(A|C) PW(C|C) PW(G|C) ... PW(T|T)
    PW(A|AA) PW(C|AA) PW(G|AA) PW(T|AA) PW(A|AC) PW(C|AC) PW(G|AC) ... PW(T|TT)

In addition, BaMM!motif generates two files for the homogeneous background BaMM:
1. file with extension `.hbp` contains probabilities of background model;
2. file with extension `.hbcp` contains conditional probabilities of background model.
For instance, the format for a background BaMM of order 2 is as follows:
Filename extension: `.hbp`

    P(A) P(C) P(G) P(T)
    P(AA) P(AC) P(AG) P(AT) P(CA) P(CC) P(CG) ... P(TT)
    P(AAA) P(AAC) P(AAG) P(AAT) P(ACA) P(ACC) P(ACG) ... P(TTT)

Filename extension: `.hbcp`

    P(A) P(C) P(G) P(T)
    P(A|A) P(C|A) P(G|A) P(T|A) P(A|C) P(C|C) P(G|C) ... P(T|T)
    P(A|AA) P(C|AA) P(G|AA) P(T|AA) P(A|AC) P(C|AC) P(G|AC) ... P(T|TT)

## License
BaMM!motif is released under the GNU General Public License v3 or later. See LICENSE for more details.
## Notes
We welcome bug reports! Please contact us at soeding@mpibpc.mpg.de.
For the seeding phase, we recommend using our de novo motif discovery tool [PEnG-motif](https://github.com/soedinglab/PEnG-motif).