https://github.com/mailund/ziphmm

Automatically exported from code.google.com/p/ziphmm
Host: GitHub
URL: https://github.com/mailund/ziphmm
Owner: mailund
License: gpl-3.0
Created: 2015-03-19T19:17:32.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2021-05-20T05:11:07.000Z (over 3 years ago)
Last Synced: 2023-10-20T21:29:22.348Z (about 1 year ago)
Language: C++
Size: 2.92 MB
Stars: 7
Watchers: 3
Forks: 10
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: COPYING

        
* Build procedure

* Getting started

* Using the C++ library

* Using the Python library

* Encoding HMMs

	* HMM example

	* C++ example

	* Python example

* Encoding sequences

* Executables

	* calibrate

	* build_forwarder

	* forward

	* generate_hmm

	* generate_seq

* Contact

# Build procedure

To build and install the library, unzip the directory and execute the
following commands in a terminal:

```bash
cd zipHMM-1.0.1/
pip install -r python_requirements.txt
cmake .
make
bin/calibrate
make test
make install
```

To build on OS X, the Accelerate framework is required (see
https://developer.apple.com/performance/accelerateframework.html). This
is included in the developer tools installed with Xcode (see
https://developer.apple.com/xcode/).

To build on Linux, CMake must be able to find an installation of a BLAS
implementation. For now the CMake script is set up to use ATLAS and to
look for it at /com/extra/ATLAS/3.9.84. This will most likely not work
on your machine, so you may have to change line 11 in
zipHMM/CMakeLists.txt:

```cmake
set(ATLAS_ROOT "/com/extra/ATLAS/3.9.84")
```

If you are using a BLAS implementation other than ATLAS, you will have
to make a few extra simple changes in zipHMM/CMakeLists.txt; look at
lines 12, 13, 32, 56, 76, 91, 106, 121, 159 and 192.

bin/calibrate finds the optimal number of threads to use in the

parallelized algorithm and saves the number in a file (default is

~/.ziphmm.devices).

# Ubuntu Procedure

*(if the above does not work for you and you're on Linux)*

1. Check if Python is installed:

```

which python

```

should return the location of the first Python installation in your PATH environment variable.

If not, then run:

```

sudo apt-get install python3 python3-dev

```

or, alternatively, if you want to use Python 2 for something else as well, do:

```

sudo apt-get install python python-dev

```

2. `cd` to or open a terminal in the directory in which you'd like to store the ZipHMM source code (this repository).

3. Clone the repository using git:

```
git clone https://github.com/mailund/ziphmm.git
```

Then `cd` into the directory `ziphmm` and do `pip install -r python_requirements.txt`.

4. First make sure CMake is installed: `sudo apt install cmake`.  Then run:

```

cmake .

```

That should return success for detecting your Python install, but not necessarily an R install.  If you'd like to use R with this library, then you'll have to install R first before building ZipHMM.

5. From within that same directory try running `make`.  If it fails in regard to pthread calls not being found, then run:

```

grep -rl pthread ./ | xargs sed -i 's/lpthread/pthread/g'

```

which simply replaces every occurrence of the compiler flag `-lpthread` with `-pthread` in the ZipHMM makefiles.

6. Now try running `make` again.  This is the main build step.

7. If everything in step 6 built okay, then run these commands:

```

bin/calibrate

make test

```

Make sure all tests pass.

8.  Finally, install the library with `make install`.  Take note of the installation directories (`lib`, `include` etc) or run: `make install > important_install_directories.txt` to save them for later reference.

# Getting started

Have a look at zipHMM/cpp_example.cpp and zipHMM/python_example.py
and try running the following commands from the root directory.

```bash
bin/cpp_example
cd zipHMM/
python python_example.py
```

# Using the C++ library

The main class in the library is Forwarder (forwarder.hpp and
forwarder.cpp). An object of this class represents an observed sequence
that has been preprocessed such that the likelihood of the sequence
can be obtained from a given HMM (pi, A, B) very fast. To build a new
Forwarder object just call the empty constructor:

```cpp
Forwarder();
```

and to read in a sequence call one of the three read_seq methods:

```cpp
void Forwarder::read_seq(const std::string &seq_filename, const size_t alphabet_size,
                         std::vector<size_t> nStatesSave, const size_t min_no_eval = 1);

void Forwarder::read_seq(const std::string &seq_filename, const size_t alphabet_size,
                         const size_t no_states, const size_t min_no_eval);

void Forwarder::read_seq(const std::string &seq_filename, const size_t alphabet_size,
                         const size_t min_no_eval = 1);
```

Here seq_filename is the name of the file containing the observed
sequence, alphabet_size is the size of the alphabet used in the
observed sequence, nStatesSave is a vector listing the sizes of the
HMMs the user intends to run the Forwarder object on, and min_no_eval
is a guess of the number of times the preprocessing will be reused
(if unsure, leave it out and use the default value). If nStatesSave
contains the vector (2, 4, 8), data structures giving the fastest
evaluation of the forward algorithm for each of the HMM state space
sizes 2, 4, and 8 will be built. If nStatesSave is left empty, a
single data structure giving a very fast evaluation of the forward
algorithm for all HMM state space sizes will be saved.

The second method is a convenient way to call the first method with
only one HMM size in nStatesSave.

The third method is a convenient way to call the first method with an
empty nStatesSave vector.

After building a Forwarder object, it can be saved to disk using the method

```c

void write_to_directory(const std::string &directory) const;

```

Here directory should contain the path (relative to the root of the

library) of the intended location of the datastructure.

To read a previously saved datastructure, one of the following two methods 

can be used:

```c

void Forwarder::read_from_directory(const std::string &dirname);

void Forwarder::read_from_directory(const std::string &directory, const size_t no_states);

```

Using the first one, the entire data structure is rebuilt. Using the
second one, only the data structure matching no_states is rebuilt,
which will be faster in many cases. If you did not save the data
structure for the size of your HMM, then use the first method. The
forward algorithm will figure out which of the saved data structures
is best suited to your HMM.

Finally, to get the loglikelihood of the observed sequence in a
specific model, one of the following methods is used:

```c

double Forwarder::forward(const Matrix &pi, const Matrix &A, const Matrix &B) const;

double Forwarder::pthread_forward(const Matrix &pi, const Matrix &A, const Matrix &B, 

                                  const std::string &device_filename = DEFAULT_DEVICE_FILENAME) const;

```

The second method is a parallelized version of the forward algorithm,
whereas the first one is single-threaded. pi, A and B specify the
HMM parameters. They can either be read from a file or built in C++ as
described below in the section 'Encoding HMMs'. The parallelized version
takes an additional filename as parameter. This filename should be the
path to the file created by the calibrate program, which finds the
optimal number of threads to use in the parallelized forward
algorithm. The default filename is ~/.ziphmm.devices. If you did not
move the file, then leave the device_filename parameter out.

See zipHMM/cpp_example.cpp for a simple example.

# Using the Python library

To use the Python library in another project, copy zipHMM/pyZipHMM.py

and zipHMM/libpyZipHMM.so to the root of your project folder after

building the library and import pyZipHMM. See zipHMM/python_example.py

and zipHMM/python_test.py for details on how to use the library.

A Forwarder object can be constructed from an observed sequence in the
following way:

```python

from pyZipHMM import *

f = Forwarder.fromSequence(seqFilename = "example.seq", alphabetSize = 3, minNoEvals = 500)

```

To save the datastructure to disk do as follows:

```python

f.writeToDirectory("example_preprocessed")

```

To read a previously saved datastructure from disk use either of the

two methods:

```python

f2 = Forwarder.fromDirectory(directory = "../example_out")

f2 = Forwarder.fromDirectory(directory = "../example_out", nStates = 3)

```

Finally, to evaluate the loglikelihood of the sequence in a given model

(matrices pi, A and B) use either of

```python

loglikelihood = f.forward(pi, A, B)

loglikelihood = f.ptforward(pi, A, B)

```

where the second method is parallelized. The three matrices pi, A and B
can be read from a file or built in Python as described below.

See zipHMM/python_example.py for an example.
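
For intuition about the number that `forward` returns, the following is a minimal pure-Python sketch of the textbook forward algorithm with per-step scaling. It is not zipHMM's implementation (which works on a preprocessed, compressed sequence) and it does not use pyZipHMM at all. The function name `forward_loglikelihood` and the plain list-of-lists representation are assumptions made for illustration; the matrix values are the ones from the HMM example in this README.

```python
import math

def forward_loglikelihood(pi, A, B, seq):
    """Return log P(seq | pi, A, B) using the scaled forward recursion.

    pi is a list of initial probabilities, A[i][j] a transition
    probability and B[i][o] an emission probability.
    Raises ValueError (from math.log) if the sequence has probability 0.
    """
    n = len(pi)
    # Initialisation: alpha_0(i) = pi_i * B[i][o_0]
    alpha = [pi[i] * B[i][seq[0]] for i in range(n)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for o in seq[1:]:
        # Recursion: alpha_t(j) = B[j][o_t] * sum_i alpha_{t-1}(i) * A[i][j]
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
        # Rescale so alpha sums to 1 and accumulate the log of the scale.
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

pi = [0.1, 0.2, 0.7]
A = [[0.1, 0.2, 0.7], [0.3, 0.4, 0.3], [0.5, 0.5, 0.0]]
B = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.4, 0.1], [0.3, 0.4, 0.1, 0.2]]
seq = [0, 1, 2, 3, 3, 2, 1, 0]
print(forward_loglikelihood(pi, A, B, seq))
```

The scaling keeps the intermediate values away from floating-point underflow on long sequences, which is why the result is reported as a loglikelihood rather than a raw probability.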

# Encoding HMMs

An HMM consists of three matrices: 

   - pi, containing initial state probabilities: pi_i is the
     probability of the model starting in state i.

   - A, containing transition probabilities: A_{ij} is the
     probability of the transition from state i to state j.

   - B, containing emission probabilities: B_{io} is the probability
     of state i emitting symbol o.

These three matrices can either be built in the code (in C++ or
Python) or they can be encoded in a text file. The format of the text
file is as follows:

## HMM example

---

```
no_states
3
alphabet_size
4
pi
0.1
0.2
0.7
A
0.1 0.2 0.7
0.3 0.4 0.3
0.5 0.5 0.0
B
0.1 0.2 0.3 0.4
0.2 0.3 0.4 0.1
0.3 0.4 0.1 0.2
```

---
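
To make the layout of the file concrete, here is a small pure-Python sketch that parses the format shown above into plain lists and checks that every row is a probability distribution. It is independent of the library's own reader; the names `parse_hmm` and `check_stochastic` are hypothetical, and the parser assumes that whitespace and line breaks are interchangeable, which the library's reader may not.

```python
def parse_hmm(text):
    """Parse the HMM text format into (pi, A, B) as lists of floats."""
    tokens = iter(text.split())

    def expect(keyword):
        # Each section starts with a fixed keyword.
        token = next(tokens)
        if token != keyword:
            raise ValueError(f"expected {keyword!r}, got {token!r}")

    def floats(count):
        return [float(next(tokens)) for _ in range(count)]

    expect("no_states")
    no_states = int(next(tokens))
    expect("alphabet_size")
    alphabet_size = int(next(tokens))
    expect("pi")
    pi = floats(no_states)                                # one entry per state
    expect("A")
    A = [floats(no_states) for _ in range(no_states)]     # no_states x no_states
    expect("B")
    B = [floats(alphabet_size) for _ in range(no_states)] # no_states x alphabet_size
    return pi, A, B

def check_stochastic(pi, A, B, tol=1e-9):
    """True if pi and every row of A and B sums to 1 within tol."""
    return all(abs(sum(row) - 1.0) <= tol for row in [pi] + A + B)

example = """no_states
3
alphabet_size
4
pi
0.1
0.2
0.7
A
0.1 0.2 0.7
0.3 0.4 0.3
0.5 0.5 0.0
B
0.1 0.2 0.3 0.4
0.2 0.3 0.4 0.1
0.3 0.4 0.1 0.2
"""
pi, A, B = parse_hmm(example)
print(len(pi), len(A), len(B[0]), check_stochastic(pi, A, B))
```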

To read and write HMMs from and to files in C++, use the methods

```c

void read_HMM(Matrix &resultInitProbs, Matrix &resultTransProbs, Matrix &resultEmProbs, const std::string &filename);

void write_HMM(const Matrix &initProbs, const Matrix &transProbs, const Matrix &emProbs, const std::string &filename);

```

To read and write HMMs from and to files in Python, use the functions

```python

readHMM(filename) -> (pi, A, B)

writeHMM(pi, A, B, filename) -> None

```

To build a matrix in C++ do as illustrated in the following example:

## C++ example

```cpp
#include "matrix.hpp"

size_t nRows = 3;
size_t nCols = 4;

// Allocate an nRows x nCols matrix and fill it entry by entry.
zipHMM::Matrix m(nRows, nCols);
m(0,0) = 0.1;
m(0,1) = 0.2;
// ...
m(2,3) = 0.2;
```

To build a matrix in Python do as illustrated here:

## Python example

```python

import pyZipHMM

nRows = 3

nCols = 4

m = pyZipHMM.Matrix(nRows, nCols)

m[0,0] = 0.1

m[0,1] = 0.2

...

m[2,3] = 0.2

```

# Encoding sequences

The alphabet of observables is encoded using integers. Thus if the
size of the alphabet is M, the observables are encoded using 0, 1, 2,
..., M - 1. A sequence of observables is encoded in a text file with
the single observations separated by whitespace. See example.seq
for an example.
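
As an illustration of this encoding, the sketch below maps symbols to integers, produces the whitespace-separated text that would go into a sequence file, and validates a sequence against an alphabet size when reading it back. The helper names and the DNA alphabet are hypothetical and not part of the library.

```python
def encode_sequence(observations):
    """Join integer observations into whitespace-separated text."""
    return " ".join(str(o) for o in observations)

def decode_sequence(text, alphabet_size):
    """Parse whitespace-separated integers and check they lie in 0..M-1."""
    seq = [int(token) for token in text.split()]
    for o in seq:
        if not 0 <= o < alphabet_size:
            raise ValueError(f"observation {o} outside 0..{alphabet_size - 1}")
    return seq

# Hypothetical alphabet: four DNA letters mapped to 0, 1, 2, 3.
symbols = "ACGT"
to_int = {s: i for i, s in enumerate(symbols)}
encoded = encode_sequence(to_int[s] for s in "GATTACA")
print(encoded)
print(decode_sequence(encoded, alphabet_size=len(symbols)))
```

Writing `encoded` to a text file gives a file in the layout described above, to be used with a matching alphabet size (4 in this sketch).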

# Executables

## calibrate

Usage: `bin/calibrate`

Finds the optimal number of threads to use in the parallelized version

of the forward algorithm.

## build_forwarder

Usage: `bin/build_forwarder -s <sequence file> -M <alphabet size> -o <output directory> [-N <number of states>]*`

Builds a Forwarder object from the sequence in the file given with -s
and writes it to the directory given with -o. The value given with -M
should be the size of the alphabet used in the observed sequence, and
the sequence file should contain a single line of whitespace-separated
integers between 0 and the alphabet size - 1. The list of HMM sizes to
generate the data structure for can be specified using the -N parameter.

Examples:

```bash

bin/build_forwarder -s example.seq -M 3 -o example_out

bin/build_forwarder -s example.seq -M 3 -o example_out -N 2

bin/build_forwarder -s example.seq -M 3 -o example_out -N 2 -N 4 -N 8 -N 16

```

## forward

Usage: `bin/forward (-s <sequence file> -m <HMM file> [-e #expected forward calls] [-o <output directory>]) | (-d <preprocessing directory> -m <HMM file>) [-p]`

Runs the forward algorithm and outputs the loglikelihood. This
executable can be called in two different ways:

```bash

bin/forward -s example.seq -m example.hmm -e 500 -o example_out

bin/forward -d example_out/ -m example.hmm

```

In the first example the loglikelihood is evaluated based on the
observed sequence in example.seq and the HMM specified in
example.hmm. In the second example the loglikelihood is evaluated
based on the previously saved data structure in example_out/ and the
HMM specified in example.hmm. In both cases the -p parameter selects
the parallelized version. In the first example the user can
optionally choose to save the data structure in e.g. example_out/
using the -o parameter:

```bash

bin/forward -s example.seq -m example.hmm -e 500 -o example_out/

```

## generate_hmm

Usage: `bin/generate_hmm <number of states> <number of observables> <HMM file>`

Generates a random HMM with the given number of states and observables, and saves it to the given HMM file.
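
What "a random HMM" could look like is sketched below in pure Python: every row of pi, A and B is drawn as uniform random weights and normalized to sum to 1, then written in the text format from the 'Encoding HMMs' section. The sampling scheme and the names `random_hmm` and `format_hmm` are assumptions for illustration; the distribution used by the actual executable is not documented here.

```python
import random

def random_distribution(size, rng):
    """Draw positive random weights and normalize them to sum to 1."""
    weights = [1.0 - rng.random() for _ in range(size)]  # values in (0, 1]
    total = sum(weights)
    return [w / total for w in weights]

def random_hmm(no_states, alphabet_size, seed=None):
    """Return (pi, A, B) with every row a probability distribution."""
    rng = random.Random(seed)
    pi = random_distribution(no_states, rng)
    A = [random_distribution(no_states, rng) for _ in range(no_states)]
    B = [random_distribution(alphabet_size, rng) for _ in range(no_states)]
    return pi, A, B

def format_hmm(pi, A, B):
    """Render (pi, A, B) in the HMM text format described in this README."""
    lines = ["no_states", str(len(pi)), "alphabet_size", str(len(B[0])), "pi"]
    lines += [repr(p) for p in pi]
    lines.append("A")
    lines += [" ".join(repr(x) for x in row) for row in A]
    lines.append("B")
    lines += [" ".join(repr(x) for x in row) for row in B]
    return "\n".join(lines) + "\n"

hmm_pi, hmm_A, hmm_B = random_hmm(3, 4, seed=1)
print(format_hmm(hmm_pi, hmm_A, hmm_B))
```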

## generate_seq

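Usage: `bin/generate_seq <HMM file> <length> <observed sequence file> <state sequence file>`

Given an HMM specified in the HMM file, runs the HMM for the given
number of iterations and saves the resulting sequence of observables
to the observed sequence file and the resulting sequence of hidden
states to the state sequence file.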
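
Running an HMM for a number of iterations can be sketched in pure Python as follows: draw the first state from pi, then repeatedly emit a symbol from the current state's row of B and move to the next state using its row of A. The function name `sample_hmm` is hypothetical, and this is an illustration of the idea rather than the executable's code; the matrices are those of the HMM example above.

```python
import random

def sample_hmm(pi, A, B, length, seed=None):
    """Return (observations, states), each a list of `length` integers."""
    rng = random.Random(seed)
    observations, states = [], []
    # Initial hidden state drawn from pi.
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        # Emit a symbol according to row `state` of B.
        symbol = rng.choices(range(len(B[state])), weights=B[state])[0]
        observations.append(symbol)
        # Move to the next hidden state according to row `state` of A.
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return observations, states

pi = [0.1, 0.2, 0.7]
A = [[0.1, 0.2, 0.7], [0.3, 0.4, 0.3], [0.5, 0.5, 0.0]]
B = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.4, 0.1], [0.3, 0.4, 0.1, 0.2]]
observations, states = sample_hmm(pi, A, B, length=20, seed=42)
print(" ".join(map(str, observations)))
print(" ".join(map(str, states)))
```

The two printed lines use the whitespace-separated integer encoding described in the 'Encoding sequences' section.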

# Contact

If you encounter any problems or have questions about using this

software, please contact 

	  Thomas Mailund : [email protected].