https://github.com/carlobaldassi/gaussdca.jl
Multivariate Gaussian Direct Coupling Analysis for residue contact prediction in protein families - Julia module
https://github.com/carlobaldassi/gaussdca.jl
bioinformatics direct-coupling-analysis julia predicted-contacts protein-contact-prediction
Last synced: about 1 year ago
JSON representation
Multivariate Gaussian Direct Coupling Analysis for residue contact prediction in protein families - Julia module
- Host: GitHub
- URL: https://github.com/carlobaldassi/gaussdca.jl
- Owner: carlobaldassi
- License: other
- Created: 2013-12-06T00:06:02.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2022-01-27T10:25:03.000Z (over 4 years ago)
- Last Synced: 2025-03-27T08:45:20.610Z (about 1 year ago)
- Topics: bioinformatics, direct-coupling-analysis, julia, predicted-contacts, protein-contact-prediction
- Language: Julia
- Size: 659 KB
- Stars: 22
- Watchers: 2
- Forks: 12
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: COPYING
Awesome Lists containing this project
README
Gaussian Direct Coupling Analysis for protein contacts predicion
================================================================
[![CI][CI-img]][CI-url] [![CODECOV][codecov-img]][codecov-url]
Overview
--------
This is the code which accompanies the paper ["Fast and accurate multivariate
Gaussian modeling of protein families: Predicting residue contacts and
protein-interaction partners"][paper]
by Carlo Baldassi, Marco Zamparo, Christoph Feinauer, Andrea Procaccini,
Riccardo Zecchina, Martin Weigt and Andrea Pagnani, (2014)
PLoS ONE 9(3): e92721. doi:10.1371/journal.pone.0092721
See also [this Wikipedia article][wikiDCA] for a general overview of the Direct
Coupling Analysis technique.
This code is released under the GPL version 3 (or later) license; see the
`LICENSE.md` file for details.
The code is written in [Julia][julia] and requires julia version
1.5 or later; it provides a function which reads
a multiple sequence alignment (in FASTA format) and returns a ranking of all
pairs of residue positions in the aligned amino-acid sequences.
Since version 2, most of the internal functions used to parse and manipulate
the data have been factored out into the package [DCAUtils.jl][DCAUtils].
The code in this module is essentially a wrapper around those utilities.
[paper]: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0092721
[julia]: https://www.julialang.org
[wikiDCA]: https://en.wikipedia.org/wiki/Direct_coupling_analysis
[CI-img]: https://github.com/carlobaldassi/GaussDCA.jl/actions/workflows/ci.yml/badge.svg
[CI-url]: https://github.com/carlobaldassi/GaussDCA.jl/actions/workflows/ci.yml
[codecov-img]: https://codecov.io/gh/carlobaldassi/GaussDCA.jl/branch/master/graph/badge.svg
[codecov-url]: https://codecov.io/gh/carlobaldassi/GaussDCA.jl
[DCAUtils]: https://github.com/carlobaldassi/DCAUtils.jl
Installation
------------
To install the package, enter in Pkg mode by pressing the ] key,
then in the pkg prompt enter
```
(@v1.5) pkg> add "https://github.com/carlobaldassi/GaussDCA.jl"
```
Usage
-----
To load the code, just type `using GaussDCA`.
This software provides one main function, `gDCA(filname::String, ...)`. This
function takes the name of a (possibly gzipped) FASTA file, and returns a
predicted contact ranking, in the form of a Vector of triples, each triple
containing two indices `i` and `j` (with `i` < `j`) and a score. The indices
start counting from 1, and denote pair of residue positions in the given
alignment; pairs which are separated by less than a given number of residues
(by default 5) are filtered out. The triples are sorted by score in descending
order, such that predicted contacts should come up on top.
For convenience, a utility function is also provided, `printrank(output, R)`,
which prints the result of `gDCA` either in a file or to a stream, given as
first argument. If the first argument `output` is omitted, the standard
terminal output will be used.
The `gDCA` function takes some additional, optional keyword arguments:
* `pseudocount`: the value of the pseudo-count parameter, between `0` and `1`.
the default is `0.8`, which gives good results when the
Frobenius norm score is used (see below); a good value for the
Direct Information score is `0.2`.
* `θ`: the value of the similarity threshold. By default it is `:auto`,
which means it will be automatically computed (this takes additional
time); otherwise, a real value between `0` and `1` can be given.
* `max_gap_fraction`: maximum fraction of gap symbols in a sequence; sequences
that exceed this threshold are discarded. The default
value is `0.9`.
* `score`: the scoring function to use. There are two possibilities, `:DI` for
the Direct Information, and `:frob` for the Frobenius norm. The
default is `:frob`. (Note the leading colon: this argument is passed
as a symbol).
* `min_separation`: the minimum separation between residues in the output
ranking. Must be ≥ `1`. The default
is `5`.
The code is multi-threaded: if you start julia with the `-t` option, for example
as `julia -t 8`, the computations will run in parallel on the given number of
threads.
Examples
--------
Here is a basic usage example, assuming an alignment in FASTA format is found
in the file "alignment.fasta.gz":
```
julia> using GaussDCA
julia> FNR = gDCA("alignment.fasta.gz");
julia> printrank("results_FN.txt", FNR)
```
The above uses the Frobenius norm ranking with default parameters.
This is how to get the Direct Information ranking instead:
```
julia> DIR = gDCA("alignment.fasta.gz", pseudocount = 0.2, score = :DI);
julia> printrank("results_DI.txt", DIR)
```