Worker bee software for using BLAST and HMMER in Beowulf-style environments
https://github.com/ctb/zounds


zounds



'zounds' is a client-server setup for running many parallel commands
(typically BLAST) on clusters of computers. It uses XML-RPC to
coordinate between the server & clients.


The 'zounds-central' process runs on the server and serves both
configuration information and sequences to clients upon request.


The 'zounds-worker' processes must have access to the command (e.g.
'blastall' for BLAST), the search database(s) in question, and any code
you want to use for post-processing (e.g. the 'blastparser' Python
module). All of the sequences and configuration information are
supplied by the server to the zounds-worker; all of the actual source
code needs to be on the client machine where zounds-worker runs.



How does it work?


When you start 'zounds-worker', it contacts the 'zounds-central'
server and requests config info and a set of sequences. It then runs
whatever command you've specified (e.g. BLAST) on the sequences
individually, with the configured parameters. The results are then
optionally passed through some filter (e.g. parsed by blastparser) and
then pickled and returned to the server via XML-RPC. The
'zounds-central' server saves the returned value as a record in a
'BsdDbShelf', with the sequence name as the key.
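
As a rough sketch of that cycle (not the actual zounds-worker code), the
following uses modern Python 3 and stands in for the server with two plain
callables. The names 'fetch_work' and 'submit', and the choice to feed each
sequence to the command on stdin, are assumptions made for illustration.

```python
import pickle
import subprocess
import sys


def run_on_sequence(argv, fasta_text):
    """Run one command with the sequence on stdin; return (stdout, stderr)."""
    proc = subprocess.run(argv, input=fasta_text, capture_output=True, text=True)
    return proc.stdout, proc.stderr


def work_cycle(fetch_work, submit, output_filter=None):
    """One request/compute/return cycle of a hypothetical worker."""
    argv, sequences = fetch_work()           # config + a batch of sequences
    for name, fasta_text in sequences:       # one command run per sequence
        result = run_on_sequence(argv, fasta_text)
        if output_filter is not None:        # optional post-processing step
            result = output_filter(*result)
        submit(name, pickle.dumps(result))   # pickled result, keyed by name


# Stand-ins for the server: a dict as the result store, and a "command"
# that just prints how many characters it read from stdin.
store = {}


def fake_fetch():
    argv = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    return argv, [("seq1", ">seq1\nACGT\n")]


work_cycle(fake_fetch, store.__setitem__)
stdout, stderr = pickle.loads(store["seq1"])
```

With no filter given, the stored value is the (stdout, stderr) pair of the
command, keyed by sequence name.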


Because XML-RPC works via HTTP, and the clients contact the server,
the individual cluster machines need to be able to talk to the server
directly over the network. However, the server never contacts the
clients so the cluster can be hidden behind a firewall, proxy, and/or
NAT.
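
The one-directional traffic can be illustrated with Python 3's standard
'xmlrpc' modules. This is a toy sketch: the method names 'get_sequences' and
'submit_result' are hypothetical and are not taken from zounds itself. Every
call is initiated by the client; the server only ever answers.

```python
import pickle
import threading
from xmlrpc.client import Binary, ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

results = {}                                   # server-side result store
pending = [("seq1", "ACGT"), ("seq2", "GGCC")]  # sequences waiting for work


def get_sequences(n):
    """Hand out up to n (name, sequence) pairs (hypothetical method)."""
    batch = pending[:n]
    del pending[:n]
    return batch


def submit_result(name, payload):
    """Store a pickled result sent back by a client (hypothetical method)."""
    results[name] = pickle.loads(payload.data)
    return True                                # XML-RPC cannot return None by default


# Server: listens on a free local port and never opens a connection itself.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(get_sequences)
server.register_function(submit_result)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client: makes ordinary HTTP requests to the server's URL.
host, port = server.server_address
proxy = ServerProxy("http://%s:%d/" % (host, port))
for name, seq in proxy.get_sequences(10):
    fake_output = (seq.lower(), "")            # stands in for (stdout, stderr)
    proxy.submit_result(name, Binary(pickle.dumps(fake_output)))

server.shutdown()
server.server_close()
```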




Installing


You'll need to install Python.


For the moment, you need to get zounds via 'git', at



git@github.com:ctb/zounds.git

This can be done with 'git clone http://github.com/ctb/zounds.git'




Running 'zounds-central'


Briefly,



python zounds-central <config file> <config section>

See 'config.rc' for examples.


For BLAST, the only trickiness is that the 'blastdb' must be a path
accessible to the 'zounds-worker' processes, while the 'sequences' and
'store_db' must be paths on the server. This is because the sequences
are sent from the server to the client, but the actual comparison
is done on the client. The same holds true for 'hmmscandb' in hmmer3
runs: hmmscandb must be accessible on the client.
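
To make the split concrete, here is a hypothetical config section read with
Python 3's 'configparser'. Only the key names 'sequences', 'store_db' and
'blastdb' come from the text above; the section name, the paths, and the use
of 'configparser' are assumptions, so check 'config.rc' for the real layout.

```python
from configparser import ConfigParser

# Hypothetical section; the paths are placeholders.
EXAMPLE = """
[my_blast]
# read by zounds-central, so these are paths on the server
sequences = /server/data/queries.fa
store_db = /server/data/results.db
# handed to the workers, so this path must exist on every worker node
blastdb = /cluster/shared/blastdb/nr
"""

config = ConfigParser()
config.read_string(EXAMPLE)
section = config["my_blast"]

# Group the settings by the machine on which each path has to be valid.
server_side = {key: section[key] for key in ("sequences", "store_db")}
worker_side = {key: section[key] for key in ("blastdb",)}
```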




Running 'zounds-worker'


The worker process runs on one (or more than one...) node, and
requires no configuration other than a server URL:



python zounds-worker [ <server URL> ]

For example,



python zounds-worker http://localhost:5678/

connects to the server process running on 'localhost', configured to
communicate on port 5678.


'zounds-worker' takes an optional timeout parameter, given by '-t',
which specifies a time (in minutes) after which the worker process will
quit. This is useful for queue systems that penalize processes that
go over their configured time limit. So,



python zounds-worker -t 1

will exit after 1 minute, overriding any other configuration options.
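
One simple way to implement such a limit is to check a deadline between
tasks. The sketch below shows that idea in Python 3; it is an assumption
about the approach, not a copy of what zounds-worker does, and
'run_until_deadline' is a made-up name.

```python
import time


def run_until_deadline(tasks, timeout_minutes, clock=time.monotonic):
    """Run tasks in order, stopping once the time limit has passed.

    The limit is only checked between tasks, so a task that is already
    running is allowed to finish.
    """
    deadline = clock() + timeout_minutes * 60
    finished = []
    for task in tasks:
        if clock() >= deadline:
            break
        finished.append(task())
    return finished


# Demo with a fake clock so the example runs instantly: the clock reads
# 0 s at start, 30 s before the first task and 70 s before the second.
fake_times = iter([0, 30, 70])
finished = run_until_deadline(
    [lambda: "seq1 done", lambda: "seq2 done", lambda: "seq3 done"],
    timeout_minutes=1,
    clock=lambda: next(fake_times),
)
```

Checking only between tasks keeps the loop simple, at the cost of overshooting
the limit by up to one task's run time.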




An Example


In one shell, run:



python zounds-central config-dev.rc test

In another shell, run:



python zounds-worker http://localhost:5678/

Once zounds-worker finishes, use CTRL-C to kill the server.


Now run:



python dump-raw-output test-output.db

You should get individual BLAST records for each of your query sequences,
almost as if you'd simply run 'blastall' locally.




Retrieving results


Use



python -i load_db.py <output filename>

You'll now have a dictionary 'db' containing the keys (query sequences)
and values.


If you haven't specified a filter, then the values will be tuples:



(stdout, stderr)

from the BLAST output.
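
For illustration, here is what looping over such a store might look like. The
sketch uses Python 3's generic 'shelve' module on a throwaway file rather than
a real zounds output database (which is a BsdDbShelf), and the stored text is
a placeholder, not real BLAST output.

```python
import os
import shelve
import tempfile

# Build a throwaway store shaped like unfiltered output:
# sequence name -> (stdout, stderr).
path = os.path.join(tempfile.mkdtemp(), "example-output")
with shelve.open(path) as db:
    db["seq1"] = ("<raw report for seq1>", "")
    db["seq2"] = ("", "<error text for seq2>")

# Read it back and separate clean runs from runs that wrote to stderr.
clean, noisy = [], []
with shelve.open(path, flag="r") as db:
    for name in sorted(db):
        stdout, stderr = db[name]
        (noisy if stderr else clean).append(name)
```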


--


Stupid note: iterating over very large BsdDbShelf databases
is slow to start, because BsdDbShelf retrieves all of the keys at
once. You can speed things up by iterating over the keys of the raw
bsddb database and looking each one up in the shelf:



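from bsddb import btopen       # Python 2 standard library
from shelve import BsdDbShelf

_db = btopen(store_db, 'r')    # raw database; store_db is the path to the output file
db = BsdDbShelf(_db)           # shelf view that unpickles the stored values

for key in _db:                # iterate over the raw database's keys
    value = db[key]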




Using filters


Filters can be used both for parsing output and actual filtering of
results.


A 'filter' is specified in a config file as 'filter='; for example, in
config-dev.rc, section [test_filter],



filter=filters.parse_blast

makes each zounds-worker program import the module 'filters' and run
the function 'parse_blast' on the stdout and stderr of the subprocess
command; the result is then pickled and passed back to the server.


If the filter returns an empty record (None, or (), "", or whatever)
then that too is pickled and returned.
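
A setting like 'filters.parse_blast' can be resolved to a callable with
'importlib'. The sketch below (Python 3) shows one plausible way to do this;
'load_filter', 'example_filters' and 'first_line' are invented for the
example, and zounds-worker may do the lookup differently.

```python
import importlib
import sys
import types


def load_filter(dotted_name):
    """Turn 'module.function' into the function object it names."""
    module_name, func_name = dotted_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# A stand-in filter module, registered by hand so the example is
# self-contained; a real one would be a .py file on the worker.
def first_line(stdout, stderr):
    """Keep only the first line of stdout; return None if there is none."""
    lines = stdout.splitlines()
    return lines[0] if lines else None


example_filters = types.ModuleType("example_filters")
example_filters.first_line = first_line
sys.modules["example_filters"] = example_filters

output_filter = load_filter("example_filters.first_line")
kept = output_filter("hit A\nhit B\n", "")   # called with stdout and stderr
empty = output_filter("", "")                # an empty record is still a result
```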



Parsing the BLAST results with filters.parse_blast


For this filter, the blastparser and parse_blast modules must be installed.



filter=filters.parse_blast

After parsing, each record is a blastparser.BlastRecord, and you can
do something like this to get some basic results:



record = db[seq_name]
for hit in record:
    print hit.subject_name, hit.total_expect



Retrieving only a subset of BLAST results with filters.top_matches_only


Filters are useful for situations where you only need a small subset
of the information: e.g. rather than pickling and encoding a full
BLAST record, filters can reduce the full blastparser.BlastRecord to
something much smaller.


There's an example filter function in 'filters.py', function
'top_matches_only'.
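
To give a feel for what such a reduction can look like, here is a separate
sketch in Python 3. It is not the 'top_matches_only' function from
'filters.py': 'keep_best_hits' is a made-up name, and 'Hit' is a stand-in that
only mimics the 'subject_name' and 'total_expect' attributes used above.

```python
from collections import namedtuple
from heapq import nsmallest

# Stand-in for a parsed hit; real records come from blastparser.
Hit = namedtuple("Hit", "subject_name total_expect")


def keep_best_hits(record, n=3):
    """Reduce a record to (name, e-value) pairs for its n best hits.

    Lower e-values are better, so the n smallest are kept.
    """
    best = nsmallest(n, record, key=lambda hit: hit.total_expect)
    return [(hit.subject_name, hit.total_expect) for hit in best]


record = [
    Hit("subjA", 1e-5),
    Hit("subjB", 1e-50),
    Hit("subjC", 0.1),
    Hit("subjD", 1e-20),
]
summary = keep_best_hits(record, n=2)
```

A list of small tuples like this pickles to far fewer bytes than a full
parsed record, which is the point of filtering on the worker.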





How well does it scale?


I've BLASTed 200,000 sequences against the 'nr' database using 128
simultaneous workers, without any problems. In theory the disk and
network I/O should be the most time-consuming aspect of the server,
and since everything on the server side is threaded, I don't expect
there to be server-side performance issues.


On the client side there are likely to be a few performance problems:




  1. BLAST is run on each sequence individually, for simplicity's sake.
    This means the BLAST database is reloaded for every BLAST. This
    could be optimized at the expense of a bit more code complexity
    in 'zounds-worker'.

  2. The worker submits the BLAST data to the server directly, without
    starting a thread. This means that if the server or network
    is really busy, the worker may be network-bound. (This should be
    particularly easy to fix.)
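
For the second point, one possible fix is to hand finished results to a
background thread through a queue, so the next search can start while the
previous result is still being sent. This is a sketch of that idea in
Python 3, not code from zounds; 'start_submitter' is a hypothetical helper and
the 'submit' callable stands in for the XML-RPC call.

```python
import queue
import threading


def start_submitter(submit):
    """Start a thread that sends queued (name, payload) pairs via submit()."""
    outbox = queue.Queue()

    def drain():
        while True:
            item = outbox.get()
            if item is None:          # sentinel: no more results will arrive
                break
            submit(*item)

    thread = threading.Thread(target=drain, daemon=True)
    thread.start()
    return outbox, thread


# Demo: collect the "sent" results in a list instead of calling a server.
received = []
outbox, thread = start_submitter(
    lambda name, payload: received.append((name, payload))
)

for name in ("seq1", "seq2", "seq3"):
    payload = b"result for " + name.encode()   # stands in for a pickled result
    outbox.put((name, payload))                # returns at once; sending happens elsewhere

outbox.put(None)   # tell the submitter to finish
thread.join()      # wait until everything has been sent
```

A single submitter thread keeps results in the order they were produced.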



Neither of these problems prevents zounds from working, so I just
ignore 'em. You can fix them if you like. Personally I'd prefer to
keep the worker code as simple as possible, but it should be fairly
easy to hack performance improvements in if you need or want them.




Author Info


zounds was hacked together by C. Titus Brown, <titus@idyll.org>. It is
freely available under the BSD license.




Acknowledgements


Tracy Teal and Qingpeng Zhang alpha- and beta-tested zounds.




Questions?


Please contact the biology-in-python mailing list with
any questions or comments about zounds.


--


TODO:




  • use sqlite shelve instead, for storing results

  • fix/work with screed v2

  • automatically set number of comparisons/seqs in db for BLAST and HMMER

  • undone sequences flush/reset at end



Bigger plans?




  • status Web site for zounds-central



--


CTB: Woods Hole MBL, 7/2008; MSU 3/2010.