https://github.com/wtsi-hgi/bdist
Submit a distributed job to your LSF compute cluster
https://github.com/wtsi-hgi/bdist
Last synced: about 2 months ago
JSON representation
Submit a distributed job to your LSF compute cluster
- Host: GitHub
- URL: https://github.com/wtsi-hgi/bdist
- Owner: wtsi-hgi
- License: gpl-3.0
- Created: 2016-01-27T11:14:55.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-02-17T14:29:00.000Z (about 9 years ago)
- Last Synced: 2025-01-26T18:48:44.664Z (3 months ago)
- Language: Shell
- Size: 41 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# bdist
Submit a distributed job to your LSF compute cluster.
## Usage
bdist [-c] COMMAND [OPTIONS] (--fofn FILE | FILE ...)
bdist (-V | --version) Display the version number
bdist (-h | --help) Display the helpDistribute the homogeneous processing of a number of files across a
compute cluster.### `COMMAND`
This is the command that will be run against each file to be processed,
with any optional arguments. By default, the filename to be processed
will be appended to this; however, if it needs to occur elsewhere in the
command, it may be inserted using `{}` any number of times (in which
case, the filename *won't* be appended automatically).For example:
bdist "gzip -9 < {} > {}.gz" *
Note that commands with arguments *must* be quoted. The `-c` flag itself
is optional, but if it is omitted, then the command *must* be the first
positional argument.### Files
The files to be processed by the above command may be supplied
individually at the command line, or by using a FOFN (file of
filenames, `EOL` delimited) with the `--fofn` option.For example:
bdist foo /path/to/some/files* /another/file /{foo,bar}/*.quux
bdist foo --fofn /path/to/my.fofn
find /some/path -type f -name "*.baz" | xargs bdist foo
find . -type f -exec bdist foo +
Note that, if using `find`(1) with `-exec`, then the `+` terminator
should be used. If the `;` terminator were used, it would still work,
but would spawn multiple individual jobs, rather than making use of an
LSF job array.A FOFN should be used if there are a lot of files to process, to avoid
blowing the `$ARG_MAX` of your shell. Also note that the number of files
to process, either directly or through a FOFN, *must not* exceed the LSF
`MAX_JOB_ARRAY_SIZE` parameter, set in `lsb.params` (default 1000).### Lazy Mode
By default, the values for the command (or `-c` flag) and `--fofn` flag
are checked: commands are checked to see if they're valid commands or
shell builtins; while each file in the FOFN will be checked to see if it
actually is a file. The purpose of this is to fail early, before
submitting potentially erroneous jobs to your cluster.However, a command on your cluster's workers may not be available on the
dispatch node and a large FOFN will take time to check through, at the
expense of IO. These checks can therefore be bypassed by using their
uppercase counterparts: `-C` and `--FOFN`, respectively.### Options
You may optionally pass through some `bsub`(1) options:
-G USER_GROUP *
-J JOB_NAME
-M MEM_LIMIT
-R RESOURCE_REQ
-n MIN_CPUS[,MAX_CPUS]
-q QUEUE_NAMEThese follow the same usage pattern as `bsub`. If a `JOB_NAME` is not
specified, one will be automatically generated. The options marked with
an asterisk, above, are passed to both the job array *and* the cleanup
job.## Logging
While jobs are running, their working directory will contain the output
of each individual process, amongst other things, in
`~/.bdist/work/JOB_NAME`. Once the entire job array has finished, a
clean up job will concatenate the logs into `~/.bdist/logs/JOB_NAME.log`
and remove the working directory.## Example
The original use case for `bdist` was to `gzip`(1) a bunch of large
files without tying up the head nodes for half a day. We can now do this
simply with:bdist "gzip -9f" *.dat
`gzip`'s `-f` flag is used to avoid jobs exiting as failed if the `.gz`
file already exists.However, we can do better than this by using, for example,
[`pigz`(1)](http://zlib.net/pigz/) for multicore compression:bdist "pigz -9 -f -p 8" -n 8 -R "span[hosts=1]" *.dat
### Advanced Example
**NOTE This is currently untested! Quoting may break things...**
When using a FOFN, each line is fed to the command, presumably as a file
to process. However, in lazy mode, there's no reason for a FOFN to be a
file of filenames; it could, for example, be a file of database keys, or
JSON object member names, etc. Then the command takes each of these as
its input argument. For example:bdist "process_item <(jq \".{}\" data.json)" --FOFN json.keys
(n.b., In the JSON case, a wrapper script is available for convenience;
see below for details.)It is also possible to just use the job index, by either creating a FOFN
of sequential numbers, or -- if both index *and* key are needed -- by
referring to the LSF environment variable `$LSB_JOBINDEX`:bdist process_number --FOFN <(seq 10)
bdist "do_something --id=\$LSB_JOBINDEX -x {}" --FOFN some.keys
Note that dollar signs must be escaped, to avoid your execution shell
from expanding them; either that, or use single quotes. Better yet, it
would be wise to consolidate more complex commands into a simple shell
script, which takes the FOFN element as its argument and will have
access to all [LSF job environment variables](https://www-01.ibm.com/support/knowledgecenter/SSETD4_9.1.3/lsf_config_ref/lsf_envars_job_exec.html).## Using JSON
**NOTE This is currently untested! Quoting may break things...**
If you have a command that needs to distribute and iterate over the
top-level elements of a JSON array or object, the above "advanced usage"
paradigm has been wrapped into a convenience script, using
[`jq`(1)](https://stedolan.github.io/jq/) to process the JSON data:bdist-json JSONFILE [OPTIONS] COMMAND
Where `JSONFILE` is the JSON file you wish to iterate over, which must
be of the form of either a single array or object, and `COMMAND` is to
be executed on a compute node distributing each element from the JSON
collection. The `OPTIONS` are the same as those for `bdist`, above, and
are simply passed through.## License
Copyright (c) 2016 Genome Research Ltd.
This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.You should have received a copy of the GNU General Public License along
with this program. If not, see .LSF(R) is a registered trademark of IBM Corporation.