SAMsift
=======

.. image:: https://travis-ci.org/karel-brinda/samsift.svg?branch=master
    :target: https://travis-ci.org/karel-brinda/samsift

.. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square
    :target: https://anaconda.org/bioconda/samsift

.. image:: https://badge.fury.io/py/samsift.svg
    :target: https://badge.fury.io/py/samsift

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.1048211.svg
    :target: https://doi.org/10.5281/zenodo.1048211

SAMsift is a program for advanced filtering and tagging of SAM/BAM alignments
using Python expressions.

Getting started
---------------

.. code-block:: bash

    # clone this repo and add it to PATH
    git clone http://github.com/karel-brinda/samsift
    cd samsift
    export PATH=$(pwd)/samsift:$PATH

    # filtering: keep only alignments with score >94, save them as filtered.bam
    samsift -i tests/test.bam -o filtered.bam -f 'AS>94'
    # filtering: keep only unaligned reads
    samsift -i tests/test.bam -f 'FLAG & 0x04'
    # filtering: keep only aligned reads
    samsift -i tests/test.bam -f 'not(FLAG & 0x04)'
    # filtering: keep only sequences containing ACCAGAGGAT
    samsift -i tests/test.bam -f 'SEQ.find("ACCAGAGGAT")!=-1'
    # filtering: keep only sequences containing A and T only (defined using regular expressions)
    samsift -i tests/test.bam -f 're.match(r"^[AT]*$", SEQ)'
    # filtering: sample alignments with 25% rate
    samsift -i tests/test.bam -f 'random.random()<0.25'
    # filtering: sample alignments with 25% rate with a fixed RNG seed
    samsift -i tests/test.bam -f 'random.random()<0.25' -0 'random.seed(42)'
    # filtering: keep only alignments of reads specified in tests/qnames.txt
    samsift -i tests/test.bam -0 'q=open("tests/qnames.txt").read().splitlines()' -f 'QNAME in q'
    # filtering: keep only first 5000 reads from chr1 and 5000 reads from chr2
    samsift -i tests/test.bam -0 'c={"chr1":5000,"chr2":5000}' -f 'c[RNAME]>0' -c 'c[RNAME]-=1' -m nonstop-remove
    # tagging: add tags 'ln' with sequence length and 'ab' with average base quality
    samsift -i tests/test.bam -c 'ln=len(SEQ);ab=1.0*sum(QUALa)/ln'
    # tagging: add a tag 'ii' with the number of the current alignment
    samsift -i tests/test.bam -0 'i=0' -c 'i+=1;ii=i'
    # updating: removing sequences and base qualities
    samsift -i tests/test.bam -c 'a.query_sequence=""'
    # updating: switching all reads to unaligned
    samsift -i tests/test.bam -c 'a.flag|=0x4;a.reference_start=-1;a.cigarstring="";a.reference_id=-1;a.mapping_quality=0'

Installation
------------

**Using Bioconda:**

.. code-block:: bash

    # add all necessary Bioconda channels
    conda config --add channels defaults
    conda config --add channels conda-forge
    conda config --add channels bioconda

    # install samsift
    conda install samsift

**Using PIP from PyPI:**

.. code-block:: bash

    pip install --upgrade samsift

**Using PIP from GitHub:**

.. code-block:: bash

    pip install --upgrade git+https://github.com/karel-brinda/samsift

Command-line parameters
-----------------------

.. USAGE-BEGIN

.. code-block::

    Program: samsift (advanced filtering and tagging of SAM/BAM alignments using Python expressions)
    Version: 0.2.5
    Author: Karel Brinda

    Usage: samsift.py [-i FILE] [-o FILE] [-f PY_EXPR] [-c PY_CODE] [-m STR]
                      [-0 PY_CODE] [-d PY_EXPR] [-t PY_EXPR]

    Basic options:
      -h, --help      show this help message and exit
      -v, --version   show program's version number and exit
      -i FILE         input SAM/BAM file [-]
      -o FILE         output SAM/BAM file [-]
      -f PY_EXPR      filtering expression [True]
      -c PY_CODE      code to be executed (e.g., assigning new tags) [None]
      -m STR          mode: strict (stop on first error)
                            nonstop-keep (keep alignments causing errors)
                            nonstop-remove (remove alignments causing errors) [strict]

    Advanced options:
      -0 PY_CODE      initialization [None]
      -d PY_EXPR      debugging expression to print [None]
      -t PY_EXPR      debugging trigger [True]

.. USAGE-END

Algorithm
---------

.. code-block:: python

    exec(INITIALIZATION)
    for ALIGNMENT in ALIGNMENTS:
        if eval(DEBUG_TRIGGER):
            print(eval(DEBUG_EXPR))
        if eval(FILTER):
            exec(CODE)
            print(ALIGNMENT)

**Python expressions and code.** All expressions and code must be valid
Python 3. Expressions are evaluated using the ``eval`` function and code is
executed using the ``exec`` function.
Initialization can be used for importing Python modules, setting global
variables (e.g., counters), or loading data from disk. Some modules (namely
``random`` and ``re``) are loaded without an explicit request.
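
The loop above can be imitated in plain Python without any BAM file. The
following is a minimal sketch, not SAMsift's actual implementation: alignments
are modelled as plain dictionaries, and the function ``run_pipeline`` and its
parameters are hypothetical names chosen for illustration.

```python
import random
import re


def run_pipeline(alignments, init="", filt="True", code=""):
    """Imitate the init / filter / code loop on dict-based records (illustrative only)."""
    # Modules that are available without an explicit import.
    env = {"random": random, "re": re}
    exec(init, env)  # initialization runs once, before the first record
    kept = []
    for aln in alignments:
        env.update(aln)  # expose the record's fields as variables
        if eval(filt, env):
            exec(code, env)
            kept.append(aln["QNAME"])
    return kept


records = [
    {"QNAME": "r1", "SEQ": "ATTA"},
    {"QNAME": "r2", "SEQ": "ACGT"},
    {"QNAME": "r3", "SEQ": "TTTT"},
]

# Keep reads made of A and T only, counting them in a variable set up by init.
print(run_pipeline(records, init="n=0", filt='re.match(r"^[AT]*$", SEQ)', code="n+=1"))
# ['r1', 'r3']
```

Because the initialization, the filter, and the code share one namespace, a
counter created during initialization stays visible to every later record.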

*Example* (printing all alignments):

.. code-block:: bash

    samsift -i tests/test.bam -f 'True'

**SAM fields.** Expressions and code can access variables mirroring the fields
from the alignment section of the SAM specification, i.e., ``QNAME``, ``FLAG``,
``RNAME``, ``POS`` (1-based), ``MAPQ``, ``CIGAR``, ``RNEXT``, ``PNEXT``,
``TLEN``, ``SEQ``, and ``QUAL``. Several additional variables are defined to
simplify access to some useful information: ``QUALa`` stores the base qualities
as an integer array; ``SEQs``, ``QUALs``, ``QUALsa`` skip soft-clipped bases;
and ``RNAMEi`` and ``RNEXTi`` store the reference ids as integers.
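
As a rough illustration of what these helper variables contain, the sketch
below decodes a ``QUAL`` string into integers and trims soft-clipped bases
using the ``CIGAR`` string. It assumes the Phred+33 encoding of the SAM text
format; the function names ``qual_to_ints`` and ``strip_soft_clips`` are
hypothetical, and this is not SAMsift's own code.

```python
import re


def qual_to_ints(qual):
    """Decode a Phred+33 quality string into a list of integers."""
    return [ord(ch) - 33 for ch in qual]


def strip_soft_clips(cigar, seq):
    """Return seq without the bases covered by leading/trailing S operations."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Hard clips (H) consume no SEQ bases, so ignore them when locating S.
    ops = [(int(n), op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return seq[left:len(seq) - right]


print(qual_to_ints("II5!"))                   # [40, 40, 20, 0]
print(strip_soft_clips("2S4M1S", "NNACGTN"))  # ACGT
```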

*Example* (keeping only the alignments with leftmost position <= 10000):

.. code-block:: bash

    samsift -i tests/test.bam -f 'POS<=10000'

SAMsift internally uses the PySam library and
the representation of the current alignment (an instance of the class
``pysam.AlignedSegment``) is
available as a variable ``a``. Therefore, the previous example is equivalent to

.. code-block:: bash

    samsift -i tests/test.bam -f 'a.reference_start+1<=10000'

The ``a`` variable can also be used for modifying the current alignment record.

*Example* (removing the sequence and base qualities from every record):

.. code-block:: bash

    samsift -i tests/test.bam -c 'a.query_sequence=""'

**SAM tags.** Every SAM tag is translated to a variable with the same name.

*Example* (removing alignments with a score less than or equal to the sequence length):

.. code-block:: bash

    samsift -i tests/test.bam -f 'AS>len(SEQ)'

If ``CODE`` is provided, all two-letter variables except ``re`` (the Python regex
module) are back-translated to tags after the code execution.

*Example* (adding a tag ``ab`` carrying the average base quality):

.. code-block:: bash

    samsift -i tests/test.bam -c 'ab=1.0*sum(QUALa)/len(QUALa)'
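
One way to picture the back-translation is to scan the namespace for
two-letter names once the code has run. The sketch below does this on a plain
dictionary; the helper ``collect_tags`` is a hypothetical name, and the real
program works on PySam records rather than dictionaries.

```python
import re


def collect_tags(code, variables):
    """Run code, then return every two-letter variable except ``re`` as a tag."""
    env = {"re": re, **variables}
    exec(code, env)
    return {
        name: value
        for name, value in env.items()
        if len(name) == 2 and name != "re"
    }


# QUALa stands in for the integer base qualities of one record.
tags = collect_tags("ab=1.0*sum(QUALa)/len(QUALa)", {"QUALa": [40, 40, 20, 0], "AS": 96})
print(tags)  # {'AS': 96, 'ab': 25.0}
```

Longer names such as ``QUALa`` are left out, which is why helper variables
with more than two letters never end up as tags in this sketch.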

**Errors.** If an error occurs while an expression is evaluated or code is
executed (e.g., due to accessing an undefined tag), SAMsift's behavior depends
on the specified mode (``-m``). In the strict mode (``-m strict``, default),
SAMsift immediately interrupts the computation and
reports an error. With the ``-m nonstop-keep`` option, SAMsift continues
processing the alignments while keeping the error-causing alignments in the
output. With the ``-m nonstop-remove`` option, all error-causing alignments
are skipped and omitted from the output.
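
The three modes can be sketched with an ordinary ``try``/``except`` block
around the filter evaluation. This is a simplified model on dictionary
records, with a hypothetical function ``filter_with_mode``; it only mirrors
the behavior described above and is not taken from the SAMsift source.

```python
def filter_with_mode(records, expr, mode="strict"):
    """Apply expr to each dict record; handle failures according to mode."""
    kept = []
    for rec in records:
        try:
            keep = eval(expr, {}, dict(rec))
        except Exception:
            if mode == "strict":
                raise  # stop on the first error
            # nonstop-keep retains the record, nonstop-remove drops it
            keep = (mode == "nonstop-keep")
        if keep:
            kept.append(rec["QNAME"])
    return kept


records = [
    {"QNAME": "r1", "AS": 96},
    {"QNAME": "r2"},            # no AS tag, so 'AS>94' raises NameError
    {"QNAME": "r3", "AS": 90},
]

print(filter_with_mode(records, "AS>94", "nonstop-keep"))    # ['r1', 'r2']
print(filter_with_mode(records, "AS>94", "nonstop-remove"))  # ['r1']
```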

Similar programs
----------------

* ``samtools view`` can filter alignments based on FLAGS, read group tags, and CIGAR strings.
* ``sambamba view`` supports, in addition to the SAMtools filters, filtering using simple Perl-like expressions. However, it is not possible to use floats or compare different tags.
* BamQL provides a simple query language for filtering SAM/BAM files.
* bamPals adds tags XB, XE, XP and XL.
* SamJavascript can filter alignments using JavaScript expressions.
* Picard FilterSamReads can also filter alignments using JavaScript expressions.

Issues
------

Please use GitHub issues.

Changelog
---------

See the project's releases.

Licence
-------

MIT

Author
------

Karel Brinda