https://github.com/kehrlab/bcctools

Correcting barcodes in 10X linked-read sequencing data.

bcctools
=======

A toolbox for correcting barcodes in 10X linked-read sequencing data.

Prerequisites
-------------

* GCC version >= 4.9 (supports C++14)
* SeqAn core library, version 2.3.1? (https://github.com/seqan/seqan)
* SDSL - Succinct Data Structure Library (https://github.com/simongog/sdsl-lite)
* kseq.h from HTSlib (https://github.com/samtools/htslib)

Installation
------------

1. Download the SeqAn core library. You do not need to follow the SeqAn install instructions. You only need the directory .../include/seqan with all its content (the SeqAn core library).
2. Download and install the SDSL.
3. Download HTSlib or just put the kseq.h header file into a folder named htslib.
4. Edit lines 14-17 in the Makefile to point to the directories of SeqAn, SDSL and HTSlib.
5. Run 'make' in the bcctools directory.

If everything is set up correctly, this will create the binary 'bcctools'.

Usage
-----

The only input needed for barcode correction is a pair of barcoded FASTQ files generated on the 10X Chromium platform.
Optionally, you can specify a barcode whitelist file.

The program consists of several commands, which are listed when running

    ./bcctools --help

For a short description of each command and an overview of arguments and options, you can run

    ./bcctools COMMAND --help

If you need the output to be sorted and/or converted to SAM, BAM, or (gzipped) FASTQ format, you can run the provided bash script. For a short description of options and arguments of this script run

    ./scripts/run_bcctools -h

### The whitelist command

    ./bcctools whitelist [OPTIONS]

Creates a barcode whitelist based on barcode occurrence in the data.
Creating a whitelist from your own data, rather than using the 10X whitelist, is recommended: it reduces the number of alternatives during correction and prevents false corrections.
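To make the idea of an occurrence-based whitelist concrete, here is a minimal Python sketch. It is not the algorithm bcctools implements: the cutoff `min_count` and the function names are hypothetical, and the only detail taken from this README is that the barcode is the first 16 bases of the first read.

```python
from collections import Counter

BARCODE_LENGTH = 16  # the barcode is the first 16 bases of the first read


def count_barcodes(first_reads):
    """Count how often each 16-base prefix occurs among the first reads."""
    return Counter(
        read[:BARCODE_LENGTH]
        for read in first_reads
        if len(read) >= BARCODE_LENGTH
    )


def build_whitelist(first_reads, min_count=2):
    """Keep barcodes seen at least `min_count` times (hypothetical cutoff)."""
    counts = count_barcodes(first_reads)
    return sorted(bc for bc, n in counts.items() if n >= min_count)


# Toy first-read sequences: barcode + 7-mer spacer + remaining bases.
reads = [
    "ACGTACGTACGTACGT" + "AAAAAAA" + "GATTACA",
    "ACGTACGTACGTACGT" + "CCCCCCC" + "TTGACCA",
    "TTTTCCCCGGGGAAAA" + "GGGGGGG" + "CATCATC",
]
print(build_whitelist(reads))  # ['ACGTACGTACGTACGT']
```

A whitelist derived this way only contains barcodes that actually occur in the sample, which is why it offers fewer correction alternatives than the full 10X whitelist.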

### The index command

    ./bcctools index [OPTIONS]

Creates a barcode index from the given barcode whitelist and writes it to disk. This command is optional as the index can be created on the fly in the 'correct' command.

### The correct command

    ./bcctools correct [OPTIONS]

Corrects barcodes of the given barcoded read pair data using the specified barcode whitelist. A barcode index is computed on the fly unless index files are present for the specified barcode whitelist. The output is a tab-separated file holding one read pair per line as described below.
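The following Python sketch shows one simple way a raw barcode could be matched against a whitelist. It is an illustration only and assumes a single-substitution (Hamming distance 1) search, which may differ from what bcctools and its barcode index actually do. The function names are hypothetical; the return value follows the CORRECTED BARCODE conventions described under "Output format".

```python
def hamming_neighbors(barcode, alphabet="ACGT"):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for i, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def correct_barcode(raw, whitelist):
    """Return a CORRECTED BARCODE value for one raw barcode.

    Assumption: only single-base substitutions are considered.
    """
    if raw in whitelist:
        return raw  # whitelisted barcodes are reported unchanged
    candidates = sorted(n for n in hamming_neighbors(raw) if n in whitelist)
    # Several candidates become a comma-separated list; none yields '*'.
    return ",".join(candidates) if candidates else "*"


whitelist = {"ACGTACGTACGTACGT", "ACGTACGTACGTACGA"}
print(correct_barcode("ACGTACGTACGTACGT", whitelist))  # already whitelisted
print(correct_barcode("ACGTACGTACGTACGC", whitelist))  # two candidates
print(correct_barcode("GGGGGGGGGGGGGGGG", whitelist))  # no candidate: *
```

The second call returns two alternatives because both whitelisted barcodes are one substitution away from the raw sequence, which illustrates why a smaller, data-derived whitelist leaves fewer ambiguous corrections.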

### The stats command

    ./bcctools stats [OPTIONS]

Computes the number of read pairs with whitelisted, corrected, and unrecognized barcodes, builds a barcode occurrence histogram, and counts the quality values at corrected barcode positions.
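The three read-pair categories can be derived from the output of the correct command alone. The Python sketch below is not the stats implementation; it assumes the column order given in the "Output format" table (CORRECTED BARCODE second, RAW BARCODE third), and `classify`, `barcode_stats` and `make_line` are hypothetical names.

```python
from collections import Counter


def classify(corrected, raw):
    """Assign a read pair to one of the three categories."""
    if corrected == "*":
        return "unrecognized"  # not whitelisted, correction unsuccessful
    if corrected == raw:
        return "whitelisted"   # raw barcode was already on the whitelist
    return "corrected"


def barcode_stats(lines):
    """Count read pairs per category from tab-separated output lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        corrected, raw = fields[1], fields[2]
        counts[classify(corrected, raw)] += 1
    return counts


def make_line(name, corrected, raw):
    # Hypothetical helper: pads the remaining seven fields with placeholders.
    return "\t".join([name, corrected, raw] + ["."] * 7)


lines = [
    make_line("r1", "ACGTACGTACGTACGT", "ACGTACGTACGTACGT"),
    make_line("r2", "ACGTACGTACGTACGT", "ACGTACGTACGTACGA"),
    make_line("r3", "*", "GGGGGGGGGGGGGGGG"),
    make_line("r4", "ACGTACGTACGTACGT", "ACGTACGTACGTACGT"),
]
stats = barcode_stats(lines)
print(dict(stats))
```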

Example
-------

    mkdir bcctools_example && cd bcctools_example/
    ln -s /path/to/first.fq.gz
    ln -s /path/to/second.fq.gz

    ./bcctools whitelist -o whitelist.txt first.fq.gz
    ./bcctools correct whitelist.txt first.fq.gz second.fq.gz > corrected.tsv

Using the bash script to create a BAM file sorted by the corrected barcode sequence:

    ./script/run_bcctools -f bam first.fq.gz second.fq.gz
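To show what "sorted by the corrected barcode sequence" means for the tab-separated output, here is a small Python sketch. It is not how the bash script is implemented; it only assumes that CORRECTED BARCODE is the second column, and both function names are hypothetical.

```python
def sort_by_corrected_barcode(lines):
    """Return the TSV lines ordered by their CORRECTED BARCODE field (column 2)."""
    return sorted(lines, key=lambda line: line.split("\t")[1])


def make_line(name, corrected):
    # Hypothetical helper: pads the remaining eight fields with placeholders.
    return "\t".join([name, corrected] + ["."] * 8)


lines = [
    make_line("r1", "TTTTCCCCGGGGAAAA"),
    make_line("r2", "*"),
    make_line("r3", "ACGTACGTACGTACGT"),
]
ordered = sort_by_corrected_barcode(lines)
print([line.split("\t")[0] for line in ordered])  # ['r2', 'r3', 'r1']
```

With plain string ordering, read pairs whose barcode could not be corrected (`*`) end up before all recognized barcodes.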

Output format
-------------

The output format of the correct command is a simple tab-separated format, where each read pair and its barcode information is given on a single line.
The fields are as follows:

Field | Description
--- | ---
READ NAME | The read or query name taken from the FASTQ file and cropped at the first whitespace.
CORRECTED BARCODE | A comma-separated list of possible barcode corrections. If the raw barcode is whitelisted, the value of this field is identical to the RAW BARCODE field. An asterisk '*' indicates that the barcode is not whitelisted and correction was unsuccessful.
RAW BARCODE | The first 16 base pairs of the first read in the read pair.
7-MER SPACER | The seven base pairs following the first 16 base pairs of the first read in the read pair.
TRIMMED FIRST READ | The remaining base pairs of the first read in the read pair after trimming the barcode and 7-mer spacer sequence.
SECOND READ | The second read sequence.
BARCODE QUALITY STRING | The first 16 values of the quality string of the first read in the read pair.
7-MER SPACER QUALITY STRING | The seven values following the first 16 values of the quality string of the first read in the read pair.
TRIMMED FIRST READ QUALITY STRING | The remaining quality string after trimming the barcode and 7-mer spacer quality values.
SECOND READ QUALITY STRING | The quality string of the second read in the read pair.
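
As a reading aid, the following Python sketch parses one output line into named fields. It assumes the columns appear in the order of the table above; the class and attribute names are hypothetical and not part of bcctools.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    """One line of the output, with fields in the order of the table."""
    read_name: str
    corrected_barcode: str
    raw_barcode: str
    spacer: str
    trimmed_first_read: str
    second_read: str
    barcode_qual: str
    spacer_qual: str
    trimmed_first_read_qual: str
    second_read_qual: str


def parse_line(line):
    """Split a tab-separated line into a ReadPair, checking the field count."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(ReadPair._fields):
        raise ValueError(
            f"expected {len(ReadPair._fields)} fields, got {len(fields)}"
        )
    return ReadPair(*fields)


def original_first_read(pair):
    """Rebuild the untrimmed first read: barcode + 7-mer spacer + remainder."""
    return pair.raw_barcode + pair.spacer + pair.trimmed_first_read


# Toy record with made-up sequences and quality values.
line = "\t".join([
    "read1", "ACGTACGTACGTACGT", "ACGTACGTACGTACGA", "AAAAAAA",
    "GATTACA", "TTGACCA", "I" * 16, "I" * 7, "I" * 7, "I" * 7,
]) + "\n"
pair = parse_line(line)
print(pair.corrected_barcode.split(","))  # list of candidate corrections
print(original_first_read(pair))
```

Splitting `corrected_barcode` on commas gives the individual candidates when more than one correction is possible.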

Contact
-------

For questions and comments contact birte.kehr [at] ukr.de or create an issue.