Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/arshajii/ema

Fast & accurate alignment of barcoded short-reads
https://github.com/arshajii/ema

bioinformatics sequence-alignment

Last synced: about 2 months ago
JSON representation

Fast & accurate alignment of barcoded short-reads

Lists

README

        

EMA: An aligner for barcoded short-read sequencing data
=======================================================
![Build Status](https://github.com/arshajii/ema/actions/workflows/ci.yml/badge.svg) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/arshajii/ema/master/LICENSE) [![Mentioned in Awesome 10x Genomics](https://awesome.re/mentioned-badge.svg)](https://github.com/johandahlberg/awesome-10x-genomics)

EMA uses a latent variable model to align barcoded short-reads (such as those produced by [10x Genomics](https://www.10xgenomics.com)' sequencing platform). More information is available in [our paper](https://www.biorxiv.org/content/early/2017/11/16/220236). The full experimental setup is available [here](https://github.com/arshajii/ema-paper-data/blob/master/experiments.ipynb).

### Install
#### With `brew` 🍺

```bash
brew install brewsci/bio/ema
```

#### With `conda` 🐍

```bash
conda install -c bioconda ema
```

#### From source 🛠️

```bash
git clone --recursive https://github.com/arshajii/ema
cd ema
make
```

(The `--recursive` flag is needed because EMA uses BWA's C API.)

### Usage
```
usage: ema [options]

count: perform preliminary barcode count (takes interleaved FASTQ via stdin)
-w : specify barcode whitelist [required]
-o : specify output prefix [required]
-p: using haplotag barcodes

preproc: preprocess barcoded FASTQ files (takes interleaved FASTQ via stdin)
-w : specify whitelist [required]
-n : number of barcode buckets to make [500]
-h: apply Hamming-2 correction [off]
-o: specify output directory [required]
-b: output BX:Z-formatted FASTQs [off]
-p: using haplotag barcodes
-t : set number of threads [1]
all other arguments: list of all output prefixes generated by count stage

align: choose best alignments based on barcodes
-1 : first (preprocessed and sorted) FASTQ file [none]
-2 : second (preprocessed and sorted) FASTQ file [none]
-s : specify special FASTQ path [none]
-x: multi-input mode; takes input files after flags and spawns a thread for each [off]
-r : indexed reference [required]
-o : output SAM file [stdout]
-R : full read group string (e.g. '@RG\tID:foo\tSM:bar') [none]
-d: apply fragment read density optimization [off]
-p : sequencing platform (one of '10x', 'tru', 'cpt', 'haplotag', 'dbs', 'tellseq') [10x]
-i : index to follow 'BX' tag in SAM output [1]
-t : set number of threads [1]
all other arguments (only for -x): list of all preprocessed inputs

help: print this help message
```

See the [Other Sequencing Platforms](#other-sequencing-platforms) below for more information
about the implementation details of different linked-read sequencing technologies.

### Input formats
EMA has several input modes:
- `-s `: Input file is a single preprocessed "special" FASTQ generated by the preprocessing steps below.
- `-x`: Input files are listed after flags (as in `ema align -a -b -c ... `). Each of these inputs are processed and all results are written to the SAM file specified with `-o`.
- `-1 `/`-2 `: Input files are standard FASTQs. For interleaved FASTQs, `-2` can be omitted. The only restrictions in this input mode are that read identifiers must end in `:` and that the FASTQs must be sorted by barcode. For 10x data, the above two modes are preferred.

### Parallelism
Multithreading can be enabled with `-t `. The actual threading mode is dependent on how the input is being read, however:
- `-s`, `-1`/`-2`: Multiple threads are spawned to work on the single input file (or pair of input files).
- `-x`: Threads work on the input files individually.

(Note that, because of this, it never makes sense to spawn more threads than there are input files when using `-x`.)

### End-to-end workflow (10x)
In this guide, we use the following additional tools:
- [pigz](https://github.com/madler/pigz)
- [sambamba](http://lomereiter.github.io/sambamba/)
- [samtools](https://github.com/samtools/samtools)
- [GNU Parallel](https://www.gnu.org/software/parallel/)

We also use a 10x barcode whitelist, which can be found [here](http://cb.csail.mit.edu/cb/ema/data/4M-with-alts-february-2016.txt).

#### Preprocessing
Preprocessing 10x data entails several steps, the first of which is counting barcodes (`-j` specifies the number of jobs to be spawned by `parallel`):

```bash
cd /path/to/gzipped_fastqs/
parallel -j40 --bar 'pigz -c -d {} | \
ema count -w /path/to/whitelist.txt -o {/.} 2>{/.}.log' ::: *RA*.gz
```

Make sure that the FASTQs **are interleaved** and **only contain the actual reads** in the files above (as opposed to sample indices, typically with `I1` in their filenames rather than `RA`). This will produce `*.ema-ncnt` and `*.ema-fcnt` files, containing the count data.

If you do not have interleaved files, you can interleave them as follows:

```bash
parallel -j40 --bar 'paste <(pigz -c -d {} | paste - - - -) <(pigz -c -d {= s:_R1_:_R2_: =} | paste - - - -) | tr "\t" "\n" |\
ema count -w /path/to/whitelist.txt -o {/.} 2>{/.}.log' ::: *_R1_*.gz
```

where `s:_R1_:_R2_:` is the regex that casts first-end filenames into the second-end filenames (make sure to adjust this if your naming scheme is different).

Now we can do the actual preprocessing, which splits the input into barcode bins (500 by default; specified with `-n`). This preprocessing can be parallelized via `-t`, which specifies how many threads to use:

```bash
pigz -c -d *RA*.gz | ema preproc -w /path/to/whitelist.txt -n 500 -t 40 -o output_dir *.ema-ncnt 2>&1 | tee preproc.log
```

or if you do not have interleaved files:

```bash
paste <(pigz -c -d *_R1_*.gz | paste - - - -) <(pigz -c -d *_R2_*.gz | paste - - - -) | tr "\t" "\n" |\
ema preproc -w /path/to/whitelist.txt -n 500 -t 40 -o output_dir *.ema-ncnt 2>&1 | tee preproc.log
```

#### Mapping
First we map each barcode bin with EMA. Here, we'll do this using a combination of GNU Parallel and EMA's internal multithreading, which we found to be optimal due to the runtime/memory trade-off. In the following, for instance, we use 10 jobs each with 4 threads (for 40 total threads). We also pipe EMA's SAM output (stdout by default) to `samtools sort`, which produces a sorted BAM:

```bash
parallel --bar -j10 "ema align -t 4 -d -r /path/to/ref.fa -s {} |\
samtools sort -@ 4 -O bam -l 0 -m 4G -o {}.bam -" ::: output_dir/ema-bin-???
```

Lastly, we map the no-barcode bin with BWA:

```bash
bwa mem -p -t 40 -M -R "@RG\tID:rg1\tSM:sample1" /path/to/ref.fa output_dir/ema-nobc |\
samtools sort -@ 4 -O bam -l 0 -m 4G -o output_dir/ema-nobc.bam
```

Note that `@RG\tID:rg1\tSM:sample1` is EMA's default read group. If you specify another for EMA, be sure to specify the same for BWA as well (both tools take the full read group string via `-R`).

#### Postprocessing
EMA performs duplicate marking automatically. We mark duplicates on BWA's output with `sambamba markdup`:

```bash
sambamba markdup -t 40 -p -l 0 output_dir/ema-nobc.bam output_dir/ema-nobc-dupsmarked.bam
rm output_dir/ema-nobc.bam
```

Now we merge all BAMs into a single BAM (might require modifying `ulimit`s, as in `ulimit -n 10000`):

```bash
sambamba merge -t 40 -p ema_final.bam output_dir/*.bam
```

Now you should have a single, sorted, duplicate-marked BAM `ema_final.bam`.

### Other sequencing platforms
EMA can also be run using data from other linked-read or sequencing platforms than 10x Genomics. Other platforms are selected using the flag `-p `. Available platforms and their flags specifications are:

- [Haplotagging](#haplotagging): `haplotag`
- [TELL-seq](#tell-seq): `tellseq`
- [Droplet Barcode Sequencing (DBS)](#dbs): `dbs`
- [CPT-seq](#cpt-seqtruseq-slr): `cpt`
- [TruSeq Synthetic Long Reads (SLR)](#cpt-seqtruseq-slr): `tru`

For preprocessing with subcommands `count` and `preproc`, only 10x Genomics and Haplotagging reads are enabled a the moment.

#### Haplotagging

The haplotagging method for generating long reads was presented in [Meier et al. 2021 PNAS](https://doi.org/10.1073/pnas.2015005118). The platform uses a 16 bp barcode. If using haplotagging data, where barcodes are coded in the read headers as `BX:Z:AxxCxxBxxDxx`, you *do not* need to provide
a barcode whitelist for the `count` or `proproc` steps.

#### TELL-Seq

The TELL-seq linked-read platform is commercially available from [Universal Sequencing](https://www.universalsequencing.com/) and was presented in [Chen et al. 2020 GenomeRes](https://doi.org/10.1101/gr.260380.119). The platform uses a 18 bp semi-degenerate barcode. The FASTQs can for example be preprocess using the Universal Sequencing TELL-Read pipeline to generate barcodes tagged FASTQs as below where the barcode is added after the read name as below

```
@A00741:47:HCM53DRXX:1:1101:18159:7326:TTATTTAATCTTAGTCGT 1:N:0:1
TTATTTAATCTTAGTCGTCCTGGCTAATTTTTTTGTATTTTTATTAGATACGGGATTTCTCCATGTTGGCTTGGCGGGTCTCAAACTCTTGACCTTAGGTGATCTGCCTGCCTCAGCCTCCCAAAGTGCTGGGATTACCGGCGTGAGCCACCGCACCCAGCCTA
+
,FFFFFFFFF:FFF::FF,FFFFFFFFFF,FFF:F,F::,FFFF,F,F,FF,,FFFFFFFFFF,,FF:F:FF,F:F,,FFFFFFFF:FFFFFFFF:F,:FFFFFF:FFF:FF,FFFFFFFFFFFF:,::F,FFFF:FFFF,FF,FFFFFF:,,FFFFFFFFFFF
```

Note that these FASTQs need to be sorted by barcode before using `ema align`.

EMA also supports TELL-seq data provided in the `longranger basic` format, e.g. BX tagged FASTQs as below

```
@A00741:47:HCM53DRXX:1:1101:18159:7326 BX:Z:TTATTTAATCTTAGTCGT
TTATTTAATCTTAGTCGTCCTGGCTAATTTTTTTGTATTTTTATTAGATACGGGATTTCTCCATGTTGGCTTGGCGGGTCTCAAACTCTTGACCTTAGGTGATCTGCCTGCCTCAGCCTCCCAAAGTGCTGGGATTACCGGCGTGAGCCACCGCACCCAGCCTA
+
,FFFFFFFFF:FFF::FF,FFFFFFFFFF,FFF:F,F::,FFFF,F,F,FF,,FFFFFFFFFF,,FF:F:FF,F:F,,FFFFFFFF:FFFFFFFF:F,:FFFFFF:FFF:FF,FFFFFFFFFFFF:,::F,FFFF:FFFF,FF,FFFFFF:,,FFFFFFFFFFF
```

#### DBS

EMA can run using linked reads generated using the method presented in [Redin et al. 2019 SciRep](https://doi.org/10.1038/s41598-019-54446-x), commonly referred to as Droplet Barcode Sequencing (DBS). For running `ema align` with DBS linked-read the FASTQs must have the 20 base barcode present in the read header, similar to output from `longranger basic`. Here is an example FASTQ entry with the barcode `CTTGGTCATTCATACAGTCC`.

```
@A00621:130:HN5HWDSXX:4:1103:8639:28573 BX:Z:CTTGGTCATTCATACAGTCC
CAGTGGGAGCCCTGACCTTGTTTTTCTGTAAGTAGACGGTCCATCTAGGGGTGATGGGAGAAAGTGACAGATCATCAGGCATTGGATTCTCCTAAGGAGGGTGCAATGTAGATCCCTCGCGTGCAGAACTCAATGTAGGGTTCATGCTCCC
+
F,FF,FFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF:FFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFF:FF::FFF,FFFFFFF,FFFFFFFFFF,FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFF:F:FFFFFFFF
```

#### CPT-seq/TruSeq SLR

Instructions for preprocessing and running EMA on data from CPT-seq and TruSeq Synthetic Long Reads can be found [here](https://github.com/arshajii/ema-paper-data/blob/master/experiments.ipynb).

### Output
EMA outputs a standard SAM file with several additional tags:

- `XG`: alignment probability
- `MI`: cloud identifier (compatible with Long Ranger)
- `XA`: alternate high-probability alignments