Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
https://github.com/arshajii/ema

Fast & accurate alignment of barcoded short-reads
https://github.com/arshajii/ema
bioinformatics sequence-alignment
Last synced: about 2 months ago
Host: GitHub
URL: https://github.com/arshajii/ema
Owner: arshajii
License: mit
Created: 2017-03-11T18:08:45.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2023-06-29T18:48:12.000Z (about 1 year ago)
Last Synced: 2023-06-29T19:33:51.871Z (about 1 year ago)
Topics: bioinformatics, sequence-alignment
Language: C++
Homepage: http://ema.csail.mit.edu
Size: 247 KB
Stars: 30
Watchers: 5
Forks: 9
Open Issues: 17
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists

awesome-linked-reads - EMerAld (EMA) - aware alignment of linked reads. Also does preprocessing of 10x Genomics data. (Tools)
README

EMA: An aligner for barcoded short-read sequencing data
=======================================================

![Build Status](https://github.com/arshajii/ema/actions/workflows/ci.yml/badge.svg) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/arshajii/ema/master/LICENSE) [![Mentioned in Awesome 10x Genomics](https://awesome.re/mentioned-badge.svg)](https://github.com/johandahlberg/awesome-10x-genomics)

EMA uses a latent variable model to align barcoded short-reads (such as those produced by [10x Genomics](https://www.10xgenomics.com)' sequencing platform). More information is available in [our paper](https://www.biorxiv.org/content/early/2017/11/16/220236). The full experimental setup is available [here](https://github.com/arshajii/ema-paper-data/blob/master/experiments.ipynb).

### Install

#### With `brew` 🍺

```bash
brew install brewsci/bio/ema
```

#### With `conda` 🐍

```bash
conda install -c bioconda ema
```

#### From source 🛠️

```bash
git clone --recursive https://github.com/arshajii/ema
cd ema
make
```

(The `--recursive` flag is needed because EMA uses BWA's C API.)

### Usage

```
usage: ema <count|preproc|align|help> [options]

count: perform preliminary barcode count (takes interleaved FASTQ via stdin)
  -w <whitelist path>: specify barcode whitelist [required]
  -o <output prefix>: specify output prefix [required]
  -p: using haplotag barcodes

preproc: preprocess barcoded FASTQ files (takes interleaved FASTQ via stdin)
  -w <whitelist path>: specify whitelist [required]
  -n <num buckets>: number of barcode buckets to make [500]
  -h: apply Hamming-2 correction [off]
  -o <output directory>: specify output directory [required]
  -b: output BX:Z-formatted FASTQs [off]
  -p: using haplotag barcodes
  -t <threads>: set number of threads [1]
  all other arguments: list of all output prefixes generated by count stage

align: choose best alignments based on barcodes
  -1 <FASTQ1 path>: first (preprocessed and sorted) FASTQ file [none]
  -2 <FASTQ2 path>: second (preprocessed and sorted) FASTQ file [none]
  -s <EMA-FASTQ path>: specify special FASTQ path [none]
  -x: multi-input mode; takes input files after flags and spawns a thread for each [off]
  -r <FASTA path>: indexed reference [required]
  -o <SAM file>: output SAM file [stdout]
  -R <RG string>: full read group string (e.g. '@RG\tID:foo\tSM:bar') [none]
  -d: apply fragment read density optimization [off]
  -p <platform>: sequencing platform (one of '10x', 'tru', 'cpt', 'haplotag', 'dbs', 'tellseq') [10x]
  -i <index>: index to follow 'BX' tag in SAM output [1]
  -t <threads>: set number of threads [1]
  all other arguments (only for -x): list of all preprocessed inputs

help: print this help message
```

See [Other sequencing platforms](#other-sequencing-platforms) below for more information about the implementation details of different linked-read sequencing technologies.

### Input formats

EMA has several input modes:

- `-s <input>`: Input file is a single preprocessed "special" FASTQ generated by the preprocessing steps below.

- `-x`: Input files are listed after flags (as in `ema align -a -b -c <input 1> <input 2> ... <input n>`). Each of these inputs is processed and all results are written to the SAM file specified with `-o`.

- `-1 <first mate>`/`-2 <second mate>`: Input files are standard FASTQs. For interleaved FASTQs, `-2` can be omitted. The only restrictions in this input mode are that read identifiers must end in `:<barcode sequence>` and that the FASTQs must be sorted by barcode. For 10x data, the above two modes are preferred.
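
The `-1`/`-2` mode expects FASTQs that are already sorted by barcode. As a rough sketch (not part of EMA), the hypothetical helper `sort_fastq_by_barcode` below sorts an uncompressed FASTQ read from stdin. It assumes four lines per record and that the barcode is the last colon-separated field of the read identifier, as in the TELL-seq example later in this document. Whether EMA needs a particular collation order is not stated here, so the sketch simply uses byte order.

```bash
# Hypothetical helper (not an EMA command): sort FASTQ records by barcode.
# stdin: uncompressed FASTQ, 4 lines per record; stdout: the same records, sorted.
sort_fastq_by_barcode() {
  paste - - - - |                      # one record per line, fields tab-separated
    awk -F '\t' -v OFS='\t' '{
      split($1, tok, " ")              # identifier = first token of the header line
      n = split(tok[1], f, ":")        # assumed: barcode is the last ":" field
      print f[n], $0                   # prepend the barcode as a sort key
    }' |
    LC_ALL=C sort -s -t "$(printf '\t')" -k1,1 |   # stable sort on the key only
    cut -f 2- |                        # drop the key again
    tr '\t' '\n'                       # back to 4 lines per record
}

# Example: sort_fastq_by_barcode < reads.fastq > reads.sorted.fastq
```

Because the sort is stable and keyed on the barcode alone, running it separately on two mate files should keep mates in step, provided both files list reads in the same order and both carry the barcode in the identifier.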

### Parallelism

Multithreading can be enabled with `-t <threads>`. The threading mode depends on how the input is read, however:

- `-s`, `-1`/`-2`: Multiple threads are spawned to work on the single input file (or pair of input files).

- `-x`: Threads work on the input files individually.

(Note that, because of this, it never makes sense to spawn more threads than there are input files when using `-x`.)
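
One way to respect this in a wrapper script is to cap the requested thread count at the number of inputs. The helper name `cap_threads` is made up for this sketch and is not part of EMA.

```bash
# Hypothetical helper: print min(requested threads, number of input files).
# $1 = requested threads; remaining arguments = input files for -x mode.
cap_threads() {
  requested=$1
  shift
  if [ "$requested" -gt "$#" ]; then
    echo "$#"          # more threads than files: extra threads would sit idle
  else
    echo "$requested"
  fi
}

# Illustrative use (paths are placeholders):
#   t=$(cap_threads 40 output_dir/ema-bin-???)
#   ema align -x -t "$t" -r /path/to/ref.fa -o out.sam output_dir/ema-bin-???
```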

### End-to-end workflow (10x)

In this guide, we use the following additional tools:

- [pigz](https://github.com/madler/pigz)

- [sambamba](http://lomereiter.github.io/sambamba/)

- [samtools](https://github.com/samtools/samtools)

- [GNU Parallel](https://www.gnu.org/software/parallel/)

We also use a 10x barcode whitelist, which can be found [here](http://cb.csail.mit.edu/cb/ema/data/4M-with-alts-february-2016.txt).

#### Preprocessing

Preprocessing 10x data entails several steps, the first of which is counting barcodes (`-j` specifies the number of jobs to be spawned by `parallel`):

```bash
cd /path/to/gzipped_fastqs/
parallel -j40 --bar 'pigz -c -d {} | \
  ema count -w /path/to/whitelist.txt -o {/.} 2>{/.}.log' ::: *RA*.gz
```

Make sure that the input FASTQs above **are interleaved** and **only contain the actual reads** (as opposed to sample indices, which typically have `I1` rather than `RA` in their filenames). This will produce `*.ema-ncnt` and `*.ema-fcnt` files, containing the count data.

If you do not have interleaved files, you can interleave them as follows:

```bash
parallel -j40 --bar 'paste <(pigz -c -d {} | paste - - - -) <(pigz -c -d {= s:_R1_:_R2_: =} | paste - - - -) | tr "\t" "\n" |\
  ema count -w /path/to/whitelist.txt -o {/.} 2>{/.}.log' ::: *_R1_*.gz
```

where `s:_R1_:_R2_:` is the substitution expression that maps first-end filenames to second-end filenames (make sure to adjust this if your naming scheme is different).
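
To see what the `paste`/`tr` interleaving does in isolation, here is a small sketch that can be tried on toy files. The function name `interleave_fastq` is invented for illustration. It assumes uncompressed input with four lines per record and no tab characters in the data, and it uses temporary files instead of process substitution so that it also runs in a plain POSIX shell.

```bash
# Hypothetical helper: interleave two mate FASTQs record by record.
# $1 = first-end FASTQ, $2 = second-end FASTQ (both uncompressed).
interleave_fastq() {
  r1_tab=$(mktemp)
  r2_tab=$(mktemp)
  paste - - - - < "$1" > "$r1_tab"     # fold each 4-line record onto one line
  paste - - - - < "$2" > "$r2_tab"
  paste "$r1_tab" "$r2_tab" |          # put each mate pair side by side
    tr '\t' '\n'                       # unfold: R1 record, then its R2 record
  rm -f "$r1_tab" "$r2_tab"
}

# Example: interleave_fastq sample_R1_001.fastq sample_R2_001.fastq | head -8
```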

Now we can do the actual preprocessing, which splits the input into barcode bins (500 by default; specified with `-n`). This preprocessing can be parallelized via `-t`, which specifies how many threads to use:

```bash
pigz -c -d *RA*.gz | ema preproc -w /path/to/whitelist.txt -n 500 -t 40 -o output_dir *.ema-ncnt 2>&1 | tee preproc.log
```

or if you do not have interleaved files:

```bash
paste <(pigz -c -d *_R1_*.gz | paste - - - -) <(pigz -c -d *_R2_*.gz | paste - - - -) | tr "\t" "\n" |\
  ema preproc -w /path/to/whitelist.txt -n 500 -t 40 -o output_dir *.ema-ncnt 2>&1 | tee preproc.log
```

#### Mapping

First we map each barcode bin with EMA. Here, we'll do this using a combination of GNU Parallel and EMA's internal multithreading, which we found to be optimal due to the runtime/memory trade-off. In the following, for instance, we use 10 jobs each with 4 threads (for 40 total threads). We also pipe EMA's SAM output (stdout by default) to `samtools sort`, which produces a sorted BAM:

```bash
parallel --bar -j10 "ema align -t 4 -d -r /path/to/ref.fa -s {} |\
  samtools sort -@ 4 -O bam -l 0 -m 4G -o {}.bam -" ::: output_dir/ema-bin-???
```
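
Once the jobs above finish, it can be worth confirming that every bin actually produced a BAM before moving on. The following is only a sketch: `missing_bams` is a made-up name, and it assumes the `ema-bin-???` naming used in the command above with BAMs written next to each bin as `<bin>.bam`.

```bash
# Hypothetical check: list bins that have no non-empty BAM next to them.
# $1 = preprocessing output directory (e.g. output_dir).
missing_bams() {
  for bin in "$1"/ema-bin-???; do
    [ -e "$bin" ] || continue          # glob matched nothing
    [ -s "$bin.bam" ] || echo "$bin"   # report bins without a non-empty BAM
  done
}

# Example: missing_bams output_dir   # prints nothing if all bins were mapped
```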

Lastly, we map the no-barcode bin with BWA:

```bash
bwa mem -p -t 40 -M -R "@RG\tID:rg1\tSM:sample1" /path/to/ref.fa output_dir/ema-nobc |\
  samtools sort -@ 4 -O bam -l 0 -m 4G -o output_dir/ema-nobc.bam
```

Note that `@RG\tID:rg1\tSM:sample1` is EMA's default read group. If you specify another for EMA, be sure to specify the same for BWA as well (both tools take the full read group string via `-R`).
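
A simple way to keep the two read groups in sync is to build the string once and pass the same variable to both tools. The helper `rg_string` below is a hypothetical convenience, not something EMA or BWA provides. It emits the literal two-character `\t` sequences shown in the examples above.

```bash
# Hypothetical helper: build a read group string with literal "\t" separators.
# $1 = read group ID, $2 = sample name.
rg_string() {
  printf '@RG\\tID:%s\\tSM:%s' "$1" "$2"
}

# Illustrative use (other options omitted):
#   RG=$(rg_string rg1 sample1)
#   ema align ... -R "$RG" ...
#   bwa mem ... -R "$RG" ...
```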

#### Postprocessing

EMA performs duplicate marking automatically. We mark duplicates on BWA's output with `sambamba markdup`:

```bash
sambamba markdup -t 40 -p -l 0 output_dir/ema-nobc.bam output_dir/ema-nobc-dupsmarked.bam
rm output_dir/ema-nobc.bam
```

Now we merge all BAMs into a single BAM (this may require raising the open-file limit, as in `ulimit -n 10000`):

```bash
sambamba merge -t 40 -p ema_final.bam output_dir/*.bam
```

Now you should have a single, sorted, duplicate-marked BAM `ema_final.bam`.
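
Regarding the `ulimit` remark in the merge step: a script can check up front whether the open-file limit is likely to be a problem. This is a sketch under assumptions. The function name `fd_limit_ok` is invented, and the headroom of 16 descriptors is a guess rather than a figure from sambamba.

```bash
# Hypothetical check: is the open-file limit large enough to merge N BAMs?
# $1 = current limit (as printed by `ulimit -n`), $2 = number of BAMs to merge.
fd_limit_ok() {
  limit=$1
  needed=$2
  margin=16                            # assumed headroom for stdio, output BAM, etc.
  [ "$limit" = "unlimited" ] && return 0
  [ "$limit" -ge $((needed + margin)) ]
}

# Illustrative use:
#   set -- output_dir/*.bam
#   fd_limit_ok "$(ulimit -n)" "$#" || ulimit -n 10000
```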

### Other sequencing platforms

EMA can also be run on data from linked-read or sequencing platforms other than 10x Genomics. Other platforms are selected using the flag `-p <platform>`. Available platforms and their flag values are:

- [Haplotagging](#haplotagging): `haplotag`

- [TELL-seq](#tell-seq): `tellseq`

- [Droplet Barcode Sequencing (DBS)](#dbs): `dbs`

- [CPT-seq](#cpt-seqtruseq-slr): `cpt`

- [TruSeq Synthetic Long Reads (SLR)](#cpt-seqtruseq-slr): `tru`

For preprocessing with the subcommands `count` and `preproc`, only 10x Genomics and Haplotagging reads are supported at the moment.

#### Haplotagging

The haplotagging method for generating long reads was presented in [Meier et al. 2021 PNAS](https://doi.org/10.1073/pnas.2015005118). The platform uses a 16 bp barcode. If using haplotagging data, where barcodes are encoded in the read headers as `BX:Z:AxxCxxBxxDxx`, you *do not* need to provide a barcode whitelist for the `count` or `preproc` steps.

#### TELL-seq

The TELL-seq linked-read platform is commercially available from [Universal Sequencing](https://www.universalsequencing.com/) and was presented in [Chen et al. 2020 GenomeRes](https://doi.org/10.1101/gr.260380.119). The platform uses an 18 bp semi-degenerate barcode. The FASTQs can, for example, be preprocessed using the Universal Sequencing TELL-Read pipeline to generate barcode-tagged FASTQs, where the barcode is appended to the read name as below:

```
@A00741:47:HCM53DRXX:1:1101:18159:7326:TTATTTAATCTTAGTCGT 1:N:0:1
TTATTTAATCTTAGTCGTCCTGGCTAATTTTTTTGTATTTTTATTAGATACGGGATTTCTCCATGTTGGCTTGGCGGGTCTCAAACTCTTGACCTTAGGTGATCTGCCTGCCTCAGCCTCCCAAAGTGCTGGGATTACCGGCGTGAGCCACCGCACCCAGCCTA
+
,FFFFFFFFF:FFF::FF,FFFFFFFFFF,FFF:F,F::,FFFF,F,F,FF,,FFFFFFFFFF,,FF:F:FF,F:F,,FFFFFFFF:FFFFFFFF:F,:FFFFFF:FFF:FF,FFFFFFFFFFFF:,::F,FFFF:FFFF,FF,FFFFFF:,,FFFFFFFFFFF
```

Note that these FASTQs need to be sorted by barcode before using `ema align`.

EMA also supports TELL-seq data provided in the `longranger basic` format, i.e. BX-tagged FASTQs as below:

```
@A00741:47:HCM53DRXX:1:1101:18159:7326 BX:Z:TTATTTAATCTTAGTCGT
TTATTTAATCTTAGTCGTCCTGGCTAATTTTTTTGTATTTTTATTAGATACGGGATTTCTCCATGTTGGCTTGGCGGGTCTCAAACTCTTGACCTTAGGTGATCTGCCTGCCTCAGCCTCCCAAAGTGCTGGGATTACCGGCGTGAGCCACCGCACCCAGCCTA
+
,FFFFFFFFF:FFF::FF,FFFFFFFFFF,FFF:F,F::,FFFF,F,F,FF,,FFFFFFFFFF,,FF:F:FF,F:F,,FFFFFFFF:FFFFFFFF:F,:FFFFFF:FFF:FF,FFFFFFFFFFFF:,::F,FFFF:FFFF,FF,FFFFFF:,,FFFFFFFFFFF
```
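
The two header layouts shown above differ only in where the barcode sits, so one can be rewritten into the other with a short `awk` filter. The sketch below is not an EMA feature and the name `suffix_to_bx` is hypothetical. It assumes four-line FASTQ records, takes the last colon-separated field of the read name as the barcode, and drops any trailing header comment such as `1:N:0:1`.

```bash
# Hypothetical converter: "@name:BARCODE comment" -> "@name BX:Z:BARCODE".
# stdin/stdout: uncompressed FASTQ, 4 lines per record.
suffix_to_bx() {
  awk 'NR % 4 == 1 {                       # header lines only
         n = split($1, f, ":")             # $1 = read name incl. barcode suffix
         bc = f[n]                         # assumed: barcode is the last field
         name = substr($1, 1, length($1) - length(bc) - 1)
         print name " BX:Z:" bc
         next
       }
       { print }'                          # sequence, "+", quality unchanged
}

# Example: suffix_to_bx < tellread.fastq > bx_tagged.fastq
```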

#### DBS

EMA can run on linked reads generated using the method presented in [Redin et al. 2019 SciRep](https://doi.org/10.1038/s41598-019-54446-x), commonly referred to as Droplet Barcode Sequencing (DBS). To run `ema align` on DBS linked reads, the FASTQs must have the 20-base barcode present in the read header, similar to the output from `longranger basic`. Here is an example FASTQ entry with the barcode `CTTGGTCATTCATACAGTCC`:

```
@A00621:130:HN5HWDSXX:4:1103:8639:28573 BX:Z:CTTGGTCATTCATACAGTCC
CAGTGGGAGCCCTGACCTTGTTTTTCTGTAAGTAGACGGTCCATCTAGGGGTGATGGGAGAAAGTGACAGATCATCAGGCATTGGATTCTCCTAAGGAGGGTGCAATGTAGATCCCTCGCGTGCAGAACTCAATGTAGGGTTCATGCTCCC
+
F,FF,FFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF:FFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFF:FF::FFF,FFFFFFF,FFFFFFFFFF,FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFF:F:FFFFFFFF
```

#### CPT-seq/TruSeq SLR

Instructions for preprocessing and running EMA on data from CPT-seq and TruSeq Synthetic Long Reads can be found [here](https://github.com/arshajii/ema-paper-data/blob/master/experiments.ipynb).

### Output

EMA outputs a standard SAM file with several additional tags:

- `XG`: alignment probability

- `MI`: cloud identifier (compatible with Long Ranger)

- `XA`: alternate high-probability alignments
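
To inspect these tags without extra tooling, the optional fields of each SAM record can be scanned with `awk`. This is a minimal sketch: the function name `sam_ema_tags` is made up, and it makes no assumption about the tag value types beyond the standard `TAG:TYPE:VALUE` layout of SAM optional fields.

```bash
# Hypothetical helper: print read name, XG value and MI value per alignment.
# stdin: SAM text; missing tags are reported as ".".
sam_ema_tags() {
  awk -F '\t' -v OFS='\t' '
    /^@/ { next }                          # skip header lines
    {
      xg = "."; mi = "."
      for (i = 12; i <= NF; i++) {         # optional fields start at column 12
        if ($i ~ /^XG:/)      xg = substr($i, 6)   # value follows "XG:<type>:"
        else if ($i ~ /^MI:/) mi = substr($i, 6)
      }
      print $1, xg, mi
    }'
}

# Example: samtools view -h ema_final.bam | sam_ema_tags | head
```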