{"id":13407997,"url":"https://github.com/arshajii/ema","last_synced_at":"2026-04-05T21:38:14.884Z","repository":{"id":52955963,"uuid":"84670075","full_name":"arshajii/ema","owner":"arshajii","description":"Fast \u0026 accurate alignment of barcoded short-reads","archived":false,"fork":false,"pushed_at":"2023-06-29T18:48:12.000Z","size":258,"stargazers_count":32,"open_issues_count":18,"forks_count":6,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-10-23T05:40:34.776Z","etag":null,"topics":["bioinformatics","sequence-alignment"],"latest_commit_sha":null,"homepage":"http://ema.csail.mit.edu","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arshajii.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-11T18:08:45.000Z","updated_at":"2025-09-09T10:06:37.000Z","dependencies_parsed_at":"2024-10-26T03:23:27.266Z","dependency_job_id":"3ca1af75-6779-4d0b-a776-4ba7e2732b83","html_url":"https://github.com/arshajii/ema","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/arshajii/ema","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arshajii%2Fema","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arshajii%2Fema/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arshajii%2Fema/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arshajii%2Fema/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arshajii",
"download_url":"https://codeload.github.com/arshajii/ema/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arshajii%2Fema/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31451445,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T21:22:52.476Z","status":"ssl_error","status_checked_at":"2026-04-05T21:22:51.943Z","response_time":75,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","sequence-alignment"],"created_at":"2024-07-30T20:00:50.047Z","updated_at":"2026-04-05T21:38:14.821Z","avatar_url":"https://github.com/arshajii.png","language":"C++","funding_links":[],"categories":["Tools"],"sub_categories":[],"readme":"EMA: An aligner for barcoded short-read sequencing data\n=======================================================\n![Build Status](https://github.com/arshajii/ema/actions/workflows/ci.yml/badge.svg) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/arshajii/ema/master/LICENSE) [![Mentioned in Awesome 10x Genomics](https://awesome.re/mentioned-badge.svg)](https://github.com/johandahlberg/awesome-10x-genomics)\n\nEMA uses a latent variable model to align barcoded short-reads (such as those produced by [10x Genomics](https://www.10xgenomics.com)' sequencing 
platform). More information is available in [our paper](https://www.biorxiv.org/content/early/2017/11/16/220236). The full experimental setup is available [here](https://github.com/arshajii/ema-paper-data/blob/master/experiments.ipynb).\n\n### Install\n#### With `brew` 🍺\n\n```bash\nbrew install brewsci/bio/ema\n```\n\n#### With `conda` 🐍\n\n```bash\nconda install -c bioconda ema\n```\n\n#### From source 🛠️\n\n```bash\ngit clone --recursive https://github.com/arshajii/ema\ncd ema\nmake\n```\n\n(The `--recursive` flag is needed because EMA uses BWA's C API.)\n\n### Usage\n```\nusage: ema \u003ccount|preproc|align|help\u003e [options]\n\ncount: perform preliminary barcode count (takes interleaved FASTQ via stdin)\n  -w \u003cwhitelist path\u003e: specify barcode whitelist [required]\n  -o \u003coutput prefix\u003e: specify output prefix [required]\n  -p: using haplotag barcodes\n\npreproc: preprocess barcoded FASTQ files (takes interleaved FASTQ via stdin)\n  -w \u003cwhitelist path\u003e: specify whitelist [required]\n  -n \u003cnum buckets\u003e: number of barcode buckets to make [500]\n  -h: apply Hamming-2 correction [off]\n  -o: \u003coutput directory\u003e specify output directory [required]\n  -b: output BX:Z-formatted FASTQs [off]\n  -p: using haplotag barcodes\n  -t \u003cthreads\u003e: set number of threads [1]\n  all other arguments: list of all output prefixes generated by count stage\n\nalign: choose best alignments based on barcodes\n  -1 \u003cFASTQ1 path\u003e: first (preprocessed and sorted) FASTQ file [none]\n  -2 \u003cFASTQ2 path\u003e: second (preprocessed and sorted) FASTQ file [none]\n  -s \u003cEMA-FASTQ path\u003e: specify special FASTQ path [none]\n  -x: multi-input mode; takes input files after flags and spawns a thread for each [off]\n  -r \u003cFASTA path\u003e: indexed reference [required]\n  -o \u003cSAM file\u003e: output SAM file [stdout]\n  -R \u003cRG string\u003e: full read group string (e.g. 
'@RG\\tID:foo\\tSM:bar') [none]\n  -d: apply fragment read density optimization [off]\n  -p \u003cplatform\u003e: sequencing platform (one of '10x', 'tru', 'cpt', 'haplotag', 'dbs', 'tellseq') [10x]\n  -i \u003cindex\u003e: index to follow 'BX' tag in SAM output [1]\n  -t \u003cthreads\u003e: set number of threads [1]\n  all other arguments (only for -x): list of all preprocessed inputs\n\nhelp: print this help message\n```\n\nSee the [Other sequencing platforms](#other-sequencing-platforms) section below for more information\nabout the implementation details of the different linked-read sequencing technologies.\n\n### Input formats\nEMA has several input modes:\n- `-s \u003cinput\u003e`: Input file is a single preprocessed \"special\" FASTQ generated by the preprocessing steps below.\n- `-x`: Input files are listed after flags (as in `ema align -a -b -c \u003cinput 1\u003e \u003cinput 2\u003e ... \u003cinput N\u003e`). Each input is processed, and all results are written to the SAM file specified with `-o`.\n- `-1 \u003cfirst mate\u003e`/`-2 \u003csecond mate\u003e`: Input files are standard FASTQs. For interleaved FASTQs, `-2` can be omitted. The only restrictions in this input mode are that read identifiers must end in `:\u003cbarcode sequence\u003e` and that the FASTQs must be sorted by barcode. For 10x data, the two modes above (`-s` and `-x`) are preferred.\n\n### Parallelism\nMultithreading can be enabled with `-t \u003cnum threads\u003e`. 
The actual threading mode is dependent on how the input is being read, however:\n- `-s`, `-1`/`-2`: Multiple threads are spawned to work on the single input file (or pair of input files).\n- `-x`: Threads work on the input files individually.\n\n(Note that, because of this, it never makes sense to spawn more threads than there are input files when using `-x`.)\n\n### End-to-end workflow (10x)\nIn this guide, we use the following additional tools:\n- [pigz](https://github.com/madler/pigz)\n- [sambamba](http://lomereiter.github.io/sambamba/)\n- [samtools](https://github.com/samtools/samtools)\n- [GNU Parallel](https://www.gnu.org/software/parallel/)\n\nWe also use a 10x barcode whitelist, which can be found [here](http://cb.csail.mit.edu/cb/ema/data/4M-with-alts-february-2016.txt).\n\n#### Preprocessing\nPreprocessing 10x data entails several steps, the first of which is counting barcodes (`-j` specifies the number of jobs to be spawned by `parallel`):\n\n```bash\ncd /path/to/gzipped_fastqs/\nparallel -j40 --bar 'pigz -c -d {} | \\\n  ema count -w /path/to/whitelist.txt -o {/.} 2\u003e{/.}.log' ::: *RA*.gz\n```\n\nMake sure that the FASTQs **are interleaved** and **only contain the actual reads**  in the files above (as opposed to sample indices, typically with `I1` in their filenames rather than `RA`). 
This will produce `*.ema-ncnt` and `*.ema-fcnt` files, containing the count data.\n\nIf you do not have interleaved files, you can interleave them as follows:\n\n```bash\nparallel -j40 --bar 'paste \u003c(pigz -c -d {} | paste - - - -) \u003c(pigz -c -d {= s:_R1_:_R2_: =} | paste - - - -) | tr \"\\t\" \"\\n\" |\\\n  ema count -w /path/to/whitelist.txt -o {/.} 2\u003e{/.}.log' ::: *_R1_*.gz\n```\n\nwhere `s:_R1_:_R2_:` is the regex that casts first-end filenames into the second-end filenames (make sure to adjust this if your naming scheme is different).\n\nNow we can do the actual preprocessing, which splits the input into barcode bins (500 by default; specified with `-n`). This preprocessing can be parallelized via `-t`, which specifies how many threads to use:\n\n```bash\npigz -c -d *RA*.gz | ema preproc -w /path/to/whitelist.txt -n 500 -t 40 -o output_dir *.ema-ncnt 2\u003e\u00261 | tee preproc.log\n```\n\nor if you do not have interleaved files:\n\n```bash\npaste \u003c(pigz -c -d *_R1_*.gz | paste - - - -) \u003c(pigz -c -d *_R2_*.gz | paste - - - -) | tr \"\\t\" \"\\n\" |\\\n  ema preproc -w /path/to/whitelist.txt -n 500 -t 40 -o output_dir *.ema-ncnt 2\u003e\u00261 | tee preproc.log\n```\n\n#### Mapping\nFirst we map each barcode bin with EMA. Here, we'll do this using a combination of GNU Parallel and EMA's internal multithreading, which we found to be optimal due to the runtime/memory trade-off. In the following, for instance, we use 10 jobs each with 4 threads (for 40 total threads). 
We also pipe EMA's SAM output (stdout by default) to `samtools sort`, which produces a sorted BAM:\n\n```bash\nparallel --bar -j10 \"ema align -t 4 -d -r /path/to/ref.fa -s {} |\\\n  samtools sort -@ 4 -O bam -l 0 -m 4G -o {}.bam -\" ::: output_dir/ema-bin-???\n```\n\nLastly, we map the no-barcode bin with BWA:\n\n```bash\nbwa mem -p -t 40 -M -R \"@RG\\tID:rg1\\tSM:sample1\" /path/to/ref.fa output_dir/ema-nobc |\\\n  samtools sort -@ 4 -O bam -l 0 -m 4G -o output_dir/ema-nobc.bam\n```\n\nNote that `@RG\\tID:rg1\\tSM:sample1` is EMA's default read group. If you specify another for EMA, be sure to specify the same for BWA as well (both tools take the full read group string via `-R`).\n\n#### Postprocessing\nEMA performs duplicate marking automatically. We mark duplicates on BWA's output with `sambamba markdup`:\n\n```bash\nsambamba markdup -t 40 -p -l 0 output_dir/ema-nobc.bam output_dir/ema-nobc-dupsmarked.bam\nrm output_dir/ema-nobc.bam\n```\n\nNow we merge all BAMs into a single BAM (might require modifying `ulimit`s, as in `ulimit -n 10000`):\n\n```bash\nsambamba merge -t 40 -p ema_final.bam output_dir/*.bam\n```\n\nNow you should have a single, sorted, duplicate-marked BAM `ema_final.bam`.\n\n### Other sequencing platforms\nEMA can also be run on data from linked-read sequencing platforms other than 10x Genomics. Other platforms are selected with the flag `-p \u003cplatform\u003e`. The available platforms and their flag values are:\n\n- [Haplotagging](#haplotagging): `haplotag`\n- [TELL-seq](#tell-seq): `tellseq`\n- [Droplet Barcode Sequencing (DBS)](#dbs): `dbs`\n- [CPT-seq](#cpt-seqtruseq-slr): `cpt`\n- [TruSeq Synthetic Long Reads (SLR)](#cpt-seqtruseq-slr): `tru`\n\nFor preprocessing with the `count` and `preproc` subcommands, only 10x Genomics and haplotagging reads are supported at the moment.\n\n#### Haplotagging\n\nThe haplotagging method for generating long reads was presented in [Meier et al. 2021 PNAS](https://doi.org/10.1073/pnas.2015005118). 
The platform uses a 16 bp barcode. If using haplotagging data, where barcodes are coded in the read headers as `BX:Z:AxxCxxBxxDxx`, you *do not* need to provide\na barcode whitelist for the `count` or `preproc` steps.\n\n#### TELL-seq\n\nThe TELL-seq linked-read platform is commercially available from [Universal Sequencing](https://www.universalsequencing.com/) and was presented in [Chen et al. 2020 GenomeRes](https://doi.org/10.1101/gr.260380.119). The platform uses an 18 bp semi-degenerate barcode. The FASTQs can, for example, be preprocessed using the Universal Sequencing TELL-Read pipeline to generate barcode-tagged FASTQs in which the barcode is appended to the read name, as below:\n\n```\n@A00741:47:HCM53DRXX:1:1101:18159:7326:TTATTTAATCTTAGTCGT 1:N:0:1\nTTATTTAATCTTAGTCGTCCTGGCTAATTTTTTTGTATTTTTATTAGATACGGGATTTCTCCATGTTGGCTTGGCGGGTCTCAAACTCTTGACCTTAGGTGATCTGCCTGCCTCAGCCTCCCAAAGTGCTGGGATTACCGGCGTGAGCCACCGCACCCAGCCTA\n+\n,FFFFFFFFF:FFF::FF,FFFFFFFFFF,FFF:F,F::,FFFF,F,F,FF,,FFFFFFFFFF,,FF:F:FF,F:F,,FFFFFFFF:FFFFFFFF:F,:FFFFFF:FFF:FF,FFFFFFFFFFFF:,::F,FFFF:FFFF,FF,FFFFFF:,,FFFFFFFFFFF\n```\n\nNote that these FASTQs need to be sorted by barcode before using `ema align`.\n\nEMA also supports TELL-seq data provided in the `longranger basic` format, i.e. BX-tagged FASTQs as below:\n\n```\n@A00741:47:HCM53DRXX:1:1101:18159:7326 BX:Z:TTATTTAATCTTAGTCGT\nTTATTTAATCTTAGTCGTCCTGGCTAATTTTTTTGTATTTTTATTAGATACGGGATTTCTCCATGTTGGCTTGGCGGGTCTCAAACTCTTGACCTTAGGTGATCTGCCTGCCTCAGCCTCCCAAAGTGCTGGGATTACCGGCGTGAGCCACCGCACCCAGCCTA\n+\n,FFFFFFFFF:FFF::FF,FFFFFFFFFF,FFF:F,F::,FFFF,F,F,FF,,FFFFFFFFFF,,FF:F:FF,F:F,,FFFFFFFF:FFFFFFFF:F,:FFFFFF:FFF:FF,FFFFFFFFFFFF:,::F,FFFF:FFFF,FF,FFFFFF:,,FFFFFFFFFFF\n```\n\n#### DBS\n\nEMA can run using linked reads generated using the method presented in [Redin et al. 2019 SciRep](https://doi.org/10.1038/s41598-019-54446-x), commonly referred to as Droplet Barcode Sequencing (DBS). 
To run `ema align` on DBS linked reads, the FASTQs must have the 20-base barcode present in the read header, similar to the output of `longranger basic`. Here is an example FASTQ entry with the barcode `CTTGGTCATTCATACAGTCC`:\n\n```\n@A00621:130:HN5HWDSXX:4:1103:8639:28573 BX:Z:CTTGGTCATTCATACAGTCC\nCAGTGGGAGCCCTGACCTTGTTTTTCTGTAAGTAGACGGTCCATCTAGGGGTGATGGGAGAAAGTGACAGATCATCAGGCATTGGATTCTCCTAAGGAGGGTGCAATGTAGATCCCTCGCGTGCAGAACTCAATGTAGGGTTCATGCTCCC\n+\nF,FF,FFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF:FFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFF:FF::FFF,FFFFFFF,FFFFFFFFFF,FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFF:F:FFFFFFFF\n```\n\n#### CPT-seq/TruSeq SLR\n\nInstructions for preprocessing and running EMA on data from CPT-seq and TruSeq Synthetic Long Reads can be found [here](https://github.com/arshajii/ema-paper-data/blob/master/experiments.ipynb).\n\n### Output\nEMA outputs a standard SAM file with several additional tags:\n\n- `XG`: alignment probability\n- `MI`: cloud identifier (compatible with Long Ranger)\n- `XA`: alternate high-probability alignments\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farshajii%2Fema","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farshajii%2Fema","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farshajii%2Fema/lists"}