https://github.com/cmdcolin/secondary_rewriter
Adds SEQ and QUAL to secondary alignments from SAM/BAM/CRAM
https://github.com/cmdcolin/secondary_rewriter
Last synced: about 1 year ago
JSON representation
Adds SEQ and QUAL to secondary alignments from SAM/BAM/CRAM
- Host: GitHub
- URL: https://github.com/cmdcolin/secondary_rewriter
- Owner: cmdcolin
- License: mit
- Created: 2022-09-11T04:46:07.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2023-05-03T09:46:38.000Z (about 3 years ago)
- Last Synced: 2025-04-12T09:54:19.148Z (about 1 year ago)
- Language: Rust
- Homepage:
- Size: 366 KB
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
README
# secondary_rewriter
Note: minimap2 2.25-r1173 (25 April 2023) added a --secondary-seq flag https://github.com/lh3/minimap2/commit/4483f89ee5c0e5972820a2b981ffcb88cc3eff6f which makes this workflow unnecessary
You can still use this for other BAM files
Some aligners such as minimap2 do not write the SEQ and QUAL fields to
secondary alignments (which are sometimes called multi-mappers, see Multiple
mapping in SAMv1.pdf https://samtools.github.io/hts-specs/SAMv1.pdf) making it
hard to analyze them (for example, SNPs will not be visible in a genome browser
for secondary alignments and variant calling would not work on them). This
program adds SEQ/QUAL to secondary alignments, referring to the primary
alignment to get the SEQ and QUAL.
Minimap2 reference https://github.com/lh3/minimap2/issues/458 https://github.com/lh3/minimap2/pull/687
## Install
First install rust, probably with rustup https://rustup.rs/
Then
```
cargo install secondary_rewriter
```
## Usage
This small shell script automates the multi-step pipeline (supports BAM or CRAM)
```
#!/bin/bash
# write_secondaries.sh
# usage
# ./write_secondaries.sh
# e.g.
# ./write_secondaries.sh input.cram ref.fa output.cram 16 2G
THR=${4:-4}
MEM=${5:-1G}
samtools view -@$THR -h $1 -T $2 -f256 > sec.txt
samtools view -@$THR -h $1 -T $2 -F256 | secondary_rewriter --generate-primary-loc-tag --secondaries sec.txt | samtools sort --reference $2 -@$THR -m $MEM - -o $3
```
## Two-pass strategy
The two-pass strategy works as follows
1. First pass: output ALL secondary alignments (reads with flag 256) to a
external file
2. Second pass: read secondary alignments from external file into memory,
and then scan original SAM/BAM/CRAM to add SEQ and QUAL fields on the
primary alignments to the secondary alignments, and pipe to `samtools sort`
(needed because all the secondary reads will be out of order, added right
after the primary alignment where the SEQ/QUAL is found)
This process avoids loading the entire SAM/BAM/CRAM into memory, but does
require the `samtools sort` which is a bit expensive. It does load all the
secondary alignments (pre-them-having SEQ/QUAL fields which is generally on the
order of a couple gigabytes instead of hundred(s) of gigabytes) into memory
though.
## Result
Your secondary reads will now display with SNPs and such in a genome browser.
Having SEQ is also important for variant calling.
Screenshots from both IGV and JBrowse 2 (just to show it's not browser
specific) showing the same file before and after calling with
`secondary_rewriter` on a region of the genome with many secondary alignments
in a centromeric region (ultra long read hs37d5 from
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V2.3.4_2019-06-26/

Screenshot of before/after running secondary_rewriter in JBrowse 2

Screenshot of same data, before/after, in IGV
## Help
```
secondary_rewriter 0.1.9
Adds SEQ and QUAL fields to secondary alignments from the primary alignment
USAGE:
secondary_rewriter [OPTIONS]
OPTIONS:
-g, --generate-primary-loc-tag Boolean flag on whether to produce a tag like pl:Z:chr1:1000
on the secondary alignments that says where the primary
alignment is
-h, --help Print help information
-s, --secondaries Path to file of secondary reads (generated by e.g. samtools
view -f256)
-V, --version Print version information
```
`--generate-primary-loc-tag` creates a tag on the secondary reads like
pl:Z:chr1:1000 to say where the primary read is
## Runtime
The speed of this program is mostly limited by samtools view/samtools sort
efficiency, so if you give samtools tons of threads and memory your performance
will improve.
On a t2.2xlarge AWS instance it took 189 minutes (~3 hours) with 8 threads and
1GB per-thread sorting memory to run secondary_rewriter on a 218 gigabyte BAM
file (ultra long read hs37d5 from
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V2.3.4_2019-06-26/)
Note also that the output data file is larger, in this example the result was
267Gb BAM vs the original 218Gb BAM.
## Possible consideration
- There are reasons that minimap2 may not output these fields (size of output
being cited by the author), but it is perfectly possible to add the SEQ and
QUAL back. This PR to minimap2 natively outputs the SEQ and QUAL
https://github.com/lh3/minimap2/pull/687/files but it has been stated that
minimap2 "will not" output these.
- This program does not handle hard clipping in the CIGAR (`H` operator) but I
haven't seen minimap2 output yet. If you see this let me know and a fix can
probably be made.
- Finally, you may also want to think about the implications of how to treat
secondary alignments in your pipeline, for while this program can help in
this particular circumstance, it may be unclear what the implications of
these secondary/multi-mapping alignments are.