{"id":20386942,"url":"https://github.com/cmdcolin/secondary_rewriter","last_synced_at":"2025-04-12T09:54:29.130Z","repository":{"id":59055121,"uuid":"535192777","full_name":"cmdcolin/secondary_rewriter","owner":"cmdcolin","description":"Adds SEQ and QUAL to secondary alignments from SAM/BAM/CRAM","archived":false,"fork":false,"pushed_at":"2023-05-03T09:46:38.000Z","size":375,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-12T09:54:19.148Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cmdcolin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-11T04:46:07.000Z","updated_at":"2025-03-21T21:32:25.000Z","dependencies_parsed_at":"2024-11-15T02:41:48.129Z","dependency_job_id":"348bd9df-141a-464f-a04d-6169e7d0c1e3","html_url":"https://github.com/cmdcolin/secondary_rewriter","commit_stats":{"total_commits":24,"total_committers":1,"mean_commits":24.0,"dds":0.0,"last_synced_commit":"52b6c73e927f5d5d24682ed04e970c3449f62e2d"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Fsecondary_rewriter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Fsecondary_rewriter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Fsecondary_rewriter/releases","manifests_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/cmdcolin%2Fsecondary_rewriter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cmdcolin","download_url":"https://codeload.github.com/cmdcolin/secondary_rewriter/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248550634,"owners_count":21122932,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T02:41:41.595Z","updated_at":"2025-04-12T09:54:29.102Z","avatar_url":"https://github.com/cmdcolin.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# secondary_rewriter\n\nNote: minimap2 2.25-r1173 (25 April 2023) added a --secondary-seq flag https://github.com/lh3/minimap2/commit/4483f89ee5c0e5972820a2b981ffcb88cc3eff6f which makes this workflow unnecessary\n\nYou can still use this for other BAM files\n\n\nSome aligners such as minimap2 do not write the SEQ and QUAL fields to\nsecondary alignments (which are sometimes called multi-mappers, see Multiple\nmapping in SAMv1.pdf https://samtools.github.io/hts-specs/SAMv1.pdf) making it\nhard to analyze them (for example, SNPs will not be visible in a genome browser\nfor secondary alignments and variant calling would not work on them). 
This\nprogram adds SEQ/QUAL to secondary alignments, referring to the primary\nalignment to get the SEQ and QUAL.\n\nMinimap2 reference https://github.com/lh3/minimap2/issues/458 https://github.com/lh3/minimap2/pull/687\n\n## Install\n\nFirst install rust, probably with rustup https://rustup.rs/\n\nThen\n\n```\ncargo install secondary_rewriter\n```\n\n## Usage\n\nThis small shell script automates the multi-step pipeline (supports BAM or CRAM)\n\n```\n\n#!/bin/bash\n\n# write_secondaries.sh\n# usage\n# ./write_secondaries.sh \u003cinput.cram\u003e \u003cref.fa\u003e \u003coutput.cram\u003e \u003cnthreads default 4\u003e \u003cmemory gigabytes, per-thread, default 1G\u003e\n# e.g.\n# ./write_secondaries.sh input.cram ref.fa output.cram 16 2G\n\nTHR=${4:-4}\nMEM=${5:-1G}\n\n\nsamtools view -@\"$THR\" -h \"$1\" -T \"$2\" -f256 \u003e sec.txt\nsamtools view -@\"$THR\" -h \"$1\" -T \"$2\" -F256 | secondary_rewriter --generate-primary-loc-tag --secondaries sec.txt | samtools sort --reference \"$2\" -@\"$THR\" -m \"$MEM\" - -o \"$3\"\n\n```\n\n## Two-pass strategy\n\nThe two-pass strategy works as follows:\n\n1. First pass: output ALL secondary alignments (reads with flag 256) to an\n   external file\n2. Second pass: read the secondary alignments from the external file into memory,\n   then scan the original SAM/BAM/CRAM, copy the SEQ and QUAL fields from the\n   primary alignments onto the secondary alignments, and pipe to `samtools sort`\n   (needed because all the secondary reads will be out of order, added right\n   after the primary alignment where the SEQ/QUAL is found)\n\nThis process avoids loading the entire SAM/BAM/CRAM into memory, but does\nrequire the `samtools sort`, which is a bit expensive. 
It does load all the\nsecondary alignments into memory, though; because they do not yet have SEQ/QUAL\nfields, these are generally on the order of a couple of gigabytes rather than\nhundreds of gigabytes.\n\n## Result\n\nYour secondary reads will now display with SNPs and such in a genome browser.\nHaving SEQ is also important for variant calling.\n\nThe screenshots below, from both IGV and JBrowse 2 (just to show it's not browser\nspecific), show the same file before and after running\n`secondary_rewriter` on a centromeric region of the genome with many secondary\nalignments (ultra long read hs37d5 from\nhttps://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V2.3.4_2019-06-26/)\n\n![](img/jbrowse.png)\nScreenshot of before/after running secondary_rewriter in JBrowse 2\n\n![](img/igv.png)\nScreenshot of same data, before/after, in IGV\n\n## Help\n\n```\n\nsecondary_rewriter 0.1.9\nAdds SEQ and QUAL fields to secondary alignments from the primary alignment\n\nUSAGE:\n    secondary_rewriter [OPTIONS]\n\nOPTIONS:\n    -g, --generate-primary-loc-tag     Boolean flag on whether to produce a tag like pl:Z:chr1:1000\n                                       on the secondary alignments that says where the primary\n                                       alignment is\n    -h, --help                         Print help information\n    -s, --secondaries \u003cSECONDARIES\u003e    Path to file of secondary reads (generated by e.g. 
samtools\n                                       view -f256)\n    -V, --version                      Print version information\n\n```\n\n`--generate-primary-loc-tag` creates a tag on the secondary reads like\n`pl:Z:chr1:1000` to say where the primary read is.\n\n## Runtime\n\nThe speed of this program is mostly limited by samtools view/samtools sort\nefficiency, so if you give samtools tons of threads and memory your performance\nwill improve.\n\nOn a t2.2xlarge AWS instance it took 189 minutes (~3 hours) with 8 threads and\n1GB per-thread sorting memory to run secondary_rewriter on a 218 gigabyte BAM\nfile (ultra long read hs37d5 from\nhttps://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V2.3.4_2019-06-26/)\n\nNote also that the output data file is larger: in this example the result was a\n267GB BAM vs the original 218GB BAM.\n\n## Possible consideration\n\n- There are reasons that minimap2 may not output these fields (size of output\n  being cited by the author), but it is perfectly possible to add the SEQ and\n  QUAL back. This PR to minimap2 natively outputs the SEQ and QUAL\n  https://github.com/lh3/minimap2/pull/687/files but it has been stated that\n  minimap2 \"will not\" output these.\n\n- This program does not handle hard clipping in the CIGAR (`H` operator), but I\n  haven't seen minimap2 output it yet. 
If you see this let me know and a fix can\n  probably be made.\n\n- Finally, you may also want to think about the implications of how to treat\n  secondary alignments in your pipeline, for while this program can help in\n  this particular circumstance, it may be unclear what the implications of\n  these secondary/multi-mapping alignments are.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcmdcolin%2Fsecondary_rewriter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcmdcolin%2Fsecondary_rewriter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcmdcolin%2Fsecondary_rewriter/lists"}