{"id":16618546,"url":"https://github.com/brentp/duphold","last_synced_at":"2025-12-12T11:34:59.010Z","repository":{"id":66472158,"uuid":"147878604","full_name":"brentp/duphold","owner":"brentp","description":"don't get DUP'ed or DEL'ed by your putative SVs.","archived":false,"fork":false,"pushed_at":"2020-12-14T23:00:40.000Z","size":8552,"stargazers_count":102,"open_issues_count":18,"forks_count":9,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-01-17T20:46:14.716Z","etag":null,"topics":["genomics","insanity","structural-variation"],"latest_commit_sha":null,"homepage":null,"language":"Nim","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brentp.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-09-07T21:57:39.000Z","updated_at":"2024-12-12T15:43:56.000Z","dependencies_parsed_at":"2023-06-05T04:15:51.877Z","dependency_job_id":null,"html_url":"https://github.com/brentp/duphold","commit_stats":null,"previous_names":[],"tags_count":19,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Fduphold","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Fduphold/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Fduphold/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Fduphold/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brentp","download_url":"https://codeload.github.com/brentp/duphold/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":
"github","repositories_count":242980782,"owners_count":20216285,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["genomics","insanity","structural-variation"],"created_at":"2024-10-12T02:20:37.960Z","updated_at":"2025-12-12T11:34:53.953Z","avatar_url":"https://github.com/brentp.png","language":"Nim","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/brentp/duphold.svg?branch=master)](https://travis-ci.org/brentp/duphold)\n\n[![Actions Status](https://github.com/brentp/duphold/workflows/Docker%20Image%20CI/badge.svg)](https://github.com/brentp/duphold/actions)\n\n\n# duphold: uphold your DUP and DEL calls\n\nThe paper describing `duphold` is available [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6479422/)\n\nSV callers like [lumpy](https://github.com/arq5x/lumpy) look at split-reads and pair distances to find structural variants.\nThis tool is a fast way to add depth information to those calls. This can be used as additional\ninformation for filtering variants; for example **we will be skeptical of deletion calls that\ndo not have lower than average coverage** compared to regions with similar gc-content.\n\nIn addition, `duphold` will annotate the SV vcf with information from a SNP/Indel VCF. 
For example, **we will not\nbelieve a large deletion that has many heterozygote SNP calls**.\n\n\n`duphold` takes a **bam/cram**, a **VCF/BCF** of SV calls, and a **fasta** reference and it updates the FORMAT field for a\nsingle sample with:\n\n+ **DHFC**: fold-change for the variant depth *relative to the rest of the chromosome* the variant was found on\n+ **DHBFC**: fold-change for the variant depth *relative to bins in the genome with similar GC-content*.\n+ **DHFFC**: fold-change for the variant depth *relative to **F**lanking regions*.\n\nIt also adds **GCF** to the INFO field indicating the fraction of G or C bases in the variant.\n\nAfter annotating with `duphold`, a sensible way to filter to high-quality variants is:\n\n```\nbcftools view -i '(SVTYPE = \"DEL\" \u0026 FMT/DHFFC[0] \u003c 0.7) | (SVTYPE = \"DUP\" \u0026 FMT/DHBFC[0] \u003e 1.3)' $svvcf\n\n```\n\nIn our evaluations, `DHFFC` works best for deletions and `DHBFC` works slightly better for duplications.\nFor genomes/samples with more variable coverage, `DHFFC` should be the most reliable.\n\n\n## SNP/Indel annotation\n\n**NOTE** it is strongly recommended to use BCF for the `--snp` argument as otherwise VCF parsing will be a bottleneck.\n\n+ A DEL call with many HETs is unlikely to be valid.\n\nWhen the user specifies a `--snp` VCF, `duphold` finds the appropriate sample in that file and extracts high (\u003e 20) quality, bi-allelic\nSNP calls  and for each SV, it reports the number of hom-refs, heterozygote, hom-alt, unknown, and low-quality snp calls\nin the region of the event. This information is stored in 5 integers in `DHGT`.\n\nWhen a SNP/Indel VCF/BCF is given, `duphold` will annotate each DEL/DUP call with:\n\n+ **DHGT**: counts of [0] Hom-ref, [1] Het, [2] Homalt, [3] Unknown, [4] low-quality variants in the event.\n  A heterozygous deletion may have more hom-alt SNP calls. 
A homozygous deletion may have only unknown or\n  low-quality SNP calls.\n\nIn practice, this has had limited benefit for us. The depth changes are more informative.\n\n## Performance\n\n### Speed\n\n`duphold` runtime depends almost entirely on how long it takes to parse the BAM/CRAM files; it is relatively independent of the number of variants evaluated. It will also run quite a bit faster on CRAM than on BAM. It can be \u003c 20 minutes of CPU time for a 30X CRAM.\n\n### Accuracy\n\nEvaluting on the [genome in a bottle truthset](ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz) for *DEL calls larger than 300 bp*:\n\n| method      |   FDR |   FN |   FP |   TP-call |   precision |   recall |   recall-% |    FP-% |\n|:------------|------:|-----:|-----:|----------:|------------:|---------:|-----------:|--------:|\n| unfiltered  | 0.054 |  276 |   86 |      1496 |       0.946 |    0.844 |    100.000 | 100.000 |\n| DHBFC \u003c 0.7 | 0.018 |  298 |   27 |      1474 |       0.982 |    0.832 |     98.529 |  31.395 |\n| DHFFC \u003c 0.7 | 0.021 |  289 |   32 |      1483 |       0.979 |    0.837 |     99.131 |  37.209 |\n\n\nNote that filtering on `DHFFC \u003c 0.7` **retains  99.1% of true positives** and **removes  62.8% (100 - 37.2) of false positives**\n\nThis was generated using [truvari.py](https://github.com/spiralgenetics/truvari) with the command:\n```\ntruvari.py --sizemax 15000000 -s 300 -S 270 -b HG002_SVs_Tier1_v0.6.DEL.vcf.gz -c $dupholded_vcf -o $out \\\n   --passonly --pctsim=0  -r 20 --giabreport -f $fasta --no-ref --includebed HG002_SVs_Tier1_v0.6.bed -O 0.6\n```\n\nFor **deletions \u003e= 1KB**, duphold does even better:\n\n| method      |   FDR |   FN |   FP |   TP-call |   precision |   recall |   recall-% |    FP-% |\n|:------------|------:|-----:|-----:|----------:|------------:|---------:|-----------:|--------:|\n| unfiltered  | 0.073 |   46 |   38 |       486 |       0.927 
|    0.914 |    100.000 | 100.000 |\n| DHBFC \u003c 0.7 | 0.012 |   54 |    6 |       478 |       0.988 |    0.898 |     98.354 |  15.789 |\n| DHFFC \u003c 0.7 | 0.012 |   53 |    6 |       479 |       0.988 |    0.900 |     98.560 |  15.789 |\n\nNote that filtering on `DHFFC \u003c 0.7` **retains 98.5% of DEL calls that are also in the truth-set (TPs)** and\n**removes 84.2% (100 - 15.8) of calls not in the truth-set (FPs)**.\n\nThe `truvari.py` command used for this is the same as above except for: `-s 1000 -S 970`\n\n## Install\n\n`duphold` is distributed as a static binary [here](https://github.com/brentp/duphold/releases/latest).\n\n\n\n## Usage\n\n```\nduphold -s $gatk_vcf -t 4 -v $svvcf -b $cram -f $fasta -o $output.bcf\nduphold --snp $gatk_bcf --threads 4 --vcf $svvcf --bam $cram --fasta $fasta --output $output.bcf\n```\n\n`--snp` can be a multi-sample VCF/BCF. `duphold` will be much faster with a BCF, especially if\nthe snp/indel file contains many (\u003e20 or so) samples.\n\nThe threads are decompression threads, so increasing `--threads` up to about 4 helps.\n\nFull usage is available with `duphold -h`.\n\n`duphold` runs on a single sample, but you can install [smoove](https://github.com/brentp/smoove) and run `smoove duphold`\nto parallelize across many samples.\n\n## Examples\n\n#### Duplication\n\nHere is a duplication with a clear change in depth (`DHBFC`):\n\n![image](https://user-images.githubusercontent.com/1739/45895409-5a224080-bd8e-11e8-844f-e7ffc13c7972.png \"example IGV screenshot\")\n\n`duphold` annotated this with\n\n+ **DHBFC**: 1.79\n\nwhere together these indicate rapid (DUP-like) change in depth at the break-points and a coverage that is 1.79 times higher than the mean for the genome--again indicative of a DUP. 
Together, these recapitulate (or anticipate) what we see on visual inspection.\n\n#### Deletion\n\nA clear deletion will have a rapid drop in depth at the left, an increase in depth at the right, and a lower mean coverage.\n\n![image](https://user-images.githubusercontent.com/1739/45895721-2dbaf400-bd8f-11e8-88b3-9fd5a90ef39e.png)\n\n`duphold` annotated this with:\n\n+ **DHBFC**: 0.6\n\nThese indicate that both break-points are consistent with a deletion and that the coverage is ~60% of expected. So this is a clear deletion.\n\n#### BND\n\nWhen lumpy decides that a cluster of evidence does not match a DUP or DEL or INV, it creates a BND with 2 lines in the VCF. Sometimes these\nare actual deletions. For example:\n\n![image](https://user-images.githubusercontent.com/1739/45906495-987d2700-bdb1-11e8-8ba5-eacdf8221f68.png)\n\nshows where a deletion is bounded by 2 BND calls. `duphold` annotates this with:\n\n+ **DHBFC**: 0.01\n\nindicating a homozygous deletion with clear break-points.\n\n\n## Tuning and Env vars\n\nThe default flank is 1000 bases. If the environment variable `DUPHOLD_FLANK` is set to an integer, that\nvalue is used instead. In our experiments, this value should be large enough that duphold can get a good estimate\nof depth, but small enough that it is unlikely to extend into an unmapped region or another event.\nThis may be lowered for genomes with poor assemblies.\n\nIf the sample name in your bam does not match the one in the VCF (tisk, tisk), you can use the `DUPHOLD_SAMPLE_NAME`\nenvironment variable to set the name to use.\n\n\n## Acknowledgements\n\nI stole the idea of annotating SVs with depth-change from Ira Hall.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrentp%2Fduphold","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrentp%2Fduphold","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrentp%2Fduphold/lists"}