{"id":42060357,"url":"https://github.com/statgen/topmed_variant_calling","last_synced_at":"2026-01-26T07:39:04.913Z","repository":{"id":140593988,"uuid":"137161839","full_name":"statgen/topmed_variant_calling","owner":"statgen","description":null,"archived":false,"fork":false,"pushed_at":"2024-11-27T14:15:03.000Z","size":1952,"stargazers_count":23,"open_issues_count":3,"forks_count":3,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-01-13T00:46:08.517Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/statgen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-13T04:18:02.000Z","updated_at":"2024-11-27T14:15:10.000Z","dependencies_parsed_at":"2023-05-03T21:02:10.149Z","dependency_job_id":"386ef609-8470-43a2-aa11-0a9ba433760c","html_url":"https://github.com/statgen/topmed_variant_calling","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/statgen/topmed_variant_calling","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statgen%2Ftopmed_variant_calling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statgen%2Ftopmed_variant_calling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statgen%2Ftopmed_variant_calling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statgen%2Ftopmed_variant_calling/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/statgen","download_url":"https://codeload.github.com/statgen/topmed_variant_calling/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statgen%2Ftopmed_variant_calling/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28769853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-26T06:37:25.426Z","status":"ssl_error","status_checked_at":"2026-01-26T06:37:23.039Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-26T07:39:03.886Z","updated_at":"2026-01-26T07:39:04.904Z","avatar_url":"https://github.com/statgen.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"TOPMed Variant Calling Pipeline (latest: Freeze 8)\n==================================================\n\nOverview of this repository\n----------------------------\n\nThis repository is intended to provide a collection of software tools used for producing TOPMed variant calls and genotypes with a comprehensive documentation that allows investigators to understand the methods and reproduce the variant calls from the same set of aligned sequence reads.\n\nThis repository reflects specific versions of software tools that are\nunder active development in the Center for Statistical Genetics\n(CSG). 
Most of the latest versions of these software tools can be\naccessed through the repositories linked as submodules; this repository focuses on a freeze of the software tools that can reproduce variant calls compatible with the latest TOPMed Freeze (currently Freeze 8).\n\n\nOutline of the variant calling procedure\n----------------------------------------\n\nOur ``GotCloud vt`` pipeline detects and genotypes variants from a list of aligned sequence reads. Specifically, the pipeline consists of the following key steps. Most of these procedures will be integrated into the ``GotCloud`` software package later this year. \n\n1. **Sample quality control** : For each sequenced genome (in BAM/CRAMs), the genetic ancestry, sequence contamination, and biological sex are inferred using ``cramore cram-verify-bam`` and ``cramore vcf-normalize-depth``. \n2. **Variant detection** : For each sequenced genome (in BAM/CRAMs), candidate variants are detected by the ``vt discover2`` software tool, separately for each chromosome. The candidate variants are normalized by the ``vt normalize`` algorithm. \n3. **Variant consolidation** : For each chromosome, the called variant sites are merged across the genomes, accounting for overlap of variants between genomes, using the ``cramore vcf-merge-candidate-variants`` and ``vt annotate_indels`` software tools.\n4. **Genotype and feature collection** : For each 100kb chunk of the genome, the genotyping module implemented in ``cramore dense-genotypes`` collects individual genotypes and variant features across the merged sites by iterating over each sequenced genome, focusing on the selected region.  \n5. 
**Variant filtering** : We use the inferred pedigree of related and duplicated samples to calculate the Mendelian consistency statistics using ``king``, ``vcf-infer-ped``, and ``vt milk-filter``, and train a variant classifier using the Support Vector Machine (SVM) implemented in the ``libsvm`` software package and the ``run-svm-filter`` software tool.\n\n\n![TOPMed Variant Calling Overview](topmed_variant_calling_overview.png)\n\n\nInstallation\n-------------\nFirst, clone the repository by recursively cloning each submodule.\n```\n   $ git clone --recurse-submodules https://github.com/statgen/topmed_variant_calling\n   (Use --recursive instead of --recurse-submodules for git version 2.12 or lower) \n```\n\nNext, build each submodule using the following set of commands \n```\n   $ cd libsvm/; make; cd ..\n   $ cd apigenome; autoreconf -vfi; ./configure --prefix $PWD; make; make install; cd ..\n   $ cd libStatGen; make; cd ..\n   $ cd bamUtil; make; cd ..\n   $ cd invNorm; make; cd ..\n   $ cd htslib; autoheader; autoconf; ./configure; make; cd ..\n   $ cd vt-topmed; make; cd ..\n   $ cd cramore; autoreconf -vfi; ./configure; make; cd ..\n   $ cd samtools; autoheader; autoconf -Wno-syntax; ./configure; make; cd ..\n   $ cd bcftools; make; cd ..\n   $ cd king; g++ -O3 -c *.cpp; g++ -O3 -o king *.o -lz; cd ..\n```\n\nPerforming variant calling with example data\n----------------------------------------------\nTo produce variant calls using this pipeline, the following input\nfiles are needed:\n\n 1. Aligned sequence reads in BAM or CRAM format. Each BAM or CRAM file should contain one sample per subject. It must also be indexed using ``samtools index`` or an equivalent software tool.\n 2. A sequence index file. Each line should contain [Sample ID] [Full Path to the BAM/CRAM file]. See [examples/index/list.107.local.crams.index](examples/index/list.107.local.crams.index) for an example.\n 3. Genomic resource files, such as FASTA, dbSNP, and HapMap files. 
An example collection of such resources is hosted at ``ftp://share.sph.umich.edu/1000genomes/fullProject/hg38_resources``\n \nHere we use 107 public samples from the TOPMed project to document a\nreproducible variant calling pipeline that resembles the latest\nTOPMed variant calling pipeline. In order to do so, you need to\ndownload the following two sets of files.\n\n1. Download the resource files for the variant calling. The tarball\n   package is available at\n   ``ftp://share.sph.umich.edu/1000genomes/fullProject/hg38_resources``. The ``resources/`` directory is assumed to be present under the ``examples/`` directory\n   in our tutorial. To download the data via the command\n   line, you may use the following commands. Note that the file size is\n   4.5GB, and the download will take a significant amount of time.\n\n```\n   $ cd examples/\n   $ wget ftp://share.sph.umich.edu/1000genomes/fullProject/hg38_resources/topmed_variant_calling_example_resources.tar.gz\n   $ tar xzvf topmed_variant_calling_example_resources.tar.gz\n```\n\n2. Download 107 CRAMs from the public GCS bucket.\n   The CRAM files are publicly available at\n   ``gs://topmed-irc-share/public``. However, if you want to access\n   the data from outside the Google Cloud, it will not be free of charge. 
\n   To download the files, you need to set the ``[PROJECT_ID]`` that is\n   associated with a billing account, and use the ``gsutil`` tool documented at https://cloud.google.com/storage/docs/gsutil.\n   The total size of the CRAM files is 2.17TB, and the estimated egress\n   charge is $256 assuming the $0.12/GB rate listed at https://cloud.google.com/compute/pricing#internet_egress\n```\n   gsutil -u [PROJECT_ID] -m cp -r gs://topmed-irc-share/public [DESTINATION_PATH]\n```\n\n   Here in the tutorial, we will assume that the files are stored or\n   symbolically linked in the\n   ``examples/crams`` directory.\n   The [examples/index/list.107.local.crams.index](examples/index/list.107.local.crams.index) file contains the\n   sample ID and CRAM file path. The CRAM files and their corresponding\n   indices (.cram.crai) must be present before running the examples.\n   \nThe TOPMed variant calling has been performed on the Google Cloud. The\nsoftware tool `cloudify` supports running jobs on the Google Cloud. However,\nthe tutorial examples here are configured to run on a local\ncomputer. Cloud-based example commands will also be available later.\n\n### Step 1. Sample QC and Variant Detection \n\nFirst, make sure to change your current directory to [examples](examples),\nand run the following command.\n\n```\n $ cd examples/\n\n $ ../apigenome/bin/cloudify --cmd ../scripts/run-discovery-local.cmd \n```\n\nThen follow the instructions to run ``make`` with proper arguments to\ncomplete the step. This step performs the following tasks:\n* Run ``vt discover2`` to detect potential candidate variants and\n  generate a per-sample BCF.\n* Run ``cramore verify-bam`` (verifyBamID2) to jointly estimate genetic ancestry and DNA\n  contamination.\n* Run ``vcf-normalize-depth`` to calculate relative X/Y depth to\n  determine the sex of the sequenced genome.\n  \nUpon successful completion, we expect to see the following files for each sequenced genome represented by ``[NWD_ID]``. 
\n * ``out/sm/[NWD_ID]/[NWD_ID].bcf``\n * ``out/sm/[NWD_ID]/[NWD_ID].bcf.csi``\n * ``out/sm/[NWD_ID]/[NWD_ID].vb2``\n * ``out/sm/[NWD_ID]/[NWD_ID].norm.xy`` \n  \nMore technical details can be found by directly examining\n``../scripts/run-discovery-local.cmd``. The ``cloudify`` script takes\nthis command file and iterates the command across all samples listed in\nthe index file in an idempotent manner using GNU ``make``.\n\nAfter running the steps above, the ``verifyBamID2`` results and X/Y\ndepth results should be merged together into a single index file using\nthe following commands:\n\n```\n  $ mkdir -p out/index/\n  $ ../apigenome/bin/cram-vb-xy-index --index index/list.107.local.crams.index --dir out/sm/ --out out/index/list.107.local.crams.vb_xy.index\n```\n\nUpon successful completion, we expect to see the file ``out/index/list.107.local.crams.vb_xy.index``, which has a header line and one line for each of the 107 samples.\n\n### Step 2. Hierarchical merge of variant sites across all samples\n\nThe next step is to merge the candidate variant sites across all sequenced\ngenomes. Even though the example data consist of only 107 samples, we\nuse a batch size of 20 to show how to process hundreds of\nthousands of genomes without opening all the files in a single\nprocess. We typically process 1,000 samples per batch, and merge\nhundreds of batches together later on.\n\n```\n  $ ../apigenome/bin/cloudify --cmd ../scripts/run-merge-sites-local.cmd \n```\n\nThis command will make a merged site list for each batch and each 10Mb\ninterval. Upon successful completion, we expect to see the following files.\n* ``out/union/[BATCH]/b[BATCH].chr[CHROM]_[BEGIN_10Mb_CHUNK]_[END_10Mb_CHUNK].merged.sites.bcf``\n* ``out/union/[BATCH]/b[BATCH].chr[CHROM]_[BEGIN_10Mb_CHUNK]_[END_10Mb_CHUNK].merged.sites.bcf.csi``\nwhere ``[BATCH]`` represents the batch (e.g. 
1, 21, .., 101 in our example data, as available at [examples/index/seq.batches.by.20.txt](examples/index/seq.batches.by.20.txt) ), and ``[BEGIN_10Mb_CHUNK]`` and ``[END_10Mb_CHUNK]`` are available at [examples/index/intervals/b38.intervals.X.10Mb.10Mb.txt](examples/index/intervals/b38.intervals.X.10Mb.10Mb.txt). \n\nThese per-batch site lists are further merged and\nconsolidated using the following command.\n\n```\n  $ ../apigenome/bin/cloudify --cmd ../scripts/run-union-sites-local.cmd\n```\n\nAs a result, there will be a merged and consolidated site list across\nall samples for each 10Mb region in the ``out/union/`` directory. Upon completion, we expect to see the following files.\n* ``out/union/union.chr[CHROM]_[BEGIN_10Mb_CHUNK]_[END_10Mb_CHUNK].sites.bcf``\n* ``out/union/union.chr[CHROM]_[BEGIN_10Mb_CHUNK]_[END_10Mb_CHUNK].sites.bcf.csi``\n\n### Step 3. Hierarchical joint genotyping of merged variant sites\n\nThe merged site list can be jointly genotyped across the\nsamples. However, joint genotyping of hundreds of thousands of samples is not\nstraightforward. We again perform genotyping in batches of 1,000 samples for each\n10Mb region. Next we hierarchically merge the genotypes across all the\nbatches, but with a much smaller region size (e.g. 100kb) to keep\nthe file size and running time manageable on a single machine. 
During\nthis joint genotyping step, we use the DNA contamination rate,\ninferred genetic ancestries, and\ninferred sex for more accurate genotyping.\n\nIn this example, we use a batch size of 20 and a 1Mb region size\nwhen merging the genotypes across batches, to resemble the Freeze 8\ncalling pipeline.\n\nFirst, generating per-batch genotypes for each 10Mb region can be\nachieved using the following command:\n\n```\n  $ ../apigenome/bin/cloudify --cmd ../scripts/run-batch-genotype-local.cmd \n```\n\nUpon successful completion, we expect to see the following files.\n* ``out/genotypes/batches/[BATCH]/b[BATCH].chr[CHROM]_[BEGIN_10Mb_CHUNK]_[END_10Mb_CHUNK].genotypes.bcf``\nwhere ``[BATCH]`` represents the batch (e.g. 1, 21, .., 101 in our example data, as available at [examples/index/seq.batches.by.20.txt](examples/index/seq.batches.by.20.txt) ), and ``[BEGIN_10Mb_CHUNK]`` and ``[END_10Mb_CHUNK]`` are available at [examples/index/intervals/b38.intervals.X.10Mb.10Mb.txt](examples/index/intervals/b38.intervals.X.10Mb.10Mb.txt). 
\n\n\nNext, pasting the genotypes across all batches while recalculating the\nvariant features is achieved using the following command:\n\n```\n  $ ../apigenome/bin/cloudify --cmd ../scripts/run-paste-genotype-local.cmd\n```\n\nUpon successful completion, we expect to see the following files.\n* ``out/genotypes/merged/[CHROM]/merged.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].genotypes.bcf``\n* ``out/genotypes/merged/[CHROM]/merged.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].genotypes.bcf.csi``\n* ``out/genotypes/minDP0/[CHROM]/merged.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].gtonly.minDPO.bcf``\n* ``out/genotypes/minDP0/[CHROM]/merged.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].gtonly.minDPO.bcf.csi``\n* ``out/genotypes/minDP10/[CHROM]/merged.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].gtonly.minDP10.bcf``\n* ``out/genotypes/minDP10/[CHROM]/merged.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].gtonly.minDP10.bcf.csi``\n* ``out/genotypes/hgdp/[CHROM]/merged.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].gtonly.minDP10.hgdp.bcf``\n* ``out/genotypes/hgdp/[CHROM]/merged.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].gtonly.minDP10.hgdp.bcf.csi``\nwhere ``[CHROM]``, ``[BEGIN_1Mb_CHUNK]``, and ``[END_1Mb_CHUNK]`` are available in the 1st, 4th, and 5th columns of [examples/index/intervals/b38.intervals.X.10Mb.1Mb.txt](examples/index/intervals/b38.intervals.X.10Mb.1Mb.txt). \n\n\n### Step 4. Inferring Duplicated and Related Individuals\n\nThe step above not only pastes the genotypes across the samples, but also\ngenerates multiple versions of the genotypes, such as ``minDP0`` (no\nmissing genotypes), ``minDP10`` (genotypes marked missing if depth is\n10 or less), and ``hgdp`` (only HGDP-polymorphic sites extracted). 
\n\nThe HGDP genotypes on autosomes can be merged together across all regions\nusing the following commands:\n\n```\n   $ cut -f 1,4,5 index/intervals/b38.intervals.X.10Mb.1Mb.txt | grep -v ^chrX | awk '{print \"out/genotypes/hgdp/\"$1\"/merged.\"$1\"_\"$2\"_\"$3\".gtonly.minDP0.hgdp.bcf\"}' \u003e out/index/hgdp.auto.bcflist.txt\n\n   $ ../bcftools/bcftools concat -n -f out/index/hgdp.auto.bcflist.txt -Ob -o out/genotypes/hgdp/merged.autosomes.gtonly.minDP0.hgdp.bcf\n```\n\nThis HGDP-site BCF file can be converted into PLINK format, and the\npedigree can be inferred using the ``king`` and ``vcf-infer-ped`` software\ntools as follows. Note that ``plink-1.9`` is not part of this repository; you need to obtain the software separately at https://www.cog-genomics.org/plink2 . \n\n```\n   $ plink-1.9 --bcf out/genotypes/hgdp/merged.autosomes.gtonly.minDP0.hgdp.bcf --make-bed --out out/genotypes/hgdp/merged.autosomes.gtonly.minDP0.hgdp.plink --allow-extra-chr\n\n   $ ../king/king -b out/genotypes/hgdp/merged.autosomes.gtonly.minDP0.hgdp.plink.bed --degree 4 --kinship --prefix out/genotypes/hgdp/merged.autosomes.gtonly.minDP0.hgdp.king\n\n   $ ../apigenome/bin/vcf-infer-ped --kin0 out/genotypes/hgdp/merged.autosomes.gtonly.minDP0.hgdp.king.kin0 --sex out/genotypes/merged/chr1/merged.chr1_1_1000000.sex_map.txt --out out/genotypes/hgdp/merged.autosomes.gtonly.minDP0.hgdp.king.inferred.ped\n```\n\nThe pedigree file inferred by this procedure contains only nuclear\nfamilies and duplicates, in a specialized PED format. When a sample is\nduplicated, all sample IDs representing the duplicated sample (in the\n2nd column) need to be listed, separated by commas. In the 3rd and\n4th columns, which represent the parents, only the representative sample ID (the first\namong the comma-separated duplicate IDs) is required. \n\n### Step 5. 
Run SVM variant filtering guided by the inferred pedigree.\n\nUsing the inferred pedigree, we compute duplicate and Mendelian\nconsistency, and use the information to aid variant filtering. \nFirst, duplicate and Mendelian consistency are computed using the\nfollowing ``milk`` (Mendelian-inheritance under likelihood framework) command:\n\n```\n   $ ../apigenome/bin/cloudify --cmd ../scripts/run-milk-local.cmd\n```\n\nUpon successful completion, we expect to see the following files.\n* ``out/milk/[CHROM]/milk.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].full.vcf.gz``\n* ``out/milk/[CHROM]/milk.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].sites.vcf.gz``\n* ``out/milk/[CHROM]/milk.chr[CHROM]_[BEGIN_1Mb_CHUNK]_[END_1Mb_CHUNK].sites.vcf.gz.tbi``\n\n\nThe results are merged for each chromosome in the following way.\n\n```\n   $ cut -f 1,4,5 index/intervals/b38.intervals.X.10Mb.1Mb.txt | awk '{print \"out/milk/\"$1\"/milk.\"$1\"_\"$2\"_\"$3\".sites.vcf.gz\"}' \u003e out/index/milk.autoX.bcflist.txt\n\n   $ (seq 1 22; echo X;) | xargs -I {} -P 10 bash -c \"grep chr{}_ out/index/milk.autoX.bcflist.txt | ../bcftools/bcftools concat -f /dev/stdin -Oz -o out/milk/milk.chr{}.sites.vcf.gz\"\n\n   $ (seq 1 22; echo X;) | xargs -I {} -P 10 ../htslib/tabix -f -pvcf out/milk/milk.chr{}.sites.vcf.gz \n```\n\nFinally, the SVM filtering step is performed using the\n``vcf-svm-milk-filter`` tool. 
The training is typically done with one\nchromosome, for example ``chr2``.\n\n```\n   $ mkdir out/svm\n\n   $ ../apigenome/bin/vcf-svm-milk-filter --in-vcf out/milk/milk.chr2.sites.vcf.gz --out out/svm/milk_svm.chr2 --ref resources/ref/hs38DH.fa --dbsnp resources/ref/dbsnp_142.b38.vcf.gz --posvcf resources/ref/hapmap_3.3.b38.sites.vcf.gz --posvcf resources/ref/1000G_omni2.5.b38.sites.PASS.vcf.gz --train --centromere resources/ref/hg38.centromere.bed.gz --bgzip ../htslib/bgzip --tabix ../htslib/tabix --invNorm $GC/bin/invNorm --svm-train $GC/bin/svm-train --svm-predict $GC/bin/svm-predict \n```\n\nAfter the training finishes, the other chromosomes use the same\ntrained model to perform SVM filtering.\n\n```\n   $ (seq 1 22; echo X;) | grep -v -w 2 | xargs -I {} -P 10 ../apigenome/bin/vcf-svm-milk-filter --in-vcf out/milk/milk.chr{}.sites.vcf.gz --out out/svm/milk_svm.chr{} --ref resources/ref/hs38DH.fa --dbsnp resources/ref/dbsnp_142.b38.vcf.gz --posvcf resources/ref/hapmap_3.3.b38.sites.vcf.gz --posvcf resources/ref/1000G_omni2.5.b38.sites.PASS.vcf.gz --model out/svm/milk_svm.chr2.svm.model --centromere resources/ref/hg38.centromere.bed.gz --bgzip ../htslib/bgzip --tabix ../htslib/tabix --invNorm $GC/bin/invNorm --svm-train ../libsvm/svm-train --svm-predict ../libsvm/svm-predict \n```\n\nThe details of each step are elaborated below.\n\nVariant Detection Details\n--------------------------\nVariant detection from each sequenced (and aligned) genome is performed by the ``vt discover2`` software tool. The script ``step-1-detect-variants.pl`` provides a means to automate variant detection across a large number of sequenced genomes.\n\nThe variant detection algorithm considers a variant a potential candidate variant if there exists a mismatch between the aligned sequence reads and the reference genome. 
Because such a mismatch can easily occur by random errors, only potential candidate variants passing the following criteria are considered to be ***candidate variants*** in the next steps.\n\n1. At least two identical pieces of evidence for a variant must be observed from aligned sequence reads. \n  1. Each individual piece of evidence will be normalized using the normalization algorithm implemented in the ``vt normalize`` software tool.\n  1. Only evidence on reads with mapping quality 20 or greater will be considered.\n  1. Duplicate reads, QC-failed reads, supplementary reads, and secondary reads will be ignored. \n  1. Evidence of a variant within overlapping fragments of read pairs will not be double counted. Either end of the overlapping read pair will be soft-clipped using the ``bam clipOverlap`` software tool.  \n1. Assuming per-sample heterozygosity of 0.1%, the posterior probability of having a variant at the position should be greater than 50%. The method is equivalent to the `glfSingle` model described in http://www.ncbi.nlm.nih.gov/pubmed/25884587\n\nThe variant detection step is required only once per sequenced genome, even when multiple freezes of variant calls are produced over the course of time.\n\n \nVariant Consolidation\n---------------------\nVariants detected in the discovery step will be merged across all samples. This step is implemented in the ``step-2-detect-variants.pl`` script.\n\n1. Each non-reference allele normalized by the ``vt normalize`` algorithm is merged across the samples, and unique alleles are printed as biallelic candidate variants. The algorithm is published at http://www.ncbi.nlm.nih.gov/pubmed/25701572\n2. If there are alleles overlapping with other SNPs and Indels, ``overlap_snp`` and ``overlap_indel`` filters are added in the ``FILTER`` column of the corresponding variant.\n3. 
If there are tandem repeats with 2 or more repeats and a total repeat length of 6bp or longer, the variant is annotated as a potential VNTR (Variable Number Tandem Repeat), and ``overlap_vntr`` filters are added to the variants overlapping with the repeat track of the putative VNTR.     \n\n\nVariant Genotyping and Feature Collection\n-----------------------------------------\nThe genotyping step iterates over all the merged variant sites across the samples. It iterates over the BAM/CRAM files one at a time, sequentially, for each 1Mb chunk to perform contamination-adjusted genotyping and annotation of variant features for filtering. The following variant features are calculated during the genotyping procedure. \n\n * AVGDP : Average read depth per sample\n * AC : Non-reference allele count\n * AN : Total number of alleles\n * GC : Genotype count\n * GN : Total genotype counts\n * HWEAF : Allele frequency estimated from PL under HWE\n * HWDAF : Genotype frequency estimated from PL under HWD\n * IBC : [ Obs(Het) – Exp(Het) ] / Exp(Het)\n * HWE_SLP : -log(HWE likelihood-ratio test p-value) ⨉ sign(IBC)\n * ABE : Average fraction [#Ref Allele] across all heterozygotes\n * ABZ : Z-score for testing deviation of ABE from expected value (0.5)\n * BQZ : Z-score testing association between allele and base qualities\n * CYZ : Z-score testing association between allele and the sequencing cycle\n * STZ : Z-score testing association between allele and strand\n * NMZ : Z-score testing association between allele and per-read mismatches\n * IOR : log [ Obs(non-ref, non-alt alleles) / Exp(non-ref, non-alt alleles) ]\n * NM1 : Average per-read mismatches for non-reference alleles\n * NM0 : Average per-read mismatches for reference alleles\n\nThe genotyping was done by adjusting for potential contamination. 
It uses an adjusted genotype likelihood similar to the published method https://github.com/hyunminkang/cleancall, but does not use the estimated population allele frequency, for the sake of computational efficiency. It conservatively assumes that the probability of observing a non-reference read given a homozygous reference genotype is equal to half of the estimated contamination level (or 1%, whichever is greater). The probability of observing a reference read given a homozygous non-reference genotype is calculated in a similar way. This adjustment makes heterozygous calls more conservative when the reference and non-reference alleles are strongly imbalanced. For example, if 45 reference alleles and 5 non-reference alleles are observed at Q40, the new method calls a homozygous reference genotype, while the original method, which ignores potential contamination, calls a heterozygous genotype. This adjustment improves the genotype quality of contaminated samples, reducing genotype errors several-fold.\n\nVariant Filtering\n-----------------\nThe variant filtering in TOPMed Freeze 8 was performed by (1) first calculating Mendelian consistency scores using known familial relatedness and duplicates, and (2) training an SVM classifier between the known variant sites (positive labels) and the Mendelian-inconsistent variants (negative labels). \n\nA variant is assigned a negative label if the Bayes Factor for Mendelian consistency, quantified as ``Pr(Reads | HWE, Pedigree) / Pr(Reads | HWD, no Pedigree )``, is less than 0.001. A variant is also assigned a negative label if 3 or more samples show 20% of non-reference Mendelian discordance within families or genotype discordance between duplicated samples.\n\nThe positive labels are the SNPs found polymorphic either in the 1000 Genomes Omni2.5 array or in HapMap 3.3, with additional evidence of being polymorphic from the sequenced samples. Variants eligible for both positive and negative labels are discarded from the labels. 
The SVM scores trained and predicted by the ``libSVM`` software tool will be annotated in the VCF file. \n\nTwo additional hard filters were applied. The first is the excessive heterozygosity filter ``(EXHET)``, applied if the Hardy-Weinberg disequilibrium p-value was less than 1e-6 in the direction of excessive heterozygosity. The other is the Mendelian discordance filter ``(DISC)``, applied when 3 or more Mendelian discordances or duplicate discordances were observed among the samples.\n\nQuestions\n---------\nFor further questions, please contact Hyun Min Kang (hmkang@umich.edu) and Jonathon LeFaive (lefaivej@umich.edu).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstatgen%2Ftopmed_variant_calling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstatgen%2Ftopmed_variant_calling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstatgen%2Ftopmed_variant_calling/lists"}