https://github.com/shujiahuang/gatk_recalibrate_and_realign

GATK Indel Realignment and Quality Recalibration: Deduplicates, realigns, and recalibrates a Mappings object in order to improve the quality of downstream variant calls, per the GATK best practices in variant detection.

Host: GitHub
URL: https://github.com/shujiahuang/gatk_recalibrate_and_realign
Owner: ShujiaHuang
Created: 2014-10-04T15:31:48.000Z (about 11 years ago)
Default Branch: master
Last Pushed: 2013-08-09T17:55:40.000Z (about 12 years ago)
Last Synced: 2025-04-11T01:50:04.072Z (6 months ago)
Homepage: https://platform.dnanexus.com/app/gatk_recalibrate_and_realign
Size: 37.9 MB
Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: Readme.developer.md

GATK Indel Realigner and Quality Recalibration, Advanced Readme
===============================================================

Introduction
------------

The GATK Best Practices pipeline runs a set of programs which realign and recalibrate the quality of previously mapped reads. These steps are recommended by the Broad Institute to get the best calls out of a set of mapped reads. The app consists of three conceptual steps: marking duplicates, realigning around indels, and recalibrating quality scores. This app creates a new table of the recalibrated mappings. Unmapped reads will not be included in the new table.

### Resource Files

Certain steps of the best practices pipeline use files containing known SNPs and indels. GATK recommends using a dbSNP file and two known-indels files. For the b37 reference genome, these can be found in the public project **Reference Genomes** in the **b37** folder.

The dbSNP file is named **dbsnp_135.b37.vcf.gz**

The two indels files are named **Mills_and_1000G_gold_standard.indels.b37.vcf.gz** and **1000G_phase1.indels.b37.vcf.gz**

In both cases, these are [bgzipped](http://samtools.sourceforge.net/tabix.shtml) files. In place of these files, you may use any dbsnp or indels file that would work with command line GATK. The app will run faster if the provided files are bgzipped.
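
Whether a file is bgzipped can be checked from its first bytes. The sketch below is illustrative only and is not part of the app: the function name `is_bgzf` is hypothetical, and it assumes the BGZF layout from the SAM specification (a gzip header with the extra-field flag set and a `BC` subfield of length 2).

```python
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if `header` starts with a BGZF (bgzip) block header."""
    # Gzip magic bytes followed by the deflate compression method.
    if len(header) < 12 or header[0:2] != b"\x1f\x8b" or header[2] != 8:
        return False
    # BGZF requires the FEXTRA flag bit (0x04) to be set.
    if not header[3] & 0x04:
        return False
    # XLEN is a little-endian uint16 giving the size of the extra field.
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    # Walk the extra subfields looking for 'B', 'C' with a payload length of 2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

Reading the first 64 bytes of a file in binary mode and passing them to `is_bgzf` is enough for this check; a plain gzip file fails it because it lacks the `BC` subfield.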

### Marking Duplicates

The first step of the pipeline is to mark all apparent duplicate reads or read pairs. This is done with the [Picard Tools](http://picard.sourceforge.net/) [MarkDuplicates](http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates) program. For unpaired reads, this program finds all reads mapped to the same 5' position, identifies the read with the highest number of bases with a quality score of 15 or higher, and marks all other reads as duplicates. For paired-end reads, it uses the 5' mapping locations of both reads in the pair: it counts the bases with a quality score of 15 or higher across both reads, keeps the highest-scoring pair, and marks as duplicates all other read pairs that share the same 5' mapping locations for both reads. For more information, see the [Picard Tools FAQ](http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:_How_does_MarkDuplicates_work.3F).
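
The unpaired-read rule described above can be sketched in a few lines. This is a simplified illustration of the description in this Readme, not Picard's implementation: the `Read` type and function names are hypothetical, and grouping by strand in addition to 5' position is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Read(NamedTuple):
    name: str
    contig: str
    five_prime_pos: int          # 5' mapping coordinate
    reverse_strand: bool         # assumption: strand is part of the grouping key
    qualities: Tuple[int, ...]   # per-base Phred quality scores


def score(read, min_quality=15):
    """Count the bases whose quality is at least `min_quality`."""
    return sum(1 for q in read.qualities if q >= min_quality)


def mark_duplicates(reads):
    """Return the names of reads that would be marked as duplicates."""
    groups = defaultdict(list)
    for read in reads:
        key = (read.contig, read.five_prime_pos, read.reverse_strand)
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=score)  # first read wins on a tie
        duplicates.update(r.name for r in group if r is not best)
    return duplicates
```

For paired-end reads, the same idea applies with a grouping key built from the 5' locations of both mates and a score summed over both reads.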

With the default parameters, the Picard MarkDuplicates component runs the following command:

    java -jar MarkDuplicates.jar INPUT=input.bam OUTPUT=dedup.bam METRICS=metrics.txt ASSUME_SORTED=true \
        VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true

Parameters such as REMOVE_DUPLICATES can be changed through the app's input parameters.

### Indel Realignment

In order to align reads to a reference in a reasonable amount of time, aligners such as BWA only look at a single read or read pair at a time. As a result, aligners can miss the context provided by other reads mapped to a similar location, which when looked at together would indicate that an indel is present and would provide information about the best way to align each individual read in the region.

In indel realignment, this information is extracted from the mappings and used to realign reads in the vicinity of an apparent indel. This is done with two [GATK](http://www.broadinstitute.org/gatk/) programs, [RealignerTargetCreator](http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_RealignerTargetCreator.html), which identifies regions which may benefit from realignment, and [IndelRealigner](http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_IndelRealigner.html), which performs the realignment.

With the default parameters, the indel realignment component runs the following commands:

    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fa -I input.bam -o indels.intervals

    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa -I input.bam -targetIntervals indels.intervals \
        -o realigned.bam -known indels1.vcf.gz -known indels2.vcf.gz -known ...
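
As an illustration of how the `-known` flag repeats once per known-indels file, the following sketch assembles both command lines as argument lists. The function name and its parameters are hypothetical; the flags and file names are the ones shown in the commands above, and the app's actual code may differ.

```python
def realignment_commands(reference, bam, known_indels, jar="GenomeAnalysisTK.jar"):
    """Build argument lists for RealignerTargetCreator and IndelRealigner."""
    base = ["java", "-jar", jar]

    target_creator = base + [
        "-T", "RealignerTargetCreator",
        "-R", reference, "-I", bam,
        "-o", "indels.intervals",
    ]

    realigner = base + [
        "-T", "IndelRealigner",
        "-R", reference, "-I", bam,
        "-targetIntervals", "indels.intervals",
        "-o", "realigned.bam",
    ]
    # One -known flag per known-indels resource file.
    for vcf in known_indels:
        realigner += ["-known", vcf]

    return target_creator, realigner
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with file names.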

### Quality Recalibration

The quality associated with sequencing runs is an attempt by the sequencing machine to provide a measure of the confidence that the assigned base at a position is correct. These quality scores have been tuned by the makers of the sequencing equipment to, on average, give the best values. However, a number of factors may mean that the quality scores over or under-estimate the quality for bases in particular contexts.

The quality recalibrator uses the fact that most true variants in a sequencing run have already been found in dbSNP to recalibrate the quality scores of the reads: a mismatch at a site that is not a known variant is most likely a sequencing error. It uses a number of parameters which co-vary with the reported quality of the read and corrects for biases associated with these parameters (for example, quality scores for a particular read group may be generally lower, or the quality of a particular dinucleotide may be over-reported).
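
The idea can be sketched with only the two covariates that are always used, read group and reported quality. This is a minimal illustration, not GATK's algorithm: the input layout, the function name, the default cap of 40, and the way smoothing counts are added to both the numerator and the denominator are all assumptions. The only standard part is the Phred definition, Q = -10 log10(error rate).

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_sites, smoothing=1, max_quality=40):
    """Estimate an empirical quality for each (read group, reported quality) bin.

    observations: iterable of (read_group, reported_quality, site, is_mismatch).
    known_sites:  set of sites listed in dbSNP; these are skipped.
    smoothing:    positive pseudo-count that keeps the error rate above zero.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [bases seen, mismatches]
    for read_group, reported_q, site, is_mismatch in observations:
        if site in known_sites:
            continue  # a mismatch here may be a real variant, not an error
        tally = counts[(read_group, reported_q)]
        tally[0] += 1
        tally[1] += int(is_mismatch)

    table = {}
    for key, (total, mismatches) in counts.items():
        error_rate = (mismatches + smoothing) / (total + smoothing)
        table[key] = min(max_quality, round(-10 * math.log10(error_rate)))
    return table
```

A bin whose reported quality is 30 but whose empirical quality comes out near 20 would indicate that the sequencer over-reports quality for that bin.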

This implementation uses the GATK programs CountCovariates, which calculates how various parameters correlate with quality, and TableRecalibration, which performs the recalibration. These programs were renamed and their options changed in GATK 2.0, and the legacy documentation is no longer available.

With the default parameters, the quality recalibration component runs the following commands:

    java -jar GenomeAnalysisTK.jar -T CountCovariates -R ref.fa -recalFile recalibration.csv \
        -I realigned.bam -knownSites dbsnp.vcf.gz

    java -jar GenomeAnalysisTK.jar -T TableRecalibration -R ref.fa -recalFile recalibration.csv \
        -I realigned.bam -o recalibrated.bam --doNotWriteOriginalQuals

### General Options

| Name | Label | Corresponding command line option/Notes | Class/Type | Default |
| --- | --- | --- | --- | --- |
| mappings | Mappings Table(s) | -I (All of the mappings to realign and recalibrate. The app will act on all reads in all tables and the output will be combined) | array of gtables of type LetterMappings | Mandatory |
| reference | Reference Genome | -R | record of type ContigSet | Mandatory |
| output_name | Output Name | | string | Default is empty string and will name the new object the same name of the first mapping object with "Recalibrated and Realigned" added to the end |
| dbsnp | dbSNP | -knownSites | a tar.gz file. If the file has been [bgzip'd](http://samtools.sourceforge.net/tabix.shtml), the program will take advantage of this | Mandatory |
| known_indels | Known Indels | --known (Resources which contain known, common indels in the genome) | an array of tar.gz files | Optional |

### RealignerTargetCreator Options

| Name | Label | Corresponding command line option/Notes | Class/Type | Default |
| --- | --- | --- | --- | --- |
| max_interval_size | Max Interval Size | --maxIntervalSize (Intervals larger than this will not be used in indel realignment) | int | 500 |
| min_reads_locus | Min Reads at Locus | --minReadsAtLocus (Minimum reads at a locus to enable using entropy calculation for indel realignment) | int | 4 |
| mismatch_fraction | Mismatch Fraction | --mismatchFraction (Fraction of base qualities needing to mismatch for a position to have high entropy for indel realignment target creator) | float | 0.0 |
| window_size | Window Size | --windowSize (Window size for calculating entropy or SNP clusters) | int | 10 |

### IndelRealigner Options

| Name | Label | Corresponding command line option/Notes | Class/Type | Default |
| --- | --- | --- | --- | --- |
| consensus_model | Consensus Determination Model | --consensusDeterminationModel (Determines how to compute the possible alternate consensuses in indel realignment. Must be one of the following values: [USE_READS, KNOWNS_ONLY, USE_SW]) | string | Optional |
| lod_threshold | LOD Cleaning Threshold | --LODThresholdForCleaning (LOD threshold above which the cleaner will clean in indel realignment. This is a measure of whether improvement is significant enough to merit realignment. Lower values are recommended in cases of low coverage or looking for indels with low allele frequency) | float | 5.0 |
| entropy_threshold | Entropy Threshold | --entropyThreshold (Percentage of mismatches at a locus to be considered having high entropy in the indel realigner) | float | 0.15 |
| max_consensuses | Max Consensuses | --maxConsensuses (Max alternate consensuses to try - higher numbers improve performance in deep coverage) | int | 30 |
| max_insert_size_movement | Max Insert Movement Size | --maxIsizeForMovement (Maximum insert size of read pairs for which realignment is attempted) | int | 3000 |
| max_position_move | Max Position Move | --maxPositionalMoveAllowed (Maximum positional move in basepairs that a read can be adjusted during realignment) | int | 200 |
| max_reads_consensus | Max Reads for Consensus | --maxReadsForConsensus (Maximum reads used for finding the alternate consensuses - higher numbers improve performance in deep coverage) | int | 120 |
| max_reads_realignment | Max Reads for Realignment | --maxReadsForRealignment (Maximum reads allowed at an interval for realignment) | int | 20000 |

### CountCovariates and TableRecalibration Options

Because several options are shared across CountCovariates and TableRecalibration, their options are listed together. The ReadGroup and ReportedQuality covariates are always used, regardless of the other covariates selected.

| Name | Label | Corresponding command line option/Notes | Class/Type | Default |
| --- | --- | --- | --- | --- |
| solid_recalibration_mode | SOLiD Recalibration Mode | --solid_recal_mode (Only applies to SOLiD sequencing. How to recalibrate bases in which the reference was inserted. If entered, must be one of the following options: [DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, REMOVE_REF_BIAS]) | string | Optional |
| solid_nocall_mode | SOLiD No-call Mode | --solid_nocall_strategy (Only applies to SOLiD sequencing. Defines behavior when a no-call is encountered in color space. If entered, must be one of the following options: [DO_NOTHING, SET_Q_ZERO, SET_Q_ZERO_BASE_N, REMOVE_REF_BIAS]) | string | Optional |
| context_size | Count Covariate Context Size | --context_size (Size of the k-mer context used in count covariates) | int | Optional |
| nback | Count Covariants N-back Size | --homopolymer_nback (The number of previous bases to look at in HomopolymerCovariate) | int | Optional |
| cycle_covariate | Use Cycle Covariate | -cov CycleCovariate (Use cycle covariation in the quality recalibration process) | boolean | true |
| dinuc_covariate | Use Dinucleotide Covariate | -cov DinucCovariate (Use dinucleotide covariation in the quality recalibration process) | boolean | true |
| primer_round_covariate | Use Primer Round Covariate | -cov PrimerRoundCovariate (Use primer round covariation in the quality recalibration process) | boolean | false |
| mapping_quality_covariate | Use Mapping Quality Covariate | -cov MappingQualityCovariate (Use mapping quality covariation in the quality recalibration process) | boolean | false |
| gc_content_covariate | Use GC Content Covariate | -cov GCContentCovariate (Use GC content covariation in the quality recalibration process) | boolean | false |
| position_covariate | Use Position Covariate | -cov PositionCovariate (Use position covariation in the quality recalibration process) | boolean | false |
| minimum_nqs_covariate | Use Minimum NQS Covariate | -cov MinimumNQSCovariate (Use minimum NQS covariation in the quality recalibration process) | boolean | false |
| context_covariate | Use Context Covariate | -cov ContextCovariate (Use context covariation in the quality recalibration process) | boolean | false |
| preserve_qscore | Preserve Q-Scores Less Than | --preserve_qscores_less_than (Do not recalibrate quality scores below this threshold. Since many base callers use quality scores below 5 to indicate random or bad bases, it is often unsafe to recalibrate these bases) | int | 5 |
| smoothing | Smoothing Counts | --smoothing (Number of imaginary counts to add to each bin in order to smooth out bins with few data points) | int | Optional |
| max_quality | Maximum Quality Score | --max_quality_score (The value at which to cap the quality scores) | int | Optional |
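
The boolean covariate options above each translate into one `-cov` flag. The sketch below shows one way such a mapping could be written; the `COVARIATE_OPTIONS` dictionary and `covariate_args` function are hypothetical, with option names, covariate names, and defaults copied from the table, and the app's real code may be organized differently.

```python
# App option name -> (GATK covariate name, default value), taken from the table.
COVARIATE_OPTIONS = {
    "cycle_covariate": ("CycleCovariate", True),
    "dinuc_covariate": ("DinucCovariate", True),
    "primer_round_covariate": ("PrimerRoundCovariate", False),
    "mapping_quality_covariate": ("MappingQualityCovariate", False),
    "gc_content_covariate": ("GCContentCovariate", False),
    "position_covariate": ("PositionCovariate", False),
    "minimum_nqs_covariate": ("MinimumNQSCovariate", False),
    "context_covariate": ("ContextCovariate", False),
}


def covariate_args(app_inputs):
    """Return the -cov arguments for the covariates enabled in `app_inputs`."""
    args = []
    for option, (covariate, default) in COVARIATE_OPTIONS.items():
        # Fall back to the documented default when the option is not supplied.
        if app_inputs.get(option, default):
            args += ["-cov", covariate]
    return args
```

With no inputs supplied, only the cycle and dinucleotide covariates are requested, matching the defaults in the table.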
