{"id":24360495,"url":"https://github.com/simonsfoundation/pipeline","last_synced_at":"2026-02-16T16:09:02.806Z","repository":{"id":34670673,"uuid":"38643304","full_name":"simonsfoundation/pipeline","owner":"simonsfoundation","description":"validated, scalable, minimalist variant detection pipeline","archived":false,"fork":false,"pushed_at":"2018-07-17T20:58:13.000Z","size":196,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-01-18T21:32:46.296Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Makefile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simonsfoundation.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-06T20:12:00.000Z","updated_at":"2020-10-29T19:49:41.000Z","dependencies_parsed_at":"2022-09-14T18:02:46.452Z","dependency_job_id":null,"html_url":"https://github.com/simonsfoundation/pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonsfoundation%2Fpipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonsfoundation%2Fpipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonsfoundation%2Fpipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonsfoundation%2Fpipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simonsfoundation","download_url":"https://codeload.github.com/simonsfoundation/pipeline/tar.gz/refs/heads/master",
"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243188204,"owners_count":20250452,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-18T21:30:17.894Z","updated_at":"2025-10-08T05:33:27.703Z","avatar_url":"https://github.com/simonsfoundation.png","language":"Makefile","funding_links":[],"categories":[],"sub_categories":[],"readme":"## *pipeline*\n\n### Overview\n\n*pipeline* is a computational engine for genetic variant detection in\na single sample, or in a batch of samples. \nAlmost every step in the pipeline is done via a *Makefile* (GNU make). These\nmakefiles can be used on their own to accomplish common bioinformatics operations, or\nthey can be used together in a shell script to compose a pipeline. \n\n*pipeline* is well suited for processing large numbers of family-based cohorts, and has been deployed on 462 families from the SPARK collection at the Simons Foundation.\n\n### From BAM files to de novo germline mutations\n\n   - Input: BAM file(s) \n   - Optionally process BAM files according to GATK best practices\n   - Compute callable regions, and subdivide the genome into bins of approximately\n   equal size for parallelization\n   - Call variants with a choice of GATK HaplotypeCaller in GVCF mode, Freebayes, or Platypus \n   - Apply GATK variant recalibration\n   - Apply hard variant filters\n   \nValidation was done against CEUTrio and NA12878.\n\n### Dependencies\n\nBesides *Python 2.7, JDK, GNU make*, the following packages are required:\n   1. [GATK](https://www.broadinstitute.org/gatk/)\n   2. 
[freebayes](https://github.com/ekg/freebayes)\n   3. [platypus](http://www.well.ox.ac.uk/platypus)\n   4. [picard](http://broadinstitute.github.io/picard/)\n   5. [samtools](http://samtools.sourceforge.net/)\n   6. [sambamba](https://github.com/lomereiter/sambamba)\n   7. [bedtools](http://bedtools.readthedocs.org/en/latest/content/installation.html)\n   8. [bedops](https://github.com/bedops/bedops)\n   9. [bgzip, tabix](https://github.com/samtools/htslib)\n   10. [bcftools](https://github.com/samtools/bcftools)\n   11. [vcflib](https://github.com/ekg/vcflib)\n   12. [SnpEff](http://snpeff.sourceforge.net/)\n   \nAll of these except GATK can be installed by installing [bcbio-nextgen](https://github.com/chapmanb/bcbio-nextgen).\n\n### Installation\n\n```\ncd ~\ngit clone https://github.com/simonsfoundation/pipeline.git\n```\n\n### Running\nTo configure the pipeline, edit the *include_example.mk* file, which defines *Makefile* variables.\n\nThe pipeline is run as follows:\n\n```\n~/pipeline/ppln/pipe03.sh     \\\n/path/to/input/bams/       \\ #dir with bam file(s)\n/path/to/output/dir        \\ #will be created for final output, metrics, log. 
\nfamilycode                 \\ # common prefix for a group of BAM files (familycode*.bam): family, batch etc;If your BAM files do not have a common prefix, create one via symbolic links\nWG                         \\ #binning method EX, WG(recommended)\n0                          \\ #set to 0; if set to 1, pipeline will use existing regions (testing purposes)\ntmp                        \\ #if \"tmp\", work in /tmp, else work in output dir\n~/pipeline/ppln/include_example.mk \\ #makefile with variable definition\nYES                         \\ #if YES/NO - delete/don't delete intermediate files\n,Reorder,FixGroups,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,Freebayes,Platypus,HaplotypeCallerGVCF, \\\n1                          \\#if 1, remove working dir on exit\n/path/to/pipeline/ppln     \\\n20                  \\#max number of physical cpu cores to utilize\nall  \\ # 1-12 if process only region defined in /ppln/data, all - work on full file \nNO    # YES/NO delete/not delete input bam files\n```\n\n### Cluster environments\n   - Grid Engine\n   - Slurm\n\nSubmitting via *sbatch*\n```\nsbatch -N1 --exclusive -J batch2 -e batch2.err -o batch2.out --wrap=\"/mnt/xfs1/home/asalomatov/projects/pipeline/ppln/pipe03.sh /mnt/xfs1/scratch/asalomatov/data/SPARK/bam/batch_2 /mnt/xfs1/scratch/asalomatov/data/SPARK/vars/b2/all batch2 WG 0 work /mnt/xfs1/home/asalomatov/projects/pipeline/ppln/include_example.mk YES ,FixGroups,HaplotypeCallerGVCF,Platypus,Freebayes, 0 /mnt/xfs1/home/asalomatov/projects/pipeline/ppln 25 all NO\"\n```\n\n### Validation\n([freebayes tutorial](http://clavius.bc.edu/~erik/CSHL-advanced-sequencing/freebayes-tutorial.html), [bcbio blog post](http://bcb.io/2014/10/07/joint-calling/))\n\n#### NA12878\n \n1. 
Download chromosome 20 high coverage bam file, Broad Institute's truth set, and NIST Genome in a Bottle target regions:\n```\nwget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130103_high_cov_trio_bams/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam\nwget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130103_high_cov_trio_bams/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai\nwget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz\nwget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz.tbi\nwget -O NA12878-callable.bed.gz ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST/union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed.gz\n```\n\n2. Run the pipeline:\n```\nsbatch -J NA12878 -N 1 --exclusive ~/pipeline/ppln/pipe03.sh ./ ./NA12878 NA12878 WG 0 tmp ~/pipeline/ppln/include_example.mk 0 ,Reorder,FixGroups,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,HaplotypeCaller,Freebayes,Platypus,HaplotypeCallerGVCF,RecalibVariants, 1 ~/pipeline/ppln/ 20 all\n```\n\n3. Restrict our consideration to chromosome 20, and to the confidently callable regions:\n```\nmkdir chr20\ntabix -h NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz 20 | vcfintersect -b NA12878-callable.bed | bgzip -c \u003e chr20/NA12878.wgs.broad_truth_set.20131119-chr20.vcf.gz\ntabix -p vcf chr20/NA12878.wgs.broad_truth_set.20131119-chr20.vcf.gz\n```\nDo the same to our variant calls.\n\n4. 
Use ```vcf-compare``` to gauge the concordance between our calls and the true positives from the truth set:\n```\nzcat NA12878.wgs.broad_truth_set.20131119-chr20.vcf.gz | grep \"^#\\|TruthStatus=TRUE_POSITIVE\" | bgzip -c \u003e NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz\ntabix -p vcf NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz\n```\n\nFor Haplotype Caller:\n```\nvcf-compare NA12878-HC-vars-flr-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN\nVN\t1298\tNA12878-HC-vars-flr-call.vcf.gz (1.7%)\nVN\t1740\t../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (2.3%)\nVN\t73287\t../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (97.7%)\tNA12878-HC-vars-flr-call.vcf.gz (98.3%)\n```\n\nFor Haplotype Caller in GVCF mode:\n```\nvcf-compare NA12878-JHC-vars-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN\nVN\t1400\tNA12878-JHC-vars-call.vcf.gz (1.9%)\nVN\t1729\t../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (2.3%)\nVN\t73298\t../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (97.7%)\tNA12878-JHC-vars-call.vcf.gz (98.1%)\n```\n\nFor Freebayes:\n```\nvcf-compare NA12878-FB-vars-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN\nVN\t445\tNA12878-FB-vars-call.vcf.gz (0.6%)\nVN\t3131\t../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (4.2%)\nVN\t71896\t../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (95.8%)\tNA12878-FB-vars-call.vcf.gz (99.4%)\n```\n\nFor Platypus:\n```\nvcf-compare NA12878-PL-vars-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN\nVN\t2486\t../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (3.3%)\nVN\t3071\tNA12878-PL-vars-call.vcf.gz (4.1%)\nVN\t72541\t../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (96.7%)\tNA12878-PL-vars-call.vcf.gz (95.9%)\n```\n\n#### CEU Trio\n\n1. 
Download chromosome 20 alignments for [NA12878, NA12891, NA12892](http://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130103_high_cov_trio_bams/), and create symbolic links:\n```\nln -s NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam CEUTrio.NA12878.chr20.20120522.bam\nln -s NA12891.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam CEUTrio.NA12891.chr20.20120522.bam\nln -s NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam CEUTrio.NA12892.chr20.20120522.bam\nln -s NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai CEUTrio.NA12878.chr20.20120522.bam.bai\nln -s NA12891.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai CEUTrio.NA12891.chr20.20120522.bam.bai\nln -s NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai CEUTrio.NA12892.chr20.20120522.bam.bai\n```\n\n2. Download a set of high confidence calls for the trio:\n```\nwget -O GiaB_NIST_RTG_v0_2.vcf.gz ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/GIAB_integration/NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz\ntabix -f -p vcf GiaB_NIST_RTG_v0_2.vcf.gz\n```\n\n3. Run the pipeline:\n```\nsbatch -J CEUTrio -N 1 --exclusive ~/pipeline/ppln/pipe03.sh ./ ./CEUTrio CEUTrio WG 0 tmp ~/pipeline/ppln/include_example.mk 0 ,Reorder,FixGroups,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,HaplotypeCaller,Freebayes,Platypus,HaplotypeCallerGVCF,RecalibVariants, 1 ~/pipeline/ppln/ 20 all\n```\n\n4. Filter out loci where the proband is HomRef; ```SnpSift``` can be used for this:\n```\nvt normalize -r $GENOMEREF CEUTrio-FB-vars.vcf.gz | java -jar SnpSift.jar filter \" GEN[0].GT != '0/0' \u0026 GEN[0].GT != '0|0' \" | vcfintersect -b ../../NA12878-callable.bed | bgzip -c \u003e CEUTrio-FB-vars-NoHomRef-call.vcf.gz\n```\n\n5. 
Compare with *vcf-compare*.\n\nHaplotype Caller:\n```\nvcf-compare CEUTrio/CEUTrio-HC-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz  | grep ^VN\nVN  456 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (0.7%)\nVN  5978    CEUTrio/CEUTrio-HC-vars-NoHomRef-call.vcf.gz (7.9%)\nVN  69434   CEUTrio/CEUTrio-HC-vars-NoHomRef-call.vcf.gz (92.1%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (99.3%)\n```\n\nHaplotype Caller in GVCF mode:\n```\nvcf-compare CEUTrio/CEUTrio-JHC-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz  | grep ^VN\nVN  470 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (0.7%)\nVN  5653    CEUTrio/CEUTrio-JHC-vars-NoHomRef-call.vcf.gz (7.5%)\nVN  69420   CEUTrio/CEUTrio-JHC-vars-NoHomRef-call.vcf.gz (92.5%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (99.3%)\n```\n\nFreebayes:\n```\nvcf-compare CEUTrio/CEUTrio-FB-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz  | grep ^VN\nVN  791 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (1.1%)\nVN  3784    CEUTrio/CEUTrio-FB-vars-NoHomRef-call.vcf.gz (5.2%)\nVN  69099   CEUTrio/CEUTrio-FB-vars-NoHomRef-call.vcf.gz (94.8%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (98.9%)\n```\n\nPlatypus:\n```\nvcf-compare CEUTrio/CEUTrio-PL-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz  | grep ^VN\nVN  857 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (1.2%)\nVN  7769    CEUTrio/CEUTrio-PL-vars-NoHomRef-call.vcf.gz (10.1%)\nVN  69033   CEUTrio/CEUTrio-PL-vars-NoHomRef-call.vcf.gz (89.9%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (98.8%)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonsfoundation%2Fpipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonsfoundation%2Fpipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonsfoundation%2Fpipeline/lists"}