{"id":24535001,"url":"https://github.com/georgiesamaha/fq2vcf","last_synced_at":"2025-03-15T22:42:29.599Z","repository":{"id":158019232,"uuid":"427248895","full_name":"georgiesamaha/fq2vcf","owner":"georgiesamaha","description":"Bash scripts for fastq to vcf pipeline, written for USyd Artemis HPC. ","archived":false,"fork":false,"pushed_at":"2023-02-04T07:00:06.000Z","size":56,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-22T11:43:34.862Z","etag":null,"topics":["bam","bioinformatics","gatk","genomics","indels","mapping","ngs","pipeline","snps","variant-calling","vcf"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/georgiesamaha.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-12T05:56:59.000Z","updated_at":"2024-09-28T08:01:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"57296445-4dff-4d09-86f7-a9d29d54e259","html_url":"https://github.com/georgiesamaha/fq2vcf","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georgiesamaha%2Ffq2vcf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georgiesamaha%2Ffq2vcf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georgiesamaha%2Ffq2vcf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georgiesamaha%2Ffq2vcf/manifests","owner_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/georgiesamaha","download_url":"https://codeload.github.com/georgiesamaha/fq2vcf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243801609,"owners_count":20350106,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bam","bioinformatics","gatk","genomics","indels","mapping","ngs","pipeline","snps","variant-calling","vcf"],"created_at":"2025-01-22T11:31:27.336Z","updated_at":"2025-03-15T22:42:29.585Z","avatar_url":"https://github.com/georgiesamaha.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# fq2vcf  @ USyd Artemis  \n\nThis repo contains bash scripts for fastq to vcf pipeline, written for USyd Artemis HPC. This pipeline follows GATK/BROAD best practice recommendations for alignment and germline variant calling. Workflow requires users to download and prepare reference assembly and then run a series of scripts to align raw reads to the reference assembly, perform base quality score recalibration and perform variant calling with GATK. These scripts have not been optimised and are run as a vanilla implementation.  \n\nInstructions on how to install this repo, prepare, and run scripts follow below. 
\n\n\n## Set up and installation  \n  \nBefore you begin, clone this repo to your `/project/\u003cProject\u003e` directory: \n\n```\nmodule load git\ngit clone https://github.com/georgiesamaha/fq2vcf.git\n```\n* All scripts can be found in, and should be run from, `/project/\u003cProject\u003e/Scripts`  \n* All logs will be output into `/project/\u003cProject\u003e/Scripts/Logs`  \n\nThese scripts assume you will be running your scripts from `/project/\u003cProject\u003e` and outputting files to `/scratch/\u003cProject\u003e`. You will need to edit variables in each script to suit your project before running it. You will generally need to edit the project, reference and config variables in all scripts. Any variable that needs to be edited before running sits at the top of the script, under the heading `# edit these to match your project (specify full path)`.\n     \n\nUpon cloning this repo from GitHub you will have the following directory structure in `/project/\u003cProject\u003e`:  \n\n\n```\n/project/\u003cProject\u003e/\n├── Apps\n├── Reference\n└── Scripts\n    ├── Logs\n    └── Configs\n  \n```\n\n- `Apps` should be used to house the MultiQC singularity image file (.sif) you download (instructions below) and any other tools you choose to install for downstream analyses. \n- `Reference` should house any reference assembly files including .fasta (and its indexes), .dict, .gtf and population .vcf files. \n- `Scripts` houses all scripts to be run as well as PBS and tool output logs. Run all scripts from `/project/\u003cProject\u003e/Scripts`. If a run fails, check the corresponding log file for the source of the error. Most scripts are written as PBS array jobs. To check their progress in the queue as they run, type `jobstat` on the command line.   
\n\nThese scripts expect the following directory structure in `/scratch/\u003cProject\u003e`:   \n \n```\n/scratch/\u003cProject\u003e/\n├── Bams\n├── Fastq\n└── VCFs\n```\n \n- `Fastq`: copy your fastq files here before you begin. FastQC quality reports are also written under this directory.   \n- `Bams` will house all BAM files, including split, discordant, and final BAMs, as well as BAM summary stats.\n- `VCFs` will house all g.vcf files and the cohort VCF file.    \n\n### Software \nAll tools run in this pipeline are globally installed on Artemis, with the exception of MultiQC. The tools used in these scripts and their versions are:  \n\n * fastqc/0.11.8\n * multiqc/1.9  \n * samtools/1.10\n * bwa/0.7.17\n * samblaster/0.1.24\n * singularity/3.7.0\n * gatk/4.2.1.0\n\nThe MultiQC step is optional: it is used in this pipeline to create aggregate quality reports for fastqs and bams. To download the multiQC/1.9 singularity container, run the following from the Scripts directory:    \n```\nbash prepmultiqc.sh\n``` \n\n## User guide   \n\n### 1. Prepare config file \n\nThe config file must have one row per unique sample, matching the format:\n\n|ArrayID|SampleID|Breed |Platform|SeqCentre |FQ1 |FQ2 |\n|-------|--------|------|--------|----------|----|----|\n|1      |Sample1 |Breed1|Illumina|Ramaciotti|/PATH/TO/FQ_R1.GZ | /PATH/TO/FQ_R2.GZ|\n|2      |Sample2 |Breed1|Illumina|Ramaciotti|/PATH/TO/FQ_R1.GZ | /PATH/TO/FQ_R2.GZ|\n\n   - ArrayID: job number for PBS job scheduler. Sample 1 will be run as \n   - SampleID: the unique identifier used to recognise which FASTQs belong to the same sample.\n   - Platform: type of sequencing platform \n   - SeqCentre: where the samples were sequenced; this will be stored in the final BAM files\n   - FQ1: full path of the corresponding R1 fastq file \n   - FQ2: full path of the corresponding R2 fastq file \n\nSave the config file to the `/project/\u003cProject\u003e/Scripts/Configs` directory. 
It will be used to run each sample as a separate PBS array task in the scripts that run in parallel by sample.  \n\n### 2. Download and prepare reference assembly   \n\nDownload the reference genome from Ensembl's FTP site using wget. You can also use this opportunity to download a population VCF file and a GTF file for downstream analyses. For example:\n\n```\nwget http://ftp.ensembl.org/pub/release-104/fasta/felis_catus/dna/Felis_catus.Felis_catus_9.0.dna.toplevel.fa.gz\nwget http://ftp.ensembl.org/pub/release-104/variation/vcf/felis_catus/felis_catus.vcf.gz\nwget http://ftp.ensembl.org/pub/release-104/gtf/felis_catus/Felis_catus.Felis_catus_9.0.104.gtf.gz\n```\n\nIndex the reference assembly for samtools, GATK and BWA by running `prepare_reference.pbs`. Edit the ref variable before running.   \n\n```\nqsub prepare_reference.pbs \n```\n\n### 3. Check quality of fastq files with FastQC \n  \nThis step will produce quality reports for all fastq files. Each fastq file will be run as a separate task and the tasks are processed in parallel. For an explanation of reports, see the [FastQC documentation](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).   \n\nTo run FastQC in parallel, create a `fastq.config` file and save it in `/project/\u003cProject\u003e/Scripts/Configs`. It should have the following format, with no header:  \n\n|#ArrayID|Fastq|\n|---|-----------------------------|\n|1  |Sample1_R1_001.fastq.gz|\n|2  |Sample1_L001_R2_001.fastq.gz|\n|3  |Sample2_L001_R1_001.fastq.gz|\n|4  |Sample2_L001_R2_001.fastq.gz|\n|5  |Sample3_L001_R1_001.fastq.gz|\n|6  |Sample3_L001_R2_001.fastq.gz|\n  \nEdit variables in the script according to the needs of your project and run FastQC for each fastq file in parallel with: \n  ```\n  qsub fastqc.pbs\n  ```\nFastQC reports are output to `/scratch/\u003cProject\u003e/Fastq/FastQC`. \n\n### 4. 
Aggregate FastQC reports with MultiQC \n  \nAggregate reports of FastQC results can also be produced for all FASTQ files using [MultiQC](https://multiqc.info/docs/), if desired. Edit the variables in the script according to the needs of your project and run the script with: \n  \n```\nbash multiqc_fq.sh\n```\n\nThe MultiQC aggregate report for fastq files is output to `/scratch/\u003cProject\u003e/Fastq`. \n  \n### 5. Align raw reads to reference with bwa-mem\n  \nAlign raw reads to the reference assembly with bwa-mem for each sample in parallel. Discordant and split reads will be extracted from the final alignment file and saved as .sam files. These files can later be used for structural variant calling if necessary. This process will take approximately 24 hours. If the job fails because it exceeds walltime, edit the `#PBS -l walltime=HH:MM:SS` directive to give the job more time. This may occur for higher coverage samples. Edit the relevant variables in the script and run the script with: \n  \n```\nqsub align.pbs\n```\n  \nIndexed final.bam files as well as split.sam and disc.sam files are output to `/scratch/\u003cProject\u003e/Bams`  \n  \n### 6. Perform base quality score recalibration \n\nThis step is optional and requires a set of known population-level variants. It is a data pre-processing step that detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call. Base quality score recalibration is performed in two steps. See [GATK's BQSR documentation](https://gatk.broadinstitute.org/hc/en-us/articles/360035890531-Base-Quality-Score-Recalibration-BQSR-) for more information. This step should take ~3-4 hours. Edit the relevant variables in the script and run the script with: \n\n```\nqsub bqsr.pbs\n```\nIndexed final.bam files are output to `/scratch/\u003cProject\u003e/Bams`  \n   \n### 7. 
Collect alignment summary stats \n\nCollect summary metrics for final.bam files with [Samtools flagstat](http://www.htslib.org/doc/samtools-flagstat.html). This step also runs MultiQC to create an aggregate report for all final.bam files. If you do not want to run MultiQC, comment out the singularity-multiqc command in the script. Edit the relevant variables in the script and run the script with: \n\n```\nqsub bamsummaries.pbs\n```\n  \nFlagstat summaries and MultiQC aggregate reports are output to `/scratch/\u003cProject\u003e/Bams`     \n  \n### 8. Call variants with GATK's HaplotypeCaller \n\nRun GATK's [HaplotypeCaller tool](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller) for sample-level variant calling, in parallel across samples (NB: this will be rewritten to parallelise across each chromosome in the future). HaplotypeCaller calls SNPs and indels simultaneously and can handle non-diploid and pooled experimental data. Edit the relevant variables in the script and run with:   \n\n  \n```\nqsub callvariants.pbs\n```\n\nA g.vcf.gz file will be produced for each sample and output to `/scratch/\u003cProject\u003e/VCFs`  \n  \n### 9. Joint call variants \n\nThis step performs joint genotyping on one or more samples pre-called with HaplotypeCaller, using GATK's CombineGVCFs and GenotypeGVCFs tools. Edit the relevant variables in the script and run with:  \n\n```\nqsub jointcallvariants.pbs\n```\n\nA cohort vcf.gz file will be output to `/scratch/\u003cProject\u003e/VCFs`  \n\n### 10. Collect VCF summary stats \n  \nFiltering or variant quality score recalibration of the final VCF is recommended to filter out false positive variants. 
Run the following script to annotate variants with generic filtering thresholds and summarise variant outputs: \n \n```\nqsub filter_summarise_vcf.pbs\n```  \n  \n \n## Resources \n[Artemis user guide](https://sydneyuni.atlassian.net/wiki/spaces/RC/pages/185827329/Artemis+User+Guide)   \n[Artemis job queues](https://sydneyuni.atlassian.net/wiki/spaces/RC/pages/220988168/Queue+resource+limits)  \n[GATK BQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360035890531-Base-Quality-Score-Recalibration-BQSR-)   \n[GATK parallelism](https://gatk.broadinstitute.org/hc/en-us/articles/360035532012-Parallelism-Multithreading-Scatter-Gather)   \n[VCF file format](https://samtools.github.io/hts-specs/VCFv4.2.pdf)   \n[SAM/BAM file format](https://samtools.github.io/hts-specs/SAMv1.pdf)   \n[Fastq file format](https://sapac.support.illumina.com/bulletins/2016/04/fastq-files-explained.html)  \n  \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeorgiesamaha%2Ffq2vcf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgeorgiesamaha%2Ffq2vcf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeorgiesamaha%2Ffq2vcf/lists"}