https://github.com/zilong-li/svupp
genotype Structural Variants Using Pre-Phased Reads
genotyping long-read-sequencing nextflow structual-variation
- Host: GitHub
- URL: https://github.com/zilong-li/svupp
- Owner: Zilong-Li
- License: mit
- Created: 2025-05-28T10:16:28.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-12-24T08:36:22.000Z (6 months ago)
- Last Synced: 2025-12-24T23:49:11.724Z (6 months ago)
- Topics: genotyping, long-read-sequencing, nextflow, structual-variation
- Language: Nextflow
- Homepage: https://doi.org/10.1093/bioinformatics/btaf587
- Size: 57.6 KB
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.org
- License: LICENSE
README
#+title: Structural Variants Genotyping Using Pre-Phased Reads
#+author: Zilong Li
#+email: zilong.dk@gmail.com
#+options: toc:2 num:nil email:t -:nil ^:nil
#+STARTUP: show2levels indent hidestars hideblocks
[[https://github.com/Zilong-Li/SVUPP/blob/main/.github/workflows/main.yml][https://github.com/Zilong-Li/SVUPP/actions/workflows/main.yml/badge.svg]]
[[https://doi.org/10.5281/zenodo.17227286][https://zenodo.org/badge/DOI/10.5281/zenodo.17227287.svg]]
Please cite our [[https://doi.org/10.1093/bioinformatics/btaf587][paper]] using the following BibTeX entry:
#+begin_src bibtex
@article{li2025SVUPP,
title = {Pre-Phasing Long Reads Improves Structural Variant Genotyping},
author = {Li, Zilong and St{\ae}ger, Frederik Filip and Davies, Robert W and Moltke, Ida and Albrechtsen, Anders},
year = {2025},
month = oct,
journal = {Bioinformatics},
pages = {btaf587},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btaf587}
}
#+end_src
* Quick Start
#+begin_src shell
# Download
git clone https://github.com/Zilong-Li/SVUPP
cd SVUPP
# Download example data from 1KG
bash ./scripts/download-examples.sh
# Run after preparing the sample sheet and reference panel
#   -profile   conda, docker or singularity
#   --refpanel sheet for the phased reference panel
#   --samples  sample sheet with long reads
#   --svfile   a known SV list for genotyping
nextflow run main.nf \
  -profile conda \
  --refpanel tests/refpanel.csv \
  --samples tests/samples.csv \
  --svfile tests/delins.sniffles.hg38.liftedT2T.13Nov2023.nygc.vcf.gz
# Output Structure
# results
# ├── cutesv2
# │ ├── NA12878.vcf.gz # Final VCF with SV genotypes
# │ ├── NA12878.vcf.gz.tbi
# │ └── versions.yml
# ├── prepared_reference_rdata.csv
# ├── quilt2_impute
# │ ├── batch1
# │ └── versions.yml
# ├── quilt2_phase
# │ ├── batch1
# │ └── versions.yml
# ├── quilt2_prepare_chunk
# │ ├── chr21.csv
# │ ├── chr22.csv
# │ └── versions.yml
# ├── quilt2_prepare_reference
# │ ├── RData
# │ └── versions.yml
# └── samples_read_labels.csv
# See nextflow.config for advanced and customization parameters
#+end_src
* Table of Contents :toc:quote:noexport:
#+BEGIN_QUOTE
- [[#quick-start][Quick Start]]
- [[#introduction][Introduction]]
- [[#usage][Usage]]
- [[#step-0-install-nextflow][Step 0: install Nextflow]]
- [[#step-1-configure-the-workflow][Step 1: configure the workflow]]
- [[#step-2-choose-a-container][Step 2: choose a container]]
- [[#step-3-run-the-workflow][Step 3: run the workflow]]
- [[#output][Output]]
- [[#evaluation][Evaluation]]
- [[#qa][Q&A]]
- [[#which-reference-panel-should-i-use][Which reference panel should I use?]]
- [[#what-if-i-already-have-the-prepared-reference-panel-ie-the-rdata-from-quilt2][What if I already have the prepared reference panel, i.e the RData, from QUILT2?]]
- [[#speedup-quilt2-for-a-large-reference-panel][Speedup QUILT2 for a large reference panel]]
- [[#what-if-i-already-have-read-labels-either-from-quilt2-or-other-read-phasing-program][What if I already have read labels either from QUILT2 or other read phasing program?]]
- [[#whats-the-advantages-of-quilt2-vs-whatshap][What's the advantages of QUILT2 vs WhatsHap?]]
- [[#will-this-pipeline-support-whatshap][Will this pipeline support WhatsHap?]]
#+END_QUOTE
* Introduction
SVUPP is a pipeline that improves SV genotyping accuracy by incorporating per-read phasing information into the genotype likelihoods. Currently, it first uses [[https://github.com/rwdavies/QUILT][QUILT2]] to phase long reads against a SNP reference panel. It then uses a forked version of [[https://github.com/Zilong-Li/cuteSV/tree/v0.0.2][cuteSV2 (aka cuteFC)]] to assign SV signals to each read, followed by our genotyping formula, which incorporates the haplotype probability of each read:
$$
\text{P}(X\vert H=(H_1,H_2)) \propto \prod_{r=1}^{D} \sum_{h\in\{1,2\}} \text{P}(X_r\vert H_h) \text{P}(\text{hap}_r=h)
$$
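The formula above can be tried by hand on a toy input. The sketch below is only an illustration, not the code used by SVUPP or cuteSV2. It reads $D$ as the number of reads, assumes the per-read likelihoods under the reference and alternative alleles are already known, and evaluates the log of the product for each ordered haplotype pair. The function name =gl_prephased= and its three-column input are hypothetical.

#+begin_src shell
# Hypothetical helper (not part of SVUPP): log-likelihood of each ordered
# haplotype pair (H1|H2) for one biallelic SV, following the formula above.
# Input on stdin, one read per line, whitespace-separated:
#   P(X_r | ref)   P(X_r | alt)   P(hap_r = 1)
gl_prephased() {
  awk '
    BEGIN { split("0|0 0|1 1|0 1|1", g, " "); for (i = 1; i <= 4; i++) ll[i] = 0 }
    NF >= 3 {
      pref = $1; palt = $2; p1 = $3; p2 = 1 - $3
      for (i = 1; i <= 4; i++) {
        # allele carried by haplotype 1 and by haplotype 2 under this pair
        a1 = substr(g[i], 1, 1); a2 = substr(g[i], 3, 1)
        l1 = (a1 == "1") ? palt : pref
        l2 = (a2 == "1") ? palt : pref
        # sum over h in {1,2}, then accumulate the log of the product over reads
        ll[i] += log(l1 * p1 + l2 * p2)
      }
    }
    END { for (i = 1; i <= 4; i++) printf "%s\t%.4f\n", g[i], ll[i] }
  '
}

# Toy input (made-up numbers): one alt-supporting read assigned to haplotype 1
# and one ref-supporting read assigned to haplotype 2.
printf '0.1 0.9 0.99\n0.9 0.1 0.01\n' | gl_prephased
#+end_src

In this toy input the pair =1|0= gets the highest log-likelihood, because both reads are confidently assigned to the haplotype that matches their signal. With uninformative labels ($\text{P}(\text{hap}_r=1)=0.5$ for every read) the two heterozygous orderings become indistinguishable.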
* Usage
** Step 0: install Nextflow
Please follow the official [[https://www.nextflow.io/docs/latest/install.html][guideline]] to install the latest Nextflow with DSL2 support.
** Step 1: configure the workflow
There are two main CSV files you need to prepare, e.g. =tests/samples.csv= and =tests/refpanel.csv=. Check out the [[file:tests/README.org][README there]]. In addition, to configure the parameters of the workflow, modify the =nextflow.config= or use Nextflow command options (for Nextflow experts).
** Step 2: choose a container
SVUPP supports /Docker/, /Singularity/ and /Conda/. You can therefore choose one of the three profiles defined in =nextflow.config=: /docker/, /singularity/ or /conda/. *NB*: if you use the /singularity/ or /docker/ profile, you have to set =params.container= to the local image path. Check out the [[file:containers/README.org][README there]] for building your local container images, or download one from https://doi.org/10.5281/zenodo.17227286. If you use the /conda/ profile, we recommend activating a conda environment before running SVUPP. The first run may also take a while, because conda has to resolve the environment; how long depends on the conda version and your internet connection.
** Step 3: run the workflow
You can =git clone= this workflow to any path and run it without changing into the =SVUPP= directory:
#+begin_src shell
#   -profile   conda, docker or singularity
#   --refpanel sheet for the phased reference panel
#   --samples  sample sheet with long reads
#   --svfile   a list of known SVs
nextflow run SVUPP/main.nf \
  -profile conda \
  --refpanel SVUPP/tests/refpanel.csv \
  --samples SVUPP/tests/samples.csv \
  --svfile SVUPP/tests/delins.sniffles.hg38.liftedT2T.13Nov2023.nygc.vcf.gz
#+end_src
If you are new to Nextflow, here is a quick guide.
| Functionality | Nextflow Command | Important Note |
|------------------------------+------------------+------------------------------|
| Run job in the background | run -bg | DO NOT use nohup or & |
| Resume from the cached tasks | run -resume | Can work with specific hash |
| Data cache directory         | run -w dir       | Defaults to 'work'           |
| Output directory             | run --outdir     | Defaults to 'results'        |
| Max parallel processes       | run -qs          | No default                   |
| Logging history | log | Find the status of past runs |
* Output
All output files are saved in the output folder you specify when running Nextflow, which defaults to *results*. Here are the details:
| Genotyped VCF: | results/cuteSV2/$sampleid.vcf.gz |
| Read labels: | results/samples_read_labels.csv |
| Prepared reference: | results/prepared_reference_rdata.csv |
* Evaluation
For benchmarking studies, it is important to evaluate the results stratified by SV complexity and by call rate, which is controlled by the GQ threshold. You can do this easily with the latest [[https://github.com/Zilong-Li/vcfppR?tab=readme-ov-file#vcfcomp-compare-two-vcf-files-and-report-concordance][vcfppR]] package (version >= 0.8.2).
#+begin_src R
#remotes::install_github("Zilong-Li/vcfppR") ## use the latest github version
library(vcfppR)
svvcf <- system.file("extdata", "platinum.sv.vcf.gz", package="vcfppR")
svuppvcf <- system.file("extdata", "svupp.call.vcf.gz", package="vcfppR")
truth <- vcftable(svvcf)
truth$neighbors <-as.integer(sub(".*NumNeighbors=([^;]+).*", "\\1", truth$info))
truth <- subset(truth, neighbors == 0) ## subset biallelic SVs
res <- vcfcomp(svuppvcf, truth, stats = "gtgq")
vcfplot(res, col = 2,cex = 2, lwd = 3, type = "l", bty = 'l')
#+end_src
* Q&A
** Which reference panel should I use?
In principle, choose the one with matched ancestry or a large one with multiple admixed populations, e.g., the UK Biobank. However, in our benchmarking with the Platinum data, we found there was no difference in accuracy between using the UK Biobank and the 1000 Genomes Project. You can download the prepared 1000 Genomes reference panel in RData format for QUILT2 here http://popgen.dk/zilong/datahub/1KGP/quilt2_refpanel_hg38/RData/. See the next section on how to use it directly.
** What if I already have the prepared reference panel, i.e the RData, from QUILT2?
1. Prepare a sheet with two columns named 'chunk_id' and 'refpanel_rdata', such as http://popgen.dk/zilong/datahub/1KGP/quilt2_refpanel_hg38/prepared_reference_rdata.csv.
#+begin_src shell
chunk_id,refpanel_rdata
chr22.48718618.55783303,/home/zilong/Projects/SVUPP/work/f2/f9b51191685bdf2fa893e394a834af/RData/QUILT_prepared_reference.chr22.48718618.55783303.RData
chr22.38068017.44734586,/home/zilong/Projects/SVUPP/work/9b/6e3c921ecb41b2ebe01c8f0d4935ab/RData/QUILT_prepared_reference.chr22.38068017.44734586.RData
chr22.30094765.34092463,/home/zilong/Projects/SVUPP/work/89/b4676a75daf1e493c82e90d8bf1bdd/RData/QUILT_prepared_reference.chr22.30094765.34092463.RData
#+end_src
2. Run Nextflow
#+begin_src shell
#   -profile   conda, docker or singularity
#   --refdata  the sheet with prepared RData for the reference panel
#   --samples  sample sheet with long reads
#   --svfile   VCF with the SVs to genotype
nextflow run main.nf \
  -profile conda \
  --refdata prepared_reference_rdata.csv \
  --samples tests/samples.csv \
  --svfile /path/to/vcf/with/sv.vcf
#+end_src
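The sheet from step 1 can also be generated from a directory of prepared RData files. The sketch below assumes the files follow the =QUILT_prepared_reference.<chunk_id>.RData= naming seen in the example sheet; the function name =make_refdata_sheet= is hypothetical and not part of SVUPP.

#+begin_src shell
# Hypothetical helper (not part of SVUPP): print a chunk_id,refpanel_rdata sheet
# for every QUILT_prepared_reference.<chunk_id>.RData file under a directory.
# Pass an absolute directory so that the paths in the sheet are absolute too.
make_refdata_sheet() {
  rdata_dir=$1
  echo "chunk_id,refpanel_rdata"
  find "$rdata_dir" -type f -name 'QUILT_prepared_reference.*.RData' | sort |
    while IFS= read -r path; do
      file=${path##*/}                         # drop the directory part
      chunk=${file#QUILT_prepared_reference.}  # drop the fixed prefix
      chunk=${chunk%.RData}                    # drop the extension
      printf '%s,%s\n' "$chunk" "$path"
    done
}

# Usage: make_refdata_sheet /path/to/RData > prepared_reference_rdata.csv
#+end_src

Deriving =chunk_id= from the file name keeps the sheet consistent with the files on disk. If your RData files are named differently, adjust the prefix and suffix stripping accordingly.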
** Speedup QUILT2 for a large reference panel
QUILT2 can run much faster if it only imputes common variants, which matters for a large reference panel where the majority of SNPs are rare. With that in mind, SVUPP runs QUILT2 with =--impute_rare_common=FALSE= by default, which disables imputation of rare variants. To enable it, modify the =nextflow.config= file and set =quilt_extra_args= to ='--impute_rare_common=TRUE'=.
** What if I already have read labels either from QUILT2 or other read phasing program?
*First*, prepare a sheet with two columns named 'sample' and 'label', such as:
#+begin_src shell
sample,label
NA12877,/home/zilong/Projects/SVUPP/work/6c/f6daadafa1fdf4e90c6c8de4c39181/1/NA12877.haptag.tsv
NA12878,/home/zilong/Projects/SVUPP/work/6c/f6daadafa1fdf4e90c6c8de4c39181/1/NA12878.haptag.tsv
#+end_src
The label column stores the path to a space-separated file with no header and the first three columns being =qname,phasing_prob,hap=. An example:
| A00217:76:HFLT3DSXX:4:1457:26015:15984 | 0.999 | 1 |
| A00296:43:HCLHLDSXX:2:2502:19642:31219 | 0.999 | 2 |
| A00217:76:HFLT3DSXX:1:1336:4616:23359 | 0.500025147658519 | 1 |
*Second*, run Nextflow
#+begin_src shell
#   -profile      conda, docker or singularity
#   --read_labels the sheet associating each sample with its read label file
#   --samples     sample sheet with long reads
#   --svfile      VCF with the SVs to genotype
nextflow run main.nf \
  -profile conda \
  --read_labels samples_read_labels.csv \
  --samples tests/samples.csv \
  --svfile /path/to/vcf/with/svs
#+end_src
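When the labels come from another read phasing program, it may help to check each label file against the layout described above before running the pipeline. This is a minimal sketch, assuming whitespace-separated columns; the function name =check_read_labels= is hypothetical and not part of SVUPP.

#+begin_src shell
# Hypothetical helper (not part of SVUPP): check that a read-label file has at
# least three columns (qname, phasing_prob, hap), that phasing_prob is a number
# in [0, 1], and that hap is 1 or 2. Exits non-zero if any line fails.
check_read_labels() {
  awk '
    NF < 3 { printf "line %d: expected at least 3 columns\n", NR; bad = 1; next }
    $2 !~ /^[0-9.eE+-]+$/ || $2 < 0 || $2 > 1 {
      printf "line %d: bad phasing_prob %s\n", NR, $2; bad = 1
    }
    $3 != 1 && $3 != 2 { printf "line %d: bad hap %s\n", NR, $3; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# Usage: check_read_labels /path/to/NA12878.haptag.tsv && echo "labels look fine"
#+end_src

The check only covers the file layout. It says nothing about whether the labels themselves are accurate.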
** What's the advantages of QUILT2 vs WhatsHap?
There are two main reasons why [[https://github.com/rwdavies/QUILT][QUILT2]] is chosen.
- QUILT2 is better than the alternatives at phasing *low-to-medium coverage* (<10x) reads.
- Users only need the aligned long reads of the target samples and a publicly available SNP reference panel, which are easy to obtain (at least for human projects).
However, for some non-human projects, where a public reference panel is rarely available, WhatsHap may be a good alternative, at the cost of obtaining high-quality SNP calls, which are normally generated by high-coverage short-read sequencing of the target samples.
** Will this pipeline support WhatsHap?
It would be nice to have, time permitting. PRs are welcome!