{"id":40316313,"url":"https://github.com/coopsor/svpg","last_synced_at":"2026-01-20T07:00:58.919Z","repository":{"id":306035813,"uuid":"997424916","full_name":"coopsor/SVPG","owner":"coopsor","description":"Pangenome-based structural variation caller","archived":false,"fork":false,"pushed_at":"2026-01-14T08:57:48.000Z","size":891,"stargazers_count":11,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-01-14T12:12:52.887Z","etag":null,"topics":["bioinformatics","hifi","nanopore","pangenome","rare-variant","somatic-variant","structural-variations"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/coopsor.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-06T14:00:03.000Z","updated_at":"2026-01-14T08:55:06.000Z","dependencies_parsed_at":"2025-07-23T10:22:07.008Z","dependency_job_id":"9cbc1e10-e98d-4908-aba8-9bcc9d7ff525","html_url":"https://github.com/coopsor/SVPG","commit_stats":null,"previous_names":["coopsor/svpg"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/coopsor/SVPG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coopsor%2FSVPG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coopsor%2FSVPG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coopsor%2FSVPG/releases","manifests_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/coopsor%2FSVPG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/coopsor","download_url":"https://codeload.github.com/coopsor/SVPG/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coopsor%2FSVPG/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28597985,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-20T02:08:49.799Z","status":"ssl_error","status_checked_at":"2026-01-20T02:08:44.148Z","response_time":117,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","hifi","nanopore","pangenome","rare-variant","somatic-variant","structural-variations"],"created_at":"2026-01-20T07:00:57.152Z","updated_at":"2026-01-20T07:00:58.905Z","avatar_url":"https://github.com/coopsor.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SVPG\n[![PyPI version](https://img.shields.io/pypi/v/svpg.svg)](https://pypi.org/project/svpg/)\n[![Anaconda-Server Badge](https://anaconda.org/bioconda/svpg/badges/version.svg)](https://anaconda.org/bioconda/svpg)\n[![Anaconda-Server Badge](https://anaconda.org/bioconda/svpg/badges/license.svg)](https://anaconda.org/bioconda/svpg)\n[![Anaconda-Server 
Badge](https://anaconda.org/bioconda/svpg/badges/platforms.svg)](https://anaconda.org/bioconda/svpg)\n[![Anaconda-Server Badge](https://anaconda.org/bioconda/svpg/badges/latest_release_date.svg)](https://anaconda.org/bioconda/svpg)\n\n## Overview\n\u003ctable style=\"border-collapse: collapse; border: none; padding: 0; margin: 0; width: 100%;\"\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center; vertical-align: middle; font-family: monospace; white-space: pre; font-size: 14px; padding: 0; margin: 0;\"\u003e\n\u003cpre style=\"margin: 0; line-height: 1;\"\u003e\n████ █     █ ████   ████ \n█    █     █ █   █ █     \n████  █   █  ████  █ ███ \n   █   █ █   █     █   █ \n████    █    █      ████ \n\u003c/pre\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"vertical-align: middle; padding: 0; margin: 0;\"\u003e\n      \u003cdiv style=\"margin: 0 auto\"\u003e\n\u003cb\u003eSVPG\u003c/b\u003e (Structural Variant detection based on Pangenome Graph) is a computational tool designed for structural variation (SV) detection and efficient pangenome graph augmentation. With the growing availability of long-read sequencing data and pangenome references, SVPG fills a critical gap by enabling accurate SV discovery and scalable integration of new genomes into existing pangenome graphs.\n      \u003c/div\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\u003cdiv style=\"text-align: center; margin-top: 10px;\"\u003e\n  \u003cimg src=\"doc/overview.jpg\" alt=\"SVPG illustration\" style=\"max-width: 100%; height: auto;\"\u003e\n\u003c/div\u003e\n\n\n## Key Features\n\n* **Dual SV detection modes**:\n\n  * **Pangenome-guided mode**: Extracts SV-supporting reads from BAM files and realigns them to a pangenome reference graph. 
It then analyzes the graph alignment's topological and path-transition features to detect germline SVs with high precision.\n  * **Graph-based mode**: Directly resolves reads-to-graph alignments to discover _de novo_ SVs within haplotype paths of the pangenome graph, ideal for conducting reference-bias-free low-frequency/somatic SV discovery without relying on prior SV databases or annotations.\n* **Highly sensitive and accurate SV detection**: Demonstrates superior performance in benchmarking against state-of-the-art SV callers across both population-wide germline and individual-specific SVs.\n* **Rapid graph augmentation**: Designed to work seamlessly with the graph-call mode, it accelerates pangenome augmentation by nearly an order of magnitude compared to traditional _de novo_ assembly methods on cohorts of dozens of samples, enabling fast and scalable integration of new samples.\n\n## Contents\n* [Installation](#installation)\n* [Requirements](#requirements)\n* [Usage](#usage)\n  * [1. Pangenome-Guided SV Detection](#1-pangenome-guided-sv-detection)\n  * [2. Graph-Based SV Detection](#2-graph-based-sv-detection)\n  * [3. Pangenome Graph Augmentation](#3-pangenome-graph-augmentation)\n* [Parameters](#parameters)\n* [Limitations](#limitations)\n* [Citation](#citation)\n* [Contact](#contact)\n\n\n## Installation\n\n```bash\n$ pip install svpg\n# or\n$ conda install svpg\n# or\n$ git clone https://github.com/coopsor/SVPG.git \u0026\u0026 cd SVPG/ \u0026\u0026 pip install . 
\n```\n\n## Requirements\n* Python \u003e= 3.10 (tested on v3.10.4)\n* pysam \u003e= 0.22 for BAM file processing\n* numpy \u003e= 1.26.4 for numerical computing\n* scipy \u003e= 1.13.1 for scientific computing\n* [pyabpoa](https://github.com/yangao07/abPOA/tree/main/python) \u003e= 1.5.4 for consensus sequence generation\n\nThe following tools must be available on your system path (we recommend installing them via conda):\n* [minigraph](https://github.com/lh3/minigraph) \u003e= 0.21 for pangenome graph alignment in pangenome-guided mode\n* [mappy](https://github.com/lh3/minimap2/tree/master/python) \u003e= 2.28 for consensus sequence realignment in pangenome-guided mode\n* bcftools \u003e= 1.20 for VCF processing in augmentation mode\n* truvari \u003e= 3.1.0 for VCF merging in augmentation mode\n\n## Usage\n\n### 1. Pangenome-Guided SV Detection\n* Pangenome-guided mode requires read-to-reference alignments in a coordinate-sorted and indexed BAM file as input. If you start with sequencing reads (e.g., FASTA/FASTQ files), you need to map them to a linear reference genome first.\n* SVPG supports parallel processing and uses 16 threads by default. This value can be changed with the `--num_threads`/`-t` option (e.g., `-t 4`).\n* SVPG was evaluated on the first and second releases of the HPRC pangenome graphs ([v3.1](https://zenodo.org/records/10693675) and [v4.1](https://zenodo.org/records/16728828)). Benchmark results indicate that SVPG achieves nearly identical performance on both versions. \n* By default, SVPG outputs all SVs supported by more than one read. In pangenome-guided mode, users can filter the genotyped variants with `FILTER=PASS` to obtain a higher-confidence SV set.\n In addition, users may manually adjust the minimum read support threshold with the `--min_support`/`-s` parameter based on sequencing depth, using the following table as a reference. 
This is particularly useful for ultra-low-coverage datasets (\u003c10×) to preserve recall, as well as for graph-based mode, where genotyping is not available.\n\n  | Depth (×) | ONT | HiFi |\n  |-----------|-----|------|\n  | \u003c10       | 2   | 1    |\n  | [10, 20)  | 3   | 2    |\n  | [20, 50)  | 4   | 3    |\n  | ≥50       | 10  | 4    |\n\n```bash\nsvpg call --working_dir svpg_out/ --bam sample.bam --ref hg38.fa --gfa pangenome.gfa --read ont\n```\nThe output file `variants.vcf` is saved in the specified working directory. The `-o` option can be used to specify the output file name.\n\n### 2. Graph-Based SV Detection\n* Graph-based mode requires read-to-graph alignments in GAF format as input. If you start with sequencing reads (e.g., FASTA/FASTQ files), you need to map them to a pangenome graph first. We recommend producing the alignments with [minigraph](https://github.com/lh3/minigraph).\n* Since minigraph by default outputs [stable coordinates](https://github.com/lh3/gfatools/blob/master/doc/rGFA.md#the-graph-alignment-format-gaf) in [rGFA](https://github.com/lh3/gfatools/blob/master/doc/rGFA.md) format, SVPG requires the `--vc` option to be enabled during alignment to support more general GFA formats (e.g., [GraphAligner](https://github.com/maickrau/GraphAligner) alignment results).\n\n```bash\nminigraph -cx lr --vc -t 64 pangenome.gfa sample.fasta \u003e sample.gaf \nsvpg graph-call --working_dir svpg_out/ --ref hg38.fa --gfa pangenome.gfa --gaf sample.gaf --read ont -s 3\n```\n\n* SVPG leverages a pangenome as a panel for filtering germline and population-level SVs, and therefore outputs tumor-only SVs by default. 
For tumor/normal paired analysis, we recommend running the two samples separately and then integrating the results with our script to achieve optimal performance.\n```bash\nsvpg graph-call --working_dir tumor_out/ --ref hg38.fa --gfa pangenome.gfa --gaf tumor.gaf --read hifi -s 3\nsvpg graph-call --working_dir normal_out/ --ref hg38.fa --gfa pangenome.gfa --gaf normal.gaf --read hifi -s 1\npython scripts/vcf_specific.py tumor_out/variants.vcf normal_out/variants.vcf tumor_specific.vcf\n```\nThis procedure selects SVs that are present in the tumor sample but absent from the matched normal.\n\n### 3. Pangenome Graph Augmentation\nSVPG provides a streamlined pipeline to rapidly embed _de novo_ SVs detected from graph-based alignment back into the pangenome graph.\nTo use this feature, users should place one directory per new sample, each containing that sample's raw sequencing data (e.g., FASTA/FASTQ files), under the specified `working_dir` path. For example:\n```bash\nworking_dir/\n├── sample_1/\n│   └── sample_1.fasta\n├── sample_2/\n│   └── sample_2.fasta\n```\nSVPG automatically detects SVs in graph-based mode and processes the resulting VCFs for graph augmentation; the output file `augment.gfa` is written to the given working directory. 
\n```bash\nsvpg augment --working_dir svpg_out/ --ref hg38.fa --gfa pangenome.gfa --read hifi\n```\nAlternatively, you may provide a .tsv file listing the paths to FASTA files of new samples.\nFor example, the sample.tsv file may look like this (sample names must differ from one another):\n`/path/to/sample_1.fasta \\n /path/to/sample_2.fasta`\nThen run the command `svpg augment --working_dir svpg_out/ --sample_list sample.tsv --ref hg38.fa --gfa pangenome.gfa --read hifi` \n\n## Parameters\n| Parameter               | Description                                                                                                                                                       | Default                                                                            |\n|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------|\n| `--working_dir`         | Specify the working directory to store output files.                                                                                                              | Required                                                                           |\n| `--bam`                 | Coordinate-sorted and indexed BAM file with aligned long reads.                                                                                                   | Required for `call` mode                                                           |\n| `--gaf`                 | GAF file with long reads aligned to the pangenome graph (.gaf).                                                                                                   
| Required for `graph-call` mode                                                     |\n| `--ref`                 | The reference genome used for pangenome construction (.fa); it also serves as the coordinate system for SVPG’s SV call output.                                    | Required                                                                           |\n| `--gfa`                 | Pangenome reference file that the long reads were aligned to (.gfa).                                                                                              | Required                                                                           |\n| `--read`                | Type of sequencing reads: `ont` for Oxford Nanopore, `hifi` for PacBio HiFi.                                                                                      | hifi                                                                               |\n| `--min_support`/`-s`    | Minimum read support threshold for SV calling. Adjust based on sequencing depth.                                                                                  | 2                                                                                  |\n| `--num_threads`/`-t`    | Number of threads to use for parallel processing.                                                                                                                 | 16                                                                                 |\n| `--min_mapq`            | Minimum mapping quality for reads to be considered in SV detection.                                                                                               | 20                                                                                 |\n| `--min_sv_size`         | Minimum size of SVs to be detected.                                                                                                                               
| 50                                                                                 |\n| `--max_sv_size`         | Maximum size of SVs to be detected. Set to -1 for unlimited size (recommended for somatic SVs in `graph-call` mode).                                              | 1,000,00                                                                           |\n| `--max_merge_threshold` | Maximum distance of SV signals to be merged.                                                                                                                      | 50 for hifi read and 500 for ont read                                              |\n| `--ultra_split_size`    | Ignore extremely large BNDs from split alignments unless supported by enough reads, which may be regarded as false-negative intra-chromosomal translocation.      | 1000000                                                                            |\n| `--alt_consensus`       | Generate alternative allele consensus sequences for insertion using pyabpoa.                                                                                      | Disabled                                                                           |\n| `--noseq`               | Disable sequence extraction for SVs. Useful for ultra-large SVs to save time and disk space.                                                                      | Disabled                                                                           |\n| `--types`               | Specify the types of SVs to call: DEL, INS, DUP, INV, BND. Separate multiple types with commas.                                                                   | DEL,INS,DUP,INV,BND                                                                |\n| `--contigs`             | Specify the list of chromosomes on which to call SVs (e.g., `--contigs chr1 chr2 chrX`).                                                                          
| All chromosomes                                                                    |   \n| `--skip_genotype`       | Skip genotyping step to speed up the process for `call` mode.                                                                                                     | Disabled                                                                           |\n| `--realign`             | Realign the noise reads to the reference for more accurate SV sequence inference for `call` mode.                                                                 | Disabled                                                                           |\n| `--sample_list`         | Path to a TSV file listing the paths to FASTA files of new samples for `augment` mode.                                                                            | Optional; if not provided, all FASTA files under `working_dir` will be processed.  |\n| `--skip_call`           | Skip SV calling step and directly proceed to graph augmentation using existing VCF files in the working directory.                                                | Disabled                                                                           |\n| `--out`/`-o`            | Specify the output file name.                                                                                                                                     | `variants.vcf` for `call` and `graph-call` modes, `augment.gfa` for `augment` mode |\n| `--version`/`-v`        | Show the version of SVPG.                                                                                                                                         | N/A                                                                                |\n| `--help`/`-h`           | Show help message and exit.                                                                                                                                       
| N/A                                                                                | \n\n## Limitations\n* SVPG's pangenome-guided mode relies on minigraph to realign SV signature reads to the pangenome graph. Although this step introduces some overhead, this process is relatively fast: in our tests on the HG002 sample, realignment took approximately 10 minutes for ONT (50×) data and 4 minutes for HiFi (48×) data.\n* The `--realign` module provides more accurate breakpoint resolution in regions that are hard to align to the graph (for example, [LCRs](https://arxiv.org/html/2509.23057v1#bib.bib20)). On the latest HG002-Q100 benchmark, this module yields measurable performance improvements.\nHowever, it relies on pyabpoa and mappy to perform local re-alignment, which introduces additional computational overhead (e.g., ~1 hour extra for 48× HG002 HiFi data).\nAs this feature is still experimental, we recommend enabling it only in analyses that require base-pair–level breakpoint accuracy.\n* The graph-based mode currently does not support genotyping. Users should manually adjust the minimum read support threshold using the `--min_support`/`-s` parameter based on sequencing depth to balance sensitivity and precision.\n \n## Citation\nRefer to our [paper](https://doi.org/10.1101/2025.07.11.664486) for further details and citation:\n\nHu, H. et al. SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples. bioRxiv, 2025.07.11.664486 (2025).\n\n## Contact\n\nFor questions or support, please open an issue on GitHub or contact the authors at [hhengwork@gmail.com](mailto:hhengwork@gmail.com).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoopsor%2Fsvpg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcoopsor%2Fsvpg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoopsor%2Fsvpg/lists"}