{"id":21547970,"url":"https://github.com/illumina/paragraph","last_synced_at":"2025-07-09T18:02:44.810Z","repository":{"id":26928116,"uuid":"111942873","full_name":"Illumina/paragraph","owner":"Illumina","description":"Graph realignment tools for structural variants","archived":false,"fork":false,"pushed_at":"2022-12-08T18:33:38.000Z","size":32304,"stargazers_count":143,"open_issues_count":21,"forks_count":28,"subscribers_count":20,"default_branch":"master","last_synced_at":"2024-03-26T20:19:48.853Z","etag":null,"topics":["genotyping","htslib","structural-variation","variant-calling","vcf"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Illumina.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-11-24T17:39:25.000Z","updated_at":"2024-02-27T14:32:50.000Z","dependencies_parsed_at":"2023-01-14T08:30:34.943Z","dependency_job_id":null,"html_url":"https://github.com/Illumina/paragraph","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Illumina%2Fparagraph","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Illumina%2Fparagraph/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Illumina%2Fparagraph/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Illumina%2Fparagraph/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Illumina","download_url":"https://codeload.github.com/Illumina/paragraph/tar.gz/refs/heads/master","host":{"name":"Gi
tHub","url":"https://github.com","kind":"github","repositories_count":248166927,"owners_count":21058480,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["genotyping","htslib","structural-variation","variant-calling","vcf"],"created_at":"2024-11-24T06:16:56.004Z","updated_at":"2025-04-10T05:53:13.974Z","avatar_url":"https://github.com/Illumina.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Paragraph: a suite of graph-based genotyping tools\n\n\u003c!-- vscode-markdown-toc --\u003e\n* [Introduction](#Introduction)\n* [Installation](#Installation)\n* [Run Paragraph from VCF](#RunParagraphFromVCF)\n    * [Test example](#TestExample)\n    * [Input requirements](#InputRequirements)\n    * [Run time](#RunTime)\n    * [Population-scale genotyping](#PopulationScaleGenotyping)\n* [Run Paragraph on complex variants](#RunParagraphOnComplexVariants)\n* [Further Information](#FurtherInformation)\n\t* [Documentation](#Documentation)\n\t* [External links](#ExternalLinks)\n* [License](#License)\n\n\u003c!-- vscode-markdown-toc-config\n\tnumbering=false\n\tautoSave=true\n\t/vscode-markdown-toc-config --\u003e\n\u003c!-- /vscode-markdown-toc --\u003e\n\n## \u003ca name='Introduction'\u003e\u003c/a\u003eIntroduction\n\nAccurate genotyping of known variants is critical for the analysis of whole-genome sequencing data. 
Paragraph aims to facilitate this by providing an accurate genotyper for structural variants (SVs) using short-read data.\n\nPlease cite Paragraph using:\n\n- Chen et al. (2019) [Paragraph: A graph-based structural variant genotyper for short-read sequence data](https://www.biorxiv.org/content/10.1101/635011v2). *bioRxiv*. doi: https://doi.org/10.1101/635011\n\nGenotyping data from this paper can be found at [paper-data/download-instructions.txt](paper-data/download-instructions.txt).\n\nFor details of population genotyping, please also refer to:\n\n- https://www.illumina.com/science/genomics-research/accurate-genotyping-of-structural-variant.html\n\n## \u003ca name='Installation'\u003e\u003c/a\u003eInstallation\n\nPlease check [doc/Installation.md](doc/Installation.md) for system requirements and installation instructions.\n\n## \u003ca name='RunParagraphFromVCF'\u003e\u003c/a\u003eRun Paragraph from VCF\n### \u003ca name='TestExample'\u003e\u003c/a\u003eTest example\nAfter installation, run the `multigrmpy.py` script from the build/bin directory on an example dataset as follows:\n\n```bash\npython3 bin/multigrmpy.py -i share/test-data/round-trip-genotyping/candidates.vcf \\\n                          -m share/test-data/round-trip-genotyping/samples.txt \\\n                          -r share/test-data/round-trip-genotyping/dummy.fa \\\n                          -o test\n```\n\nThis runs a simple genotyping example for two test samples.\n*  **candidates.vcf**: specifies candidate SV events in VCF format.\n*  **samples.txt**: a manifest that specifies the test BAM files. 
Tab or comma delimited.\n*  **dummy.fa**: a short dummy reference that contains only `chr1`.\n\nThe output folder `test` then contains the final genotypes as gzipped VCF and JSON files:\n\n```bash\n$ tree test\n```\n```\ntest\n├── grmpy.log            #  main workflow log file\n├── genotypes.vcf.gz     #  Output VCF with individual genotypes\n├── genotypes.json.gz    #  More detailed output than genotypes.vcf.gz\n├── variants.vcf.gz      #  The input VCF with unique ID from Paragraph\n└── variants.json.gz     #  The converted graphs from input VCF (no genotypes)\n```\n\nIf successful, the last 3 lines of genotypes.vcf.gz will be the same as in the [expected file](share/test-data/round-trip-genotyping/expected-vcf-record.txt).\n\n## \u003ca name='InputRequirements'\u003e\u003c/a\u003eInput requirements\n### VCF format\nparaGRAPH will independently genotype each entry of the input VCF. You can use either indel-style representation (full REF and ALT allele sequence in 4th and 5th columns) or symbolic alleles, as long as they meet the format requirements of VCF 4.0+.\n\nCurrently we support 4 symbolic alleles:\n- `\u003cDEL\u003e` for deletion\n    - Must have END key in INFO field.\n- `\u003cINS\u003e` for insertion\n    - Must have a key in INFO field for insertion sequence (without padding base). The default key is SEQ.\n    - For blockwise swap, we strongly recommend using indel-style representation rather than symbolic alleles.\n- `\u003cDUP\u003e` for duplication\n    - Must have END key in INFO field. paraGRAPH assumes that the sequence between POS and END is duplicated one more time in the alternative allele.\n- `\u003cINV\u003e` for inversion\n    - Must have END key in INFO field. paraGRAPH assumes that the sequence between POS and END is reverse-complemented in the alternative allele.\n\n### Sample Manifest\nMust be tab-delimited.\n\nRequired columns:\n- id: Each sample must have a unique ID. 
The output VCF will include genotypes for all samples in the manifest.\n- path: Path to the BAM/CRAM file.\n- depth: Average depth across the genome. Can be calculated with bin/idxdepth (faster than samtools).\n- read length: Average read length (bp) across the genome.\n\nOptional columns:\n\n- depth sd: Specify standard deviation for genome depth. Used for the normal test of breakpoint read depth. Default is sqrt(5*depth).\n- depth variance: Square of depth sd.\n- sex: Affects chrX and chrY genotyping. Allowed values are \"male\" or \"M\", \"female\" or \"F\", and \"unknown\" (quotes shouldn't be included in the manifest). If not specified, the sample will be treated as unknown.\n\n## \u003ca name='RunTime'\u003e\u003c/a\u003eRun time\n\n- On a 30x HiSeqX sample, Paragraph typically takes 1-2 seconds to genotype a simple SV in confident regions.\n\n- If the SV is in a low-complexity region with abnormal read pileups, the running time could vary.\n\n- For efficiency, it is recommended to manually set the \"-M\" option (maximum allowed read count for a variant) to skip these high-depth regions. We recommend setting \"-M\" to 20 times your mean sample depth.\n\n## \u003ca name='PopulationScaleGenotyping'\u003e\u003c/a\u003ePopulation-scale genotyping\n\nTo efficiently genotype SVs across a population, we recommend running in single-sample mode as follows:\n- Create a manifest for each sample.\n- Run `multigrmpy.py` for each manifest. Be sure to set the \"-M\" option for each sample according to its depth.\n- Multithreading (option \"-t\") is highly recommended for population-scale genotyping.\n- Merge all `genotypes.vcf.gz` files to create one VCF of all samples. You can use either `bcftools merge` or your own custom script.\n\n## \u003ca name='RunParagraphOnComplexVariants'\u003e\u003c/a\u003eRun Paragraph on complex variants\nFor more complicated events (e.g. 
genotyping a deletion together with a nearby SNP), you can provide a customized JSON to paraGRAPH.\n\nPlease follow the pattern in [example JSON](share/test-data/paragraph/pg-het-ins/pg-het-ins.json) and make sure all required keys are provided. Here is a visualization of this [sample graph](share/test-data/paragraph/pg-het-ins/pg-het-ins.png).\n\nTo obtain graph alignments for this graph (including all reads), run:\n```bash\nbin/paragraph -b \u003cinput BAM\u003e \\\n              -r \u003creference fasta\u003e \\\n              -g \u003cinput graph JSON\u003e \\\n              -o \u003coutput JSON path\u003e \\\n              -E 1\n```\n\nTo obtain the alignment summary, genotypes of each breakpoint, and the whole graph, run:\n```bash\nbin/grmpy -m \u003cinput manifest\u003e \\\n          -r \u003creference fasta\u003e \\\n          -i \u003cinput graph JSON\u003e \\\n          -o \u003coutput JSON path\u003e \\\n          -E 1\n```\n\nIf you have multiple events listed in the input JSON, `multigrmpy.py` can help you run multiple `grmpy` jobs together.\n\n## \u003ca name='FurtherInformation'\u003e\u003c/a\u003eFurther Information\n\nPlease check the GitHub wiki for common usage questions and errors.\n\n### \u003ca name='Documentation'\u003e\u003c/a\u003eDocumentation\n\n*    More **information about all tools we provide in this package** can be found in \n    [doc/graph-tools.md](doc/graph-tools.md).\n\n*   In [doc/graph-models.md](doc/graph-models.md) we describe the graph and genotyping \n    models we implement.\n\n*    Some developer documentation about our code analysis and testing process can be found in \n    [doc/linting-and-testing.md](doc/linting-and-testing.md).\n\n*    Procedures for read-level alignment validation \n    [doc/validation-with-simulated-reads.md](doc/validation-with-simulated-reads.md).\n\n*    How we count reads for variants and paths\n    [doc/graph-counting.md](doc/graph-counting.md).\n\n*    Documentation of genotyping model 
parameters\n    [doc/genotyping-parameters.md](doc/genotyping-parameters.md).\n\n*   [doc/graphs-ashg-2017.pdf](doc/graphs-ashg-2017.pdf) contains the poster about this method that we showed at \n    [ASHG 2017](http://www.ashg.org/2017meeting/).\n\n### \u003ca name='ExternalLinks'\u003e\u003c/a\u003eExternal links\n\n*   The [Illumina/Polaris](https://github.com/Illumina/Polaris) repository provides the\n    short-read sequencing data we used to test our method at population scale.\n\n## \u003ca name='License'\u003e\u003c/a\u003eLicense\n\nThe [LICENSE](LICENSE) file contains information about the libraries and other tools we use, \nand license information for these.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fillumina%2Fparagraph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fillumina%2Fparagraph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fillumina%2Fparagraph/lists"}