{"id":26904018,"url":"https://github.com/quinlan-lab/pathoscore","last_synced_at":"2025-07-27T17:05:12.952Z","repository":{"id":92119679,"uuid":"97071922","full_name":"quinlan-lab/pathoscore","owner":"quinlan-lab","description":"pathoscore evaluates variant pathogenicity tools and scores.","archived":false,"fork":false,"pushed_at":"2022-03-25T14:31:29.000Z","size":162,"stargazers_count":21,"open_issues_count":5,"forks_count":8,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-04-01T10:49:20.111Z","etag":null,"topics":["pathogenic-variants","score","variants","vcfanno"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/quinlan-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-07-13T02:33:56.000Z","updated_at":"2024-01-19T10:13:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"656d3614-97a1-4988-9423-d857378d1f68","html_url":"https://github.com/quinlan-lab/pathoscore","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/quinlan-lab/pathoscore","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quinlan-lab%2Fpathoscore","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quinlan-lab%2Fpathoscore/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quinlan-lab%2Fpathoscore/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quinlan-lab%2Fpathoscore/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/quinlan-lab","download
_url":"https://codeload.github.com/quinlan-lab/pathoscore/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quinlan-lab%2Fpathoscore/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267392533,"owners_count":24079917,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-27T02:00:11.917Z","response_time":82,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pathogenic-variants","score","variants","vcfanno"],"created_at":"2025-04-01T10:49:23.867Z","updated_at":"2025-07-27T17:05:12.927Z","avatar_url":"https://github.com/quinlan-lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"pathoscore\n==========\n\npathoscore evaluates variant pathogenicity tools and scores.\n\nEvaluating scores is hard because the logic can be circular, and benign and pathogenic sets are\nhard to curate and evaluate.\n\n`pathoscore` is software and datasets that facilitate evaluating pathogenicity scores.\n\nThe sections below describe the tools.\n\nAnnotate\n--------\n\nAnnotate a VCF with some scores (which can be BED or VCF).\nNote that this tool is a simple wrapper around [vcfanno](https://github.com/brentp/vcfanno), so\na user can instead run vcfanno directly.\n\n```\npython pathoscore.py annotate \\\n    --scores exac-ccrs.bed.gz:exac_ccr:14:max \\\n    --scores 
mpc.regions.clean.sorted.bed.gz:mpc_regions:5:max \\\n    --exclude /data/gemini_install/data/gemini_data/ExAC.r0.3.sites.vep.tidy.vcf.gz \\\n    --conf combined-score.conf \\\n    testing-denovos.vcf.gz\n```\n\nThe individual flags are described here:\n\n### scores\n\nThe `scores` format is `path:name:column:op` where:\n\n+ name becomes the new name in the INFO field.\n+ column indicates the column number (or INFO name) to pull from the scores VCF.\n+ op is a `vcfanno` operation.\n\n+ multiple annotations from the same file can be given as comma-separated lists:\n```\npython pathoscore.py annotate --prefix benign \\\n --scores score-sets/GRCh37/aloft/aloft.txt.gz:aloft_het,aloft_lof,aloft_rec:5,6,7:max,max,max \\\n truth-sets/GRCh37/clinvar/clinvar-benign.20170905.vcf.gz\n```\n\n### exclude\n\nCan be a population VCF that is used to filter would-be pathogenic variants (as we know that common variants can't be pathogenic). This can also be a set of regions to exclude; for user convenience, we curated gene sets that the user can filter on, such as autosomal dominant genes from Berg et al. (2013) and haploinsufficient genes from Dang et al. 
(2008).\n\n### conf\n\nAn optional [vcfanno](https://github.com/brentp/vcfanno) conf file so users can specify exactly\nhow to annotate if they feel comfortable doing so.\n\nThis can also be used to specify vcfanno `[[postannotation]]` blocks, for example, to combine scores.\n\nAn example `conf` to combine 2 scores looks like:\n\n```\n[[postannotation]]\nname=\"combined\"\nop=\"lua:exac_ccr+10*cadd\"\nfields=[\"exac_ccr\", \"cadd\"]\ntype=\"Float\"\n```\n\nEvaluate\n--------\n\n```\npython pathoscore.py evaluate \\\n    -s MPC \\\n    -s exac_ccr \\\n    -i mpc_regions \\\n    -s combined \\\n    --goi listofgenesofinterest \\\n    pathogenic.vcf.gz \\\n    benign.vcf.gz\n```\n\nThis takes the output(s) from `annotate` and creates ROC curves and score distribution plots.\nIt assumes that the first VCF contains pathogenic variants and the second contains benign variants.\nIt uses the columns specified via `-s` and `-i` as the scores.\n\n`-i` indicates that lower scores are more constrained, whereas\n\n`-s` is for fields where higher scores are more constrained.\n\n`--goi` provides a newline-delimited file of genes of interest for a clinical utility calculation. More information is provided in the [wiki](https://github.com/quinlan-lab/pathoscore/wiki/Clinical-Utility-and-Genes-of-Interest).\n\nOutput\n------\n\nAn example ROC curve for the ClinVar truth-set looks like this:\n\n![roc](https://user-images.githubusercontent.com/1739/29724634-6b730c44-8986-11e7-8b82-4341edcb3f0a.png \"roc\")\n\nThe point in the plot shows the max [J Statistic](https://en.wikipedia.org/wiki/Youden%27s_J_statistic), which can be\nsummarized as the point in each curve where the vertical distance to the Y=X line is maximized. 
\nThis has its highest possible value at an FPR of 0, so there is an implicit penalty for having a high TPR at a high-ish\nFPR.\n\nWe also report the full distribution of J statistics:\n\n\n![J](https://user-images.githubusercontent.com/1739/29724633-6b72ee30-8986-11e7-9e1d-1033392e2914.png \"J\")\n\nFinally, we report the proportion of *benign* and *pathogenic* variants scored in a truth-set:\n\n\n![scores](https://user-images.githubusercontent.com/1739/29724635-6b72f308-8986-11e7-89bb-3e86fc16fab7.png \"scores\")\n\nThese plots, along with the score-distributions for each method for pathogenic and benign, are aggregated into a single\nHTML report.\n\nInstall\n-------\n\nDownload a [vcfanno binary](https://github.com/brentp/vcfanno/releases) for your system and make it available as\n`vcfanno` on your `$PATH`.\n\nThen run:\n```\npip install -r requirements.txt\n```\n\nThen you should be able to run the evaluation scripts.\n\nTruth Sets\n----------\n\nPart of `pathoscore` is to provide curated truth sets that can be used for evaluation.\n\nThese are kept in `truth-sets/`. Each set has a benign and/or a pathogenic set. \n\nPull requests for recipes that add new truth sets are welcome. 
These should include a `make.sh`\nscript that, when run, pulls from the original data source and makes a benign and/or pathogenic\nVCF that is bgzipped, tabix-indexed, and as small as possible (see the clinvar example for how\nto remove unneeded fields from the INFO field).\n\nAll truth-sets should be annotated with `bcftools csq` so that it's possible to choose to score only\nfunctional variants.\n\nCurrently we have:\n\n### ClinVar\n\n+ ClinVar pathogenics are either `Pathogenic` or `Likely-Pathogenic` and variants with uncertainty are removed.\n+ ClinVar benigns are either `Benign` or `Likely-Benign` and variants with uncertainty are removed.\n+ ClinVar variants where there is an SSR field are removed because they are suspected false positives due to paralogy or computational/sequencing error.\n+ We created a version of ClinVar benigns that incorporates gnomAD variants to match the much larger count of ClinVar pathogenics, with the intent of creating an equal-sized set of pathogenics and benigns.  See the [README](truth-sets/GRCh37/clinvar/benchmark/README.md) for those sets for more info.\n\n### Samocha\n\nThese are from [Kaitlin Samocha's paper](http://www.biorxiv.org/content/early/2017/06/12/148353) on missense constraint.\n\n+ Benigns are labelled as `control` in her source file.\n+ Pathogenics are anything other than control.\n\n### Filtering Pathogenic Variants on Allele Frequency\n\nSome alleged pathogenic variants may appear at high allele frequencies in population databases, and some users may understandably find those variants suspect.  If you would like to filter out variants by allele frequency in a population set, an example conf file called [af.conf](scripts/gnomad/af.conf) is provided in the repo. 
If you have additional filtering parameters you'd like to specify, you can also use a conf file for that, as detailed in [vcfanno's repo](https://github.com/brentp/vcfanno).\n\nYou can then run the pathoscore script as below:\n\n```\npython pathoscore.py annotate --scores score-sets/GRCh37/MPC/mpc.txt.gz:MPC:5:max --scores score-sets/GRCh37/REVEL/revel.txt.gz:REVEL:7:max truth-sets/GRCh37/samocha/samocha.pathogenic.vcf.gz --prefix neurodev --conf af.conf\n```\n\nMake sure that you don't use a file more than once in the conf file; instead, write everything you want to do for each file in a list as shown above.  Additionally, don't use flags like `--scores` or `--exclude` to operate on a file that is already referenced in the conf file you provide to pathoscore; it will not work.\n\nFor user convenience, under `scripts/gnomad`, there are make scripts for generating vt-normalized, decomposed, and BCSQ-annotated ExAC v1 and gnomAD VCF files, so that you can filter by allele frequency in those population datasets.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquinlan-lab%2Fpathoscore","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquinlan-lab%2Fpathoscore","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquinlan-lab%2Fpathoscore/lists"}