{"id":38676499,"url":"https://github.com/wglab/phenosv","last_synced_at":"2026-01-17T10:01:12.275Z","repository":{"id":173139645,"uuid":"628754962","full_name":"WGLab/PhenoSV","owner":"WGLab","description":"PhenoSV: Interpretable phenotype-aware model for the prioritization of genes affected by structural variants.","archived":false,"fork":false,"pushed_at":"2025-02-11T15:32:35.000Z","size":16268,"stargazers_count":16,"open_issues_count":3,"forks_count":4,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-02-11T16:35:49.464Z","etag":null,"topics":["phenotyping","structural-variants","transformer","variant-interpretation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WGLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-16T22:45:32.000Z","updated_at":"2025-02-11T15:32:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"a3c28cc1-7266-45c1-b93c-75711c46cd64","html_url":"https://github.com/WGLab/PhenoSV","commit_stats":null,"previous_names":["wglab/phenosv"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/WGLab/PhenoSV","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WGLab%2FPhenoSV","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WGLab%2FPhenoSV/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WGLab%2FPhenoSV/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WGLab%2FPhenoSV/manifests","owner_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WGLab","download_url":"https://codeload.github.com/WGLab/PhenoSV/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WGLab%2FPhenoSV/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28505570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T06:57:29.758Z","status":"ssl_error","status_checked_at":"2026-01-17T06:56:03.931Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["phenotyping","structural-variants","transformer","variant-interpretation"],"created_at":"2026-01-17T10:01:06.979Z","updated_at":"2026-01-17T10:01:12.241Z","avatar_url":"https://github.com/WGLab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PhenoSV: Interpretable phenotype-aware model for the prioritization of genes affected by structural variants.\n[![DOI](https://zenodo.org/badge/628754962.svg)](https://zenodo.org/doi/10.5281/zenodo.10028740)\n\n## Background\nStructural variants (SVs) represent a major source of genetic variation and may be associated with phenotypic diversity and disease susceptibility. Recent advancements in long-read sequencing have revolutionized the field of SV detection, enabling the discovery of over 20,000 SVs per human genome. 
However, identifying and prioritizing disease-related SVs and assessing their functional impacts on individual genes remains challenging, especially for noncoding SVs. \n\nPhenoSV is a phenotype-aware machine-learning model that predicts the pathogenicity of all types of structural variants (SVs) that disrupt either coding or noncoding genome regions, including deletions, duplications, insertions, inversions, and translocations. PhenoSV segments SVs and annotates each segment using hundreds of genomic features, then adopts a transformer-based architecture to predict functional impacts of SVs under a multiple-instance learning framework. When phenotype information is available, PhenoSV further utilizes gene-phenotype associations to prioritize disease-related SVs. \n\n## Web server\nFor SVs that affect fewer than 30 protein-coding genes (10 protein-coding genes each for batch predictions), we provide a web server at https://phenosv.wglab.org for easy application of PhenoSV. If you want to score SVs that affect more than 30 genes or make batch predictions, please install PhenoSV and run it offline. \n\n## Installation\n\n### Step1: install sources \nTo avoid package version conflicts, we strongly recommend using conda to set up the environment. If you don't have conda installed, you can run the commands below on Linux to install it.\n\n```\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \nbash Miniconda3-latest-Linux-x86_64.sh\n```\n\nAfter installing conda, PhenoSV sources can be downloaded:\n\n```\ngit clone https://github.com/WGLab/PhenoSV.git\ncd PhenoSV\nconda env create --name phenosv --file phenosv.yml\nconda activate phenosv\n```\n\n### Step2: download and set up required files\nSome files are required by PhenoSV, including genomic feature files, the pre-trained PhenoSV model, Phen2Gene knowledgebases, etc. All of the files required by the full version of PhenoSV have been packed together and can be downloaded using the command below. 
The packed file takes about 160G of storage, so please make sure to store these files under a directory with enough space. This directory can be different from the one used to save the PhenoSV source code. Due to the large size of the feature files, this step might take some time. Simply run the command below to set up everything. \n\n```\nbash setup.sh /path/to/folder\n```\n\nWe also offer a light-weight version, PhenoSV-light, as a highly efficient alternative to PhenoSV. PhenoSV-light uses only 42 features, with much improved annotation efficiency and without compromising predictive accuracy except for translocations. The packed file takes about 50G of storage. Run the command below to set up PhenoSV-light. \n\n```\nbash setup.sh /path/to/folder 'light'\n```\n\nNote that users who downloaded the full set of required files can execute both PhenoSV and PhenoSV-light. Users who downloaded the light version files can only execute PhenoSV-light.\n\nIf you need to change the path where the required data files are stored and update the config file accordingly, run the command below.\n\n```\nbash update_config.sh /path/to/newfolder \n```\n\n\n### Step3: install PhenoSV as a python package (optional)\n\nThis step is not required. If you want to integrate PhenoSV into your own Python scripts, you can install PhenoSV as a Python package after completing the steps above.\n\n```\npip install -e .\n```\n\n## Run PhenoSV in linux\n\nIn linux, PhenoSV can be used to: \n- score a single SV with or without prior phenotype knowledge \n- score a list of SVs in .bed, .csv, or .bedpe format with or without prior phenotype knowledge. \n\nType `python3 phenosv/model/phenosv.py -h` to see all options. 
\n\n```\noptions:\n  -h, --help            show this help message and exit\n  --genome GENOME       choose genome build between hg38 (default) and hg19\n  --alpha ALPHA         A positive value with larger value representing more contribution of phenotype information in refining PhenoSV\n                        scores. Default is 1\n  --inference INFERENCE\n                        leave it blank (default) if only considering direct impacts of coding SVs. Set to `full` if inferring both\n                        direct and indirect impacts of coding SVs\n  --model MODEL         choose between PhenoSV (default) and PhenoSV-light\n  --c C                 chromosome, e.g. chr1\n  --s S                 start, e.g. 2378909\n  --e E                 end, e.g. 2379909\n  --c2 C2               chromosome2, e.g. chr1, only for translocation\n  --s2 S2               start2, e.g. 2378909, only for translocation\n  --strand1 STRAND1     strand1, + or - , only for translocation\n  --strand2 STRAND2     strand2, + or - , only for translocation\n  --svtype SVTYPE       deletion, duplication, insertion, inversion, translocation\n  --noncoding NONCODING\n                        inference mode, choose from distance and tad\n  --HPO HPO             HPO terms should in the format of HP:digits, e.g., HP:0000707, separated by semicolons, commas, or\n                        spaces.\n  --sv_file SV_FILE     path to SV file (.csv, .bed, .bedpe)\n  --target_folder TARGET_FOLDER\n                        enter the folder path to save PhenoSV results, leave it blank if you only want to print out the results\n  --target_file_name TARGET_FILE_NAME\n                        enter the file name to save PhenoSV results\n```\n\n\n### Score a single SV\n\nThe time PhenoSV takes to score a single SV depends on the number of genes the SV impacts. 
For the examples below, PhenoSV is expected to generate results within a few seconds.\n\n#### deletion, duplication, insertion, inversion\n##### Example1\nYou can use the following command to score a single SV (deletion, duplication, insertion, inversion) by providing the SV location and type. The required arguments are: --c: chromosome, --s: start position, --e: end position (can be omitted for insertions), --svtype: SV type.\n\n```\npython3 phenosv/model/phenosv.py --c chr6 --s 156994830 --e 157006982 --svtype 'deletion'\n```\n\n##### Example2\nSince this example SV is a noncoding SV, PhenoSV by default treats genes within 1 Mbp upstream and downstream as impacted. PhenoSV can also select genes based on the consensus TAD annotation when the `--noncoding` argument is set to 'tad'. Prior phenotype information can be added using the `--HPO` argument. Here is an example:\n\n```\npython3 phenosv/model/phenosv.py --c chr6 --s 156994830 --e 157006982 --svtype 'deletion' --noncoding 'tad' --HPO 'HP:0000707,HP:0007598'\n```\n\nPhenoSV will output the results below. Without considering phenotype information, PhenoSV predicts the SV-level pathogenicity as 0.66. The gene-level pathogenicity scores are 0.82 for ARID1B, whose introns are disrupted, and 0.05 for NOX3 and 0.53 for TFB1M, whose regulatory elements are indirectly altered. After adding phenotype information, PhenoSV scores are 0.66 for the whole SV and 0.82 for the ARID1B gene.\n\n```\n  Elements  Pathogenicity           Type  Phen2Gene   PhenoSV\n0       SV       0.664912  Non-coding SV   0.999126  0.664331\n1   ARID1B       0.823556       Intronic   0.999126  0.822836\n2     NOX3       0.045648     Regulatory   0.837460  0.038229\n3    TFB1M       0.533431     Regulatory   0.544762  0.290593\n\n```\n\n##### Example3\nAs another example from the paper, we investigated an SV that indirectly impacts SOX9 (Kurth et al. 2009, GRCh38, chr17: 70134929-71339950, duplication). 
This is a coding SV impacting exons of the genes KCNJ16 and KCNJ2, as seen below. By default, the model only predicts direct impacts of coding SVs on genes within the SV region, and the pathogenicity score is 0.07, so the SV is very likely to be benign.\n\n```\npython3 phenosv/model/phenosv.py --c chr17 --s 70134929 --e 71339950 --svtype 'duplication'\n```\n\n```\n  Elements  Pathogenicity           Type\n0       SV       0.066281      Coding SV\n1   KCNJ16       0.016484         Exonic\n2    KCNJ2       0.088711         Exonic\n\n```\n\nWe can add the argument `--inference 'full'` to infer both direct and indirect impacts of coding SVs (the command below also sets `--noncoding 'tad'`). Here the model predicts an SV pathogenicity score of 0.69 through indirect impacts on genes, with gene-level pathogenicity scores of 0.59 for MAP2K6 and 0.61 for SOX9. \n\n```\npython3 phenosv/model/phenosv.py --c chr17 --s 70134929 --e 71339950 --svtype 'duplication' --inference 'full' --noncoding 'tad'\n```\n\n```\n  Elements  Pathogenicity                Type\n0       SV       0.066281           Coding SV\n1   KCNJ16       0.016484              Exonic\n2    KCNJ2       0.088711              Exonic\n3       SV       0.688360  Coding SV indirect\n4   MAP2K6       0.594582          Regulatory\n5     SOX9       0.610116          Regulatory\n\n```\n\n\n##### Example4\n\nTo run the PhenoSV-light model, simply add `--model 'PhenoSV-light'` as shown below.\n\n```\npython3 phenosv/model/phenosv.py --c chr6 --s 156994830 --e 157006982 --svtype 'deletion' --model 'PhenoSV-light'\n```\n\n#### translocation\n\nPhenoSV can also be used to interpret translocations. 
The required arguments are: --c: 5' chromosome, --s: 5' breakpoint, --c2: 3' chromosome, --s2: 3' breakpoint, --svtype: SV type (translocation). The arguments --strand1 (5' strand) and --strand2 (3' strand) are optional and default to '+'.\n\n```\npython3 phenosv/model/phenosv.py --c chr6 --s 156994830 --strand1 '+' --c2 chr7 --s2 156994830 --strand2 '+' --svtype 'translocation'\n```\n\nPhenoSV will output the results below. The SV-level pathogenicity is 0.98; the translocation generates an ARID1B-MNX1 fusion gene, with PhenoSV scores of 0.98 and 0.95 for ARID1B and MNX1, respectively.\n\n```\n  Elements  Pathogenicity       Type                             ID\n0       SV       0.975636  Coding SV  chr6:156994830-chr7:156994830\n1   ARID1B       0.975636     Exonic  chr6:156994830-chr7:156994830\n2     MNX1       0.947925     Exonic  chr6:156994830-chr7:156994830\n```\n\n### Score multiple SVs\n\nPhenoSV accepts csv, bed, and bedpe files as input to score multiple SVs. Some examples are provided at `data/`. Files in csv and bed format can be used to score deletions, duplications, inversions, and insertions. Files in bedpe format can be used to score translocations. Fields of csv and bed files should be: chromosome, start, end, ID, svtype, HPO (optional). Fields of bedpe files should be: chromosome1, start1, end1, chromosome2, start2, end2, strand1, strand2, ID. start1 and start2 will be treated as breakpoints, whereas end1 and end2 will not be used by PhenoSV. Note that if input files do not have HPO terms, you can use the `--HPO` argument to add phenotype information, which applies the same HPO terms to all SVs in the file. If HPO terms are present in input files, the `--HPO` argument will be ignored.\n\nTo score multiple SVs using a single process, run: \n\n```\npython3 phenosv/model/phenosv.py --sv_file data/sampledata.bed --target_folder data/ --target_file_name sample_bed_out\n```\n\nYou can also score multiple SVs in parallel to speed up scoring. 
Below is an example of running PhenoSV with 4 processes in parallel (set using the `--workers` argument). Omit the `--HPO` argument if HPO terms are already in the input file or you don't have prior phenotype information. \n\n```\nbash phenosv/model/phenosv.sh --sv_file 'path/to/sv/data.csv' --target_folder 'folder/path/to/store/results' --workers 4 --HPO 'HP:0000707,HP:0007598' --model PhenoSV-light\n```\n\n\n\n## Run PhenoSV in Python\n\nYou can run PhenoSV in Python, and the output will be a pandas dataframe with the SV-level and the gene-level predictions for a given SV.\n\n```\n#import packages\nimport os\nimport phenosv\nfrom phenosv.model.phenosv import init\nimport phenosv.model.operation_function as of\n\n#get configurations, use the light argument for choosing between PhenoSV and PhenoSV-light\nconfig_path = os.path.join(os.path.dirname(phenosv.__file__), '..', 'lib', 'fpath.config')\nconfigs, ckpt = init(config_path, ckpt = True, light = False)\n\n# set 'tad_path' as None to consider genes within 1Mbp upstream and downstream of a noncoding SV.\n# do not run this line if you want to use TAD annotations to interpret noncoding SVs\nconfigs['tad_path']=None\n\n# users can also specify tissue-specific TAD annotations based on research goals\n# simply assign TAD annotation path using configs['tad_path']='path/to/tad_annotation.bed'\n\n#load model\nmodel = of.prepare_model(ckpt)\n\n#to interpret a single SV that is not translocation\nof.phenosv(CHR='chr6', START=156994830, END=157006982, svtype='deletion', model=model, HPO=None, **configs)\n\n#to interpret a single SV that is translocation\nof.phenosv(CHR='chr6', START=156994830,END=None,svtype='translocation', model=model, HPO=None,\n           CHR2='chr11', START2=111728347, strand1='+',strand2='+',**configs)\n\n#liftover if using hg19 build\nfrom liftover import get_lifter\nimport phenosv.utilities.utility as u\nconverter = get_lifter('hg19', 'hg38')\n\n#SV in hg19\nCHR, START, END ='chr6',157315964, 157328116 
\n#liftover to hg38\nSTART = u.liftover(CHR, START, converter)\nEND = u.liftover(CHR, END, converter)\nof.phenosv(CHR=CHR, START=START, END=END, svtype='deletion', model=model, HPO=None, **configs)\n\n```\n## User defined TAD annotations\n\nPhenoSV relies on pre-determined sets of candidate genes when interpreting the impacts of noncoding SVs, chosen either by distance or by TAD annotations. An aggregated TAD annotation file is already included with PhenoSV. This annotation file is not tissue-specific. Users can find this file under the `data` folder at the pre-assigned path (saved in lib/fpath.config). The file name is `tad_w_boundary_08.bed`.\n\nSince TAD annotations are tissue specific, users can also use their own TAD annotations depending on their study goals. The annotation file should be in bed format. Run the command below to assign a different annotation file. \n\n```\npython3 phenosv/model/phenosv.py --c chr6 --s 156994830 --e 157006982 --svtype 'deletion' --noncoding 'path/to/tad_annotation.bed'\n```\n\nHere is an example showing the format of the TAD annotation file. The four columns are 'CHR', 'START', 'END', and 'Indicators' (0 means TAD domain, 1 means TAD boundary).\n\n```\nchr1    0       724620  0\nchr1    724620  764620  1\nchr1    764620  3423436 0\nchr1    3423436 3463436 1\nchr1    3463436 3783436 0\nchr1    3783436 3823436 1\n```\n\n\n## Annotate SVs\n\nWith the commands above, PhenoSV annotates SVs with hundreds of genomic features on the fly and then feeds them into the pre-trained model to make pathogenicity calls. Users can also save SV annotations beforehand and make predictions afterward using the commands below. 
\n\n```\n#Annotate SVs using a single CPU core\npython3 phenosv/model/annotation.py --sv_file data/sampledata.csv --target data/\n\n#Annotate SVs using multiple CPU cores in parallel (4 cores in the example)\nbash annotation.sh data/sampledata.csv path/to/output/folder/ 4\n```\n\n## Archived datasets\n\nWe deposited the simulated patients' SV profiles used in the manuscript for prioritization. Several things to note:\n- Each file corresponds to one patient's SV profile after filtering out all common SVs\n- Within each SV profile, the first row is the real disease-associated SV; the remaining SVs are rare noise SVs. \n- We hid the coordinates of all SVs from DECIPHER, but they can be queried at https://www.deciphergenomics.org \n- The column of `Pathogenicity` represents general pathogenicity scores predicted by PhenoSV ($p_{sv}$), `Phen2Gene` represents gene-phenotype associations, and `PhenoSV` represents SV pathogenicity associated with given phenotype information ($p_{sv}^{pheno}$) when setting $\\alpha=1$.\n\nUse the command below to download the simulation data. \n\n```\nwget https://www.openbioinformatics.org/PhenoSV/prioritization_simulation_benchmark.tar.gz\n```\n\nSource data used to generate all figures in the paper can be found [here](https://github.com/WGLab/PhenoSV/tree/main/data)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwglab%2Fphenosv","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwglab%2Fphenosv","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwglab%2Fphenosv/lists"}