{"id":13711477,"url":"https://github.com/mims-harvard/PrimeKG","last_synced_at":"2025-05-06T20:34:23.564Z","repository":{"id":40400572,"uuid":"482717861","full_name":"mims-harvard/PrimeKG","owner":"mims-harvard","description":"Precision Medicine Knowledge Graph (PrimeKG)","archived":false,"fork":false,"pushed_at":"2024-05-27T20:49:27.000Z","size":15449,"stargazers_count":422,"open_issues_count":3,"forks_count":93,"subscribers_count":13,"default_branch":"main","last_synced_at":"2024-11-13T22:34:53.574Z","etag":null,"topics":["bioinformatics","dataset","graph-machine-learning","knowledge-graph","network-medicine","nlp-machine-learning","precision-medicine","therapeutics"],"latest_commit_sha":null,"homepage":"https://zitniklab.hms.harvard.edu/projects/PrimeKG","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mims-harvard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-04-18T04:50:15.000Z","updated_at":"2024-11-13T04:00:39.000Z","dependencies_parsed_at":"2023-02-18T00:15:44.866Z","dependency_job_id":"93f93ace-11ae-4bbd-955e-e02c1f877d05","html_url":"https://github.com/mims-harvard/PrimeKG","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mims-harvard%2FPrimeKG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mims-harvard%2FPrimeKG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mims-harvard%2FPrimeKG/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mims-harvard%2FPrimeKG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mims-harvard","download_url":"https://codeload.github.com/mims-harvard/PrimeKG/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252765112,"owners_count":21800798,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","dataset","graph-machine-learning","knowledge-graph","network-medicine","nlp-machine-learning","precision-medicine","therapeutics"],"created_at":"2024-08-02T23:01:08.712Z","updated_at":"2025-05-06T20:34:22.382Z","avatar_url":"https://github.com/mims-harvard.png","language":"Jupyter Notebook","funding_links":[],"categories":["Curated list","Data Resources \u0026 Knowledge Graphs","Databases"],"sub_categories":["Biomedical knowledge graphs","Rule-Based \u0026 Logic Methods","Interaction"],"readme":"# PrimeKG\n----\n\n[![website](https://img.shields.io/badge/website-live-brightgreen)](https://zitniklab.hms.harvard.edu/projects/PrimeKG/)\n[![GitHub Repo stars](https://img.shields.io/github/stars/mims-harvard/PrimeKG)](https://github.com/mims-harvard/PrimeKG/stargazers)\n[![GitHub Repo forks](https://img.shields.io/github/forks/mims-harvard/PrimeKG)](https://github.com/mims-harvard/PrimeKG/network/members)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n\n[**Lab Website**](https://zitniklab.hms.harvard.edu/projects/PrimeKG/) | [**Nature 
Publication**](https://www.nature.com/articles/s41597-023-01960-3) | [**Harvard Dataverse**](https://doi.org/10.7910/DVN/IXA7BM)\n\n## TL;DR\n**Precision Medicine Knowledge Graph (PrimeKG)** presents a holistic view of diseases. PrimeKG integrates 20\nhigh-quality biomedical resources to describe 17,080 diseases with 4,050,249 relationships representing ten major\nbiological scales. We accompany PrimeKG’s graph structure with text descriptions of clinical guidelines for drugs and\ndiseases to enable multimodal analyses. Download [this CSV file](https://dataverse.harvard.edu/api/access/datafile/6180620)\nto get started!\n\n## News and Updates\n- [Dec 2023] PrimeKG is extended to improve coverage of OMIM data.\n  \n    \u003cdetails\u003e\u003csummary\u003eDetails:\u003c/summary\u003e\n\n    ### December 2023 update\n  \n    In December 2023, we prepared an updated version of PrimeKG that includes complete entries from the Online Mendelian Inheritance in Man\n    (OMIM) database in a standardized data format.\n\n    #### Changes to PrimeKG\n    As discussed in [issue #9](https://github.com/mims-harvard/PrimeKG/issues/9), OMIM phenotypes and genes were\n    not fully included in prior versions of PrimeKG. For more details, see \n    [this pull request](https://github.com/mims-harvard/PrimeKG/pull/12).\n    \n    To extend PrimeKG with a new data source and include edges between existing nodes in the knowledge graph,\n    we devised a standardized data format (see [PR#207](https://github.com/mims-harvard/TDC/pull/207) in mims-harvard/TDC)\n    that is used for all data sources and matches the format of the published PrimeKG edge list.\n    \n    #### Summary\n    * `datasets/processing_scripts/omim_tools.py` script contains functions to process OMIM data. 
\n    * `datasets/omim/` folder should store OMIM datasets.\n    * `datasets/omim/omim-api.ipynb` notebook is the OMIM API wrapper, which is used to download OMIM entries (note that\n      an API key is required).\n    * `knowledge_graph/append_omim.ipynb` notebook is used to append OMIM entries to PrimeKG. \n    * `scripts/utils.py` includes scripts that are used across multiple data sources.\n    \n    #### OMIM Database\n    Many of the OMIM phenotype entries have been already included in the PrimeKG through MONDO; however, there still exists\n    OMIM information that was not included in the PrimeKG. Thus, we add scripts and notebooks to cover OMIM genes, \n    phenotypes, and phenotypic series (see [here](https://www.omim.org/help/faq#1_13)) entries, and enable regular updates.\n\n    #### NCBI Gene\n    * OMIM gene entries are linked to NCBI Gene entries via new edges in the KG.\n    \n    #### Human Phenotype Ontology\n    * HPO-OMIM edges are added to PrimeKG.\n    \n    #### MONDO\n    * MONDO-OMIM edges are added to PrimeKG.\n    \n    #### Statistics\n    \n    New nodes and edges added:\n    ```text\n    # of new edges: 612282\n    # of new node: 32866\n    ```\n    \n    Updated edge count by `display_relation`:\n    ```text\n    display_relation\n    associated with    581387\n    linked to           26784\n    members              4111\n    ```\n\n    Updated edge_count by `relation`:\n    ```text\n    relation\n    mim_disease                        9599\n    mim_gene                          16636\n    mim_phenotype                    574128\n    mim_phenotypic_series              4111\n    mim_phenotypic_series_disease       549\n    phenotype_map                      7259\n    ```\n\n  \u003c/details\u003e\n\n- [July 2023] PrimeKG construction scripts are updated to include primary source data releases up to July 2023. 
Note that the files published on Harvard Dataverse remain unchanged; however, we provide new scripts and updated links should users wish to build their own current version of PrimeKG.\n  \n    \u003cdetails\u003e\u003csummary\u003eDetails:\u003c/summary\u003e\n\n    ### July 2023 update\n    \n    In July 2023, this repository was updated to rebuild PrimeKG and update the knowledge graph to include database releases up to July 2023. Note that the files published on Harvard Dataverse remain unchanged; however, we provide new scripts and updated links should users wish to build their own current version of PrimeKG. For more details, see [this pull request](https://github.com/mims-harvard/PrimeKG/pull/11).\n    \n    17 scripts in `datasets/processing_scripts/` were re-run or updated to build a new version of PrimeKG, while `datasets/feature_construction/` scripts may remain out of date. Re-run or updated primary data sources include Bgee, Comparative Toxicogenomics Database, DisGeNET, DrugBank, DrugCentral, NCBI Gene, Gene Ontology, Human Phenotype Ontology, MONDO, Reactome, SIDER, UBERON, and UMLS. \n    \n    For more information, see `datasets/primary_data_resources.sh`. 
Changes include the following:\n    \n    #### General\n    Created script to automatically create directory structure, pull data, and run all necessary processing and feature extraction steps.\n    * Fixed broken environment construction script.\n    * Script automatically creates required directories.\n    * Added commands to retrieve gene names, details, and NCBI ID to UniProt ID mapping from [www.genenames.org](http://www.genenames.org/), then output to `vocab/gene_names.csv` and `vocab/gene_map.csv`.\n\n    #### Bgee\n    * 58405/5257181 gold quality calls with expression rank \u003c 25000 now specify cell type in a particular tissue (_e.g._, UBERON:0000473 ∩ CL:0000089, which denotes germ line stem cell in testis).\n    * These rows are dropped in `bgee.py`.\n    * URL updated to [here](https://www.bgee.org/ftp/current/download/calls/expr_calls/Homo_sapiens_expr_advanced.tsv.gz).\n    \n    #### Comparative Toxicogenomics Database\n    * URL updated to [here](https://ctdbase.org/reports/CTD_exposure_events.csv.gz).\n    \n    #### DisGeNET\n    * No changes needed.\n    \n    #### DrugBank\n    * Fixed paths in `parsexml_drugbank.py`. Output to new `/parsed` subdirectory. Removed extraneous lines in `Parsed_feature.ipynb`.\n    * :white_check_mark: Successfully ran `drugbank_drug_drug.py` and `drugbank_drug_protein.py`.\n    * :warning: `parsexml_drugbank.py` and `Parsed_feature.ipynb` may need updates.\n    \n    #### DrugCentral\n    * Modified `drugcentral_queries.txt` to work on O2, the Harvard Medical School high-performance computing cluster.\n    * :warning:  `drugcentral_feature.Rmd` may need updates.\n    \n    #### NCBI Gene\n    * No changes needed.\n    \n    #### Gene Ontology\n    * Used `-L` flag to follow redirects. No other changes needed.\n    \n    #### Human Phenotype Ontology\n    * Used `-L` flag to follow redirects. 
No other changes needed to `hpo.py`.\n    * Updated `hpoa.py` to replace old column names with new column names.\n    \n    #### MONDO\n    * Added check for NoneType values in external references (line 29).\n    \n    #### Reactome\n    * No changes needed.\n    \n    #### SIDER\n    * No changes needed.\n    \n    #### UBERON\n    * Checked for NA values and dropped two obsolete terms (UBERON:0039300 and UBERON:0039302) not marked as obsolete in the source file.\n    \n    #### UMLS\n    * UMLS data pulled and paths updated for 2023 data.\n    * :warning: `umls.ipynb` may need updates.\n    \u003c/details\u003e\n  \n- [Feb 2023] PrimeKG is [published](https://www.nature.com/articles/s41597-023-01960-3) in Nature Scientific Data. \n- [Jun 2022] PrimeKG crosses 5,000 downloads on Harvard Dataverse! \n- [Apr 2022] PrimeKG is live on [bioRxiv](https://www.biorxiv.org/content/10.1101/2022.05.01.489928v1) and [Harvard Dataverse](https://doi.org/10.7910/DVN/IXA7BM)!\n\n\n## Table of Contents\n- [Unique Features of PrimeKG](#unique-features-of-primekg)\n- [Environment Setup](#environment-setup)\n- [Using PrimeKG](#using-primekg)\n- [Building an updated PrimeKG](#building-an-updated-primekg)\n- [Data Server](#data-server)\n- [Citing PrimeKG](#citing-primekg)\n- [License](#license)\n\n\n## Unique Features of PrimeKG\n \n- *Diverse coverage of diseases*: PrimeKG contains over 17,000 diseases, including rare diseases. Disease nodes in PrimeKG are densely connected to other nodes in the graph and have been optimized for clinical relevance in downstream precision medicine tasks. \n- *Heterogeneous knowledge graph*: PrimeKG contains over 100,000 nodes distributed over various biological scales as depicted below. 
PrimeKG also contains over 4 million relationships between these nodes distributed over 29 types of edges.\n- *Multimodal integration of clinical knowledge*: Disease and drug nodes in PrimeKG are augmented with clinical descriptors that come from medical authorities such as Mayo Clinic, Orphanet, Drug Bank, and so forth. \n- *Ready-to-use datasets*: PrimeKG is minimally dependent on external packages. Our knowledge graph can be retrieved in a ready-to-use format from Harvard Dataverse.\n- *Data functions*: PrimeKG provides extensive data functions, including processors for primary resources and scripts to build an updated knowledge graph.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/mims-harvard/PrimeKG/blob/main/fig/schematic.png\" alt=\"overview\" width=\"600px\" /\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/mims-harvard/PrimeKG/blob/main/fig/PrimeKG-example.png\" alt=\"PrimeKG-example\"/\u003e\u003c/p\u003e\n\n## Environment setup\n\n### Using `pip`\n\nTo install the dependencies required to run the PrimeKG code, use `pip`:\n\n```bash\npip install -r updated_requirements.txt\n```\n\n### Or use `conda`\n\n```bash\nconda env create --name PrimeKG --file=environment.yml\n```\n\n## Using PrimeKG\n\nFor a quick start in Python, you can download the raw data files in `.csv` format directly from [Harvard Dataverse](https://doi.org/10.7910/DVN/IXA7BM) or load PrimeKG using the following community dataloaders.\n\n### Getting started in Python\n\nDownload PrimeKG from Harvard Dataverse using the following bash command. You can replace `kg.csv` with any file path. \n```bash\nwget -O kg.csv https://dataverse.harvard.edu/api/access/datafile/6180620\n```\nYou can use the following code to load PrimeKG and visualize its data. 
\n```python\nimport pandas as pd\nprimekg = pd.read_csv('kg.csv', low_memory=False)\nprimekg.query('y_type==\"disease\"|x_type==\"disease\"')\n```\n\n### Dataloader: Therapeutics Data Commons \n[website](https://tdcommons.ai) | [docs](https://github.com/mims-harvard/TDC)\n```bash\npip install PyTDC\n```\n```python\nfrom tdc.resource import PrimeKG\ndata = PrimeKG(path = './data')\ndrug_feature = data.get_features(feature_type = 'drug')\ndata.to_nx()\ndata.get_node_list(type = 'disease')\n```\n\n### Dataloader: PyKEEN \n[website](https://github.com/pykeen/pykeen) | [docs](https://pykeen.readthedocs.io/en/latest/api/pykeen.datasets.PrimeKG.html)\n```\npip install pykeen\n```\n```python\nimport pykeen.datasets\npykeen.datasets.has_dataset('primekg')\n```\n\n## Building an updated PrimeKG\n\n### Downloading primary data resources\n\nAll persistent identifiers and weblinks to download the 20 primary data resources used to build PrimeKG are systematically provided in the Data Records section of our article. We have also mentioned the exact filenames that were downloaded from each resource for easy corroboration. \n\n### Curating primary data resources\n\nWe provide the scripts used to process all primary data resources and the names of the resulting output files generated by those scripts. We would be happy to share the intermediate processing datasets that were used to create PrimeKG on request. 
\n\n| Database                            | Processing scripts                            | Expected script output                                                                             |\n|-------------------------------------|-----------------------------------------------|----------------------------------------------------------------------------------------------------|\n| Bgee                                | bgee.py                                       | anatomy_gene.csv                                                                                   |\n| Comparative Toxicogenomics Database | ctd.py                                        | exposure_data.csv                                                                                  |\n| DisGeNET                            | -                                             | curated_gene_disease_associations.tsv                                                              |\n| DrugBank                            | drugbank_drug_drug.py                         | drug_drug.csv                                                                                      |\n| DrugBank                            | parsexml_drugbank.ipynb, Parsed_feature.ipynb | 12 drug feature files                                                                              |\n| DrugBank                            | drugbank_drug_protein.py                      | drug_protein.csv                                                                                   |\n| Drug Central                        | drugcentral_queries.txt                       | drug_disease.csv                                                                                   |\n| Drug Central                        | drugcentral_feature.Rmd                       | dc_features.csv                                                                                    |\n| Entrez Gene                         | ncbigene.py                                   | 
protein_go_associations.csv                                                                        |\n| Gene Ontology                       | go.py                                         | go_terms_info.csv, go_terms_relations.csv                                                          |\n| Human Phenotype Ontology            | hpo.py, hpo_obo_parser.py                     | hp_terms.csv, hp_parents.csv, hp_references.csv                                                    |\n| Human Phenotype Ontology            | hpoa.py                                       | disease_phenotype_pos.csv, disease_phenotype_neg.csv                                               |\n| MONDO                               | mondo.py,  mondo_obo_parser.py                | mondo_terms.csv, mondo_parents.csv, mondo_references.csv, mondo_subsets.csv, mondo_definitions.csv |\n| OMIM                                | omim_tools.py, omim-api.ipynb                 | mim2gene.txt, mimTitles.txt, genemap2.txt, morbidmap.txt, \u003comim_full_path\u003e.json                    |\n| Reactome                            | reactome.py                                   | reactome_ncbi.csv, reactome_terms.csv, reactome_relations.csv                                      |\n| SIDER                               | sider.py                                      | sider.csv                                                                                          |\n| UBERON                              | uberon.py                                     | uberon_terms.csv, uberon_rels.csv, uberon_is_a.csv                                                 |\n| UMLS                                | umls.py, map_umls_mondo.py                    | umls_mondo.csv                                                                                     |\n| UMLS                                | umls.ipynb                                    | umls_def_disorder_2021.csv, umls_def_disease_2021.csv                                     
         |\n\n### Harmonizing datasets into PrimeKG\n\nThe code to harmonize datasets and construct PrimeKG is available at `build_graph.ipynb`. Run this Jupyter notebook to construct the knowledge graph from the outputs of the processing scripts listed above. The notebook produces all three versions of PrimeKG: `kg_raw.csv`, `kg_giant.csv`, and the complete version `kg.csv`. \n\n[//]: # (### Building extended version of PrimeKG)\n\n### Feature extraction\n\nThe code required to engineer features can be found at `engineer_features.ipynb` and `mapping_mayo.ipynb`. \n\n## Citing PrimeKG\n\nIf you find PrimeKG useful, cite our work:\n```\n@article{chandak2022building,\n  title={Building a knowledge graph to enable precision medicine},\n  author={Chandak, Payal and Huang, Kexin and Zitnik, Marinka},\n  journal={Nature Scientific Data},\n  doi={https://doi.org/10.1038/s41597-023-01960-3},\n  URL={https://www.nature.com/articles/s41597-023-01960-3},\n  year={2023}\n}\n```\n\n## Data Server\n\nPrimeKG is hosted on [Harvard Dataverse](https://doi.org/10.7910/DVN/IXA7BM) with the following persistent\nidentifier [https://doi.org/10.7910/DVN/IXA7BM](https://doi.org/10.7910/DVN/IXA7BM). When Dataverse is under\nmaintenance, PrimeKG datasets cannot be retrieved. That happens rarely; please check the status on\n[the Dataverse website](https://dataverse.harvard.edu/).\n\n## License\nThe PrimeKG codebase and associated tools are released under the MIT license. Please note that this license specifically refers to the PrimeKG software, and is distinct from any licenses governing the PrimeKG dataset itself. 
For usage of individual datasets, refer to the respective dataset licenses available on each data source's website.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmims-harvard%2FPrimeKG","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmims-harvard%2FPrimeKG","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmims-harvard%2FPrimeKG/lists"}