Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
https://github.com/robert-haas/awesome-biomedical-knowledge-graphs

A curated list of biomedical knowledge graphs and of resources for their construction.
https://github.com/robert-haas/awesome-biomedical-knowledge-graphs
List: awesome-biomedical-knowledge-graphs
awesome awesome-list biomedical-knowledge-graph knowledge-graph
Last synced: about 2 months ago
JSON representation
A curated list of biomedical knowledge graphs and of resources for their construction.
Host: GitHub
URL: https://github.com/robert-haas/awesome-biomedical-knowledge-graphs
Owner: robert-haas
License: cc-by-sa-4.0
Created: 2024-02-06T22:46:35.000Z (5 months ago)
Default Branch: main
Last Pushed: 2024-02-06T23:18:06.000Z (5 months ago)
Last Synced: 2024-05-08T00:13:58.781Z (2 months ago)
Topics: awesome, awesome-list, biomedical-knowledge-graph, knowledge-graph
Language: TeX
Homepage: https://robert-haas.github.io/awesome-biomedical-knowledge-graphs
Size: 926 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists

ultimate-awesome - awesome-biomedical-knowledge-graphs - A curated list of biomedical knowledge graphs and of resources for their construction. (Other Lists / Julia Lists)
README

        # Awesome biomedical knowledge graphs [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

> A curated list of biomedical knowledge graphs and of resources for their construction.

![logo](src/logo.png)

This repository is inspired by [awesome lists](https://github.com/sindresorhus/awesome) and follows the style guide of the [awesome manifesto](https://github.com/sindresorhus/awesome/blob/main/awesome.md).

## Table of contents

- [Introduction](#Introduction)

- [Survey](#Survey)

  - [PDF report](target/bmkg.pdf) and accompanying [website](https://robert-haas.github.io/awesome-biomedical-knowledge-graphs)

- [Curated list](#Curated-list)

  - [Biomedical knowledge graphs](#Biomedical-knowledge-graphs)

  - [Tools](#Tools)

  - [Databases](#Databases)

  - [Ontologies and controlled vocabularies](#Ontologies-and-controlled-vocabularies)

  - [File formats](#File-formats)

## Introduction

The following information was generated by 1) getting a broad overview of academic and commercial projects that provide [knowledge graphs](https://en.wikipedia.org/wiki/Knowledge_graph) in the domain of [biomedicine](https://en.wikipedia.org/wiki/Biomedicine) as well as resources for creating them and 2) narrowing them down to a small subset that I consider *awesome* due to the quality or relevance of their provided results. I hope both collections serve you well! If you have suggestions or find an error, please don't hesitate to [contact me](mailto:[email protected]) or to contribute directly with a [pull request](https://docs.github.com/en/pull-requests).

## Survey

A [PDF report](target/bmkg.pdf) and accompanying [website](https://robert-haas.github.io/awesome-biomedical-knowledge-graphs) were created to present a comprehensive overview of available biomedical knowledge graphs and of resources for their construction.

## Curated list

A carefully selected subset of the survey's entries are presented here in the style of an [awesome list](https://github.com/sindresorhus/awesome).

### Biomedical knowledge graphs

- **Biomedical Data Translator** –

  [Publication (2022)](https://doi.org/10.1111/cts.13301),

  [Website](https://ncatstranslator.github.io/TranslatorTechnicalDocumentation/),

  [Code](https://github.com/NCATSTranslator/ReasonerAPI),

  [API](https://smart-api.info/portal/translator),

  [Demo](https://ui.transltr.io/demo)

  - Content:

    - A collection of harmonized APIs

  - Scope:

    - "integrated data from over 250 knowledge sources, each exposed via open application programming interfaces (APIs)"

    - "a diverse community of nearly 200 basic and clinical scientists, informaticians, ontologists, software developers, and practicing clinicians distributed over 11 teams and 28 institutions to form the Biomedical Data Translator Consortium"

  - Goals:

    - "integrate as many datasets as possible, using a ‘knowledge graph’–based architecture, and allow them to be cross-queried and reasoned over by translational researchers"

    - "integrating existing biomedical data sets and “translating” those data into insights intended to augment human reasoning and accelerate translational science"

    - "promote serendipitous discovery and augment human reasoning in a variety of disease spaces"

    - "federate autonomous reasoning agents and knowledge providers within a distributed system for answering translational questions"

  - Sub-projects that construct knowledge graphs:

    - **ROBOKOP** –

      [Publication (2019)](https://doi.org/10.1021/acs.jcim.9b00683),

      [Code](https://github.com/NCATS-Gamma/robokop),

      [Data](https://stars.renci.org/var/plater/bl-3.5.4/RobokopKG/929354295ba7d43c/)

    - **RTX-KG2** –

      [Publication (2022)](https://doi.org/10.1186/s12859-022-04932-3),

      [Code](https://github.com/RTXteam/RTX-KG2),

      [Data](http://rtx-kg2-public.s3-website-us-west-2.amazonaws.com/)

- **Bioteque** –

  [Publication (2022)](https://doi.org/10.1038/s41467-022-33026-0),

  [Website](https://bioteque.irbbarcelona.org),

  [Code](https://gitlabsbnb.irbbarcelona.org/bioteque/bioteque),

  [Data](https://bioteque.irbbarcelona.org/downloads)

  - Content:

    - 450,000 nodes of 12 types

    - 30 million edges of 67 types

    - Extracted from 150 data sources

    - Provided as triples in multiple TSV files

  - Scope:

    - "a resource of unprecedented size and scope that contains pre-calculated embeddings derived from a gigantic heterogeneous network"

    - "Bioteque embeddings retain the information contained in the large biological network"

  - Goals:

    - "make biomedical knowledge embeddings available to the broad scientific community"

    - "evaluate, characterize and predict a wide set of experimental observations"

    - "assessment of high-throughput protein-protein interactome data"

    - "prediction of drug response and new repurposing opportunities"

- **CKG** –

  [Publication (2023)](https://doi.org/10.1038/s41587-021-01145-6),

  [Website](https://ckg.readthedocs.io),

  [Code](https://github.com/MannLabs/CKG),

  [Data](https://doi.org/10.17632/mrcf7f4tc2)

  - Full name: Clinical Knowledge Graph

  - Content:

    - 20 million nodes

    - 220 million edges

    - Extracted from 26 databases, 10 ontologies, 7 million publications

    - Provided as Neo4j graph database

  - Scope:

    - "prior knowledge, experimental data and de-identified clinical patient information"

    - "harmonization of proteomics with other omics data while integrating the relevant biomedical databases and text extracted from scientific publications"

  - Goals:

    - "inform clinical decision-making"

    - "reveal candidate markers of prognosis and/or treatment"

    - "generate new hypotheses that ultimately translate into clinically actionable results"

    - "clinically meaningful queries and advanced statistical analyses"

    - "liver disease biomarker discovery"

    - "multi-proteomics data integration for cancer biomarker discovery and validation"

    - "prioritize treatment options for chemorefractory cases"

- **HALD** –

  [Publication (2023)](https://doi.org/10.1038/s41597-023-02781-0),

  [Website](https://bis.zju.edu.cn/hald),

  [Code](https://github.com/zexuwu/hald),

  [Data](https://doi.org/10.6084/m9.figshare.22828196.v4)

  - Full name: Human Aging and Longevity Dataset

  - Content:

    - 12,227 nodes of 10 types

    - 115,522 edges of various types

    - Extracted from 339,918 biomedical articles in PubMed

    - Provided as triples with additional information in multiple JSON and CSV files

  - Scope:

    - "a text mining-based human aging and longevity dataset of the biomedical knowledge graph from all published literature related to human aging and longevity in PubMed"

  - Goals:

    - "precision gerontology and geroscience analyses"

    - "provide predictions regarding the individuals’ lifespan under various treatment scenarios"

    - "devise novel, biologically-driven therapeutic and preventive strategies that address fundamental aging mechanisms"

- **Monarch KG** –

  [Publication (2024)](https://doi.org/10.1093/nar/gkad1082),

  [Website](https://monarchinitiative.org),

  [Code](https://github.com/monarch-initiative/monarch-ingest),

  [Data](https://data.monarchinitiative.org/monarch-kg/index.html)

  - Naming explanation: "The name ’Monarch Initiative’ was chosen because it is a community effort to create paths for diverse data to be put to use for disease discovery, not unlike the navigation routes that a monarch butterfly would take."

  - Content:

    - 862,115 nodes of 88 types

    - 11,412,471 edges of 23 types

    - Extracted from 33 biomedical resources and biomedical ontologies and "updated with the latest data from each source once a month"

    - Provided in various formats such as SQLite, Neo4J, RDF, KGX

  - Scope:

    - "Monarch App includes an ETL platform for ingesting, harmonizing, and serving diverse life science data relating genes, phenotypes, and diseases into a semantic KG for use in various downstream applications"

    - "Monarch KG integrates gene, disease, and phenotype data"

    - "Monarch Assistant, which will combine the ability of LLMs to answer questions in plain language with Monarch’s extensive KG and analysis algorithms"

  - Goals:

    - "learn different things about the relationship between genotype and phenotype from different organisms"

    - "collect, integrate, and make a broad compendium of species and sources computable"

- **OREGANO** –

  [Publication (2023)](https://doi.org/10.1038/s41597-023-02757-0),

  [Code](https://gitub.u-bordeaux.fr/erias/oregano),

  [Data](https://gitub.u-bordeaux.fr/erias/oregano/-/tree/master/Data_OREGANO/Graphs)

  - Content:

    - 88,937 nodes of 11 types

    - 824,231 edges of 19 types

    - Extracted from various drug, protein and phenotype databases

    - Provided as triples in a TSV file

  - Scope:

    - "a holistically constructed knowledge graph using the broadest possible features and drug characteristics"

    - "integration of natural compounds (i.e. herbal and plant remedies)"

    - "incorporating together disease and drug information and natural compounds"

  - Goals:

    - "computational drug repositioning"

    - "generate hypotheses (molecule/drug - target links) through link prediction"

    - "from the available data, determine whether a drug is potentially capable of binding to a new target"

    - "identify possible repositionable molecules using machine learning (or more specifically deep learning) algorithms"

- **PharMeBINet** –

  [Publication (2022)](https://doi.org/10.1038/s41597-022-01510-3),

  [Website](https://pharmebi.net/#/),

  [Code](https://github.com/ckoenigs/PharMeBINet),

  [Data](https://doi.org/10.5281/zenodo.5816976)

  - Full name: Pharmacological Medical Biochemical Network

  - Content:

    - 2,869,407 nodes of 66 types

    - 15,883,653 edges of 208 types

    - Extracted from 48 data sources

    - Provided as Neo4j graph database and GraphML file

  - Scope:

    - "heterogeneous information on drugs, ADRs, genes, proteins, gene variants, and diseases"

  - Goals:

    - "analysis of ADRs [Adverse Drug Reactions]"

    - "analysis of possible existing connections between gene variants and drugs"

- **PrimeKG** –

  [Publication (2023)](https://doi.org/10.1038/s41597-023-01960-3),

  [Website](https://zitniklab.hms.harvard.edu/projects/PrimeKG),

  [Code](https://github.com/mims-harvard/PrimeKG),

  [Data](https://doi.org/10.7910/DVN/IXA7BM)

  - Full name: Precision Medicine Knowledge Graph

  - Content:

    - 129,375 nodes of 10 types

    - 4,050,249 edges of 30 types

    - Extracted from 20 data sources

    - Provided as triples in a CSV file 

  - Scope:

    - "ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved drugs with their therapeutic action"

    - "improves on coverage of diseases, both rare and common, by one-to-two orders of magnitude compared to existing knowledge graphs"

  - Goals:

    - "support research in precision medicine"

    - "linking biomedical knowledge to patient-level health information"

    - "personalized diagnostic strategies and targeted treatments"

    - "providing a holistic and multimodal view of diseases"

- **SPOKE** –

  [Publication (2023)](https://doi.org/10.1093/bioinformatics/btad080),

  [Website](https://spoke.ucsf.edu),

  [Code](https://github.com/cns-iu/spoke-vis),

  [API](https://spoke.rbvi.ucsf.edu/swagger/)

  - Full name: Scalable Precision Medicine Open Knowledge Engine

  - Content:

    - 27,056,367 nodes of 21 types

    - 53,264,489 edges of 55 types

    - Extracted from 41 databases

    - Provided as a REST API that accepts graph queries, but "not available as a bulk download"

  - Scope:

    - "ranging from molecular and cellular biology to pharmacology and clinical practice"

    - "focuses on experimentally determined information"

    - "computational predictions and text mining from the literature are not currently prioritized"

  - Goals:

    - "applications relevant to precision medicine"

    - "provide insights into the understanding of diseases, discovering of drugs and proactively improving personal health"

    - "drug repurposing"

    - "disease prediction and interpretation of transcriptomic data"

    - "predict diagnosis"

    - "predict biomedical outcomes in a biologically meaningful manner"

- **SynLethKG** –

  [Publication (2021)](https://doi.org/10.1093/bioinformatics/btab271),

  [Website](https://synlethdb.sist.shanghaitech.edu.cn),

  [Code](https://github.com/JieZheng-ShanghaiTech/KG4SL),

  [Data](https://github.com/JieZheng-ShanghaiTech/KG4SL/tree/main/data)

  - Full name: Synthetic Lethality Knowledge Graph

  - Content:

    - 54,012 nodes of 11 types

    - 2,231,921 edges of 24 types

    - Extracted from SynLethDB and various gene, drug and compound databases

    - Provided as triples in a CSV file

  - Scope:

    - "genes, compounds, diseases, biological processes and 24 kinds of relationships that could be pertinent to SL"

  - Goals:

    - "identify SL gene pairs"

    - "discovery of anti-cancer drug targets"

### Tools

- **BioCypher** –

  [Publication (2023)](https://doi.org/10.1038/s41587-023-01848-y),

  [Website](https://biocypher.org),

  [GitHub](https://github.com/biocypher/biocypher),

  [PyPI](https://pypi.org/project/biocypher/)

  - Scope:

    - "a Python library that provides a low-code access point to data processing and ontology manipulation"

    - "a modular architecture that maximizes reuse of data and code in three ways: input, ontology and output"

    - "adhere to FAIR (Findable, Accessible, Interoperable and Reusable) and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) principles"

  - Goals:

    - "make the process of creating a biomedical knowledge graph easier than ever, but still flexible and transparent"

    - "abstracting the KG build process as a combination of modular input adapters"

    - "provides easy access to state-of-the-art KGs to the average biomedical researcher"

    - "creating a more interoperable biomedical research community"

- **KGX** –

  [Website](https://kgx.readthedocs.io),

  [GitHub](https://github.com/biolink/kgx),

  [PyPI](https://pypi.org/project/kgx)

  - Scope:

    - "a Python library and set of command line utilities"

    - "The core datamodel is a Property Graph (PG), represented internally in Python using a networkx MultiDiGraph model."

  - Goals:

    - "exchanging Knowledge Graphs (KGs) that conform to or are aligned to the Biolink Model"

    - "provide validation, to ensure the KGs are conformant to the Biolink Model"

### Databases

- **Collections**

  - [Database Commons](https://ngdc.cncb.ac.cn/databasecommons)

  - [NAR db status](https://nardbstatus.kalis-amts.de)

  - [Online Bioinformatics Resources Collection (OBRC)](https://www.hsls.pitt.edu/obrc)

### Ontologies and controlled vocabularies

- **Collections**

  - [BioPortal](https://www.bioontology.org)

  - [Ontology Lookup Service](https://www.ebi.ac.uk/ols4/)

- **Biolink Model** –

  [Publication (2022)](https://doi.org/10.1111/cts.13302)

  [Website](https://biolink.github.io/biolink-model)

  [Code](https://biolink.github.io/biolink-model/)

  - Scope:

    - "a unified data model that bridges across multiple ontologies, schemas, and data models"

    - "a map for bringing together data from different sources under one unified model, and as a bridge between ontological domains"

  - Goals:

    - "supported easier integration and interoperability of biomedical KGs"

    - "supports translation, integration, and harmonization across knowledge sources"

### File formats

- **KGX (.json, .jsonl, .tsv, .ttl)** –

  [Website](https://kgx.readthedocs.io/en/latest/kgx_format.html)

- **Neo4j (.dump)** –

  [Website](https://neo4j.com/docs/desktop-manual/current/operations/create-dump/),

  [Wikipedia](https://en.wikipedia.org/w/index.php?title=Neo4j)

- **Resource Description Framework (RDF)** –

  [Website](https://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/),

  [Wikipedia](https://en.wikipedia.org/wiki/Resource_Description_Framework)

  - **Turtle (.ttl)** –

    [Website](https://www.w3.org/TR/turtle),

    [Wikipedia](https://en.wikipedia.org/wiki/Turtle_(syntax))

  - **N-Triples (.nt)** –

    [Website](https://www.w3.org/TR/n-triples),

    [Wikipedia](https://en.wikipedia.org/w/index.php?title=N-Triples)

  - **Notation3 (.n3)** –

    [Website](https://www.w3.org/TeamSubmission/n3),

    [Wikipedia](https://en.wikipedia.org/w/index.php?title=Notation3)