{"id":48925854,"url":"https://github.com/mims-harvard/optimuskg","last_synced_at":"2026-05-02T20:02:05.119Z","repository":{"id":351918264,"uuid":"914101573","full_name":"mims-harvard/OptimusKG","owner":"mims-harvard","description":"A modern multimodal knowledge graph with type-specific metadata across biomedical domains.","archived":false,"fork":false,"pushed_at":"2026-04-17T06:11:20.000Z","size":9418,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-04-17T07:02:37.362Z","etag":null,"topics":["biomedical","graph-ai","heterogeneous-graphs","knowledge-graph","multimodal-ai","multimodal-data","neo4j","ontology","python"],"latest_commit_sha":null,"homepage":"https://optimuskg.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mims-harvard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-01-09T00:32:11.000Z","updated_at":"2026-04-17T06:11:23.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mims-harvard/OptimusKG","commit_stats":null,"previous_names":["mims-harvard/optimuskg"],"tags_count":86,"template":false,"template_full_name":null,"purl":"pkg:github/mims-harvard/OptimusKG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mims-harvard%2FOptimusKG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mims-harvard%2FOptimusKG/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mims-harvard%2FOptimusKG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mims-harvard%2FOptimusKG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mims-harvard","download_url":"https://codeload.github.com/mims-harvard/OptimusKG/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mims-harvard%2FOptimusKG/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32069440,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T21:26:33.338Z","status":"ssl_error","status_checked_at":"2026-04-20T21:26:22.081Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biomedical","graph-ai","heterogeneous-graphs","knowledge-graph","multimodal-ai","multimodal-data","neo4j","ontology","python"],"created_at":"2026-04-17T07:00:23.869Z","updated_at":"2026-05-02T20:02:05.112Z","avatar_url":"https://github.com/mims-harvard.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cvideo\n    src=\"https://github.com/user-attachments/assets/7218f6e1-5049-4e32-bfe1-13e382b33c9e\"\n    controls\n    width=\"800\"\n    
poster=\"https://raw.githubusercontent.com/mims-harvard/optimuskg/main/assets/svg/optimuskg-logo.svg\"\n  \u003e\n    \u003ca target=\"_blank\" href=\"https://optimuskg.ai\" style=\"background:none\"\u003e\n      \u003cimg src=\"https://raw.githubusercontent.com/mims-harvard/optimuskg/main/assets/svg/optimuskg-logo.svg\" alt=\"Made at the Zitnik Lab\" width=\"600\"\u003e\n    \u003c/a\u003e\n  \u003c/video\u003e\n\u003c/div\u003e\n\n[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)\n[![GitHub Stars](https://img.shields.io/github/stars/mims-harvard/OptimusKG)](https://github.com/mims-harvard/OptimusKG)\n[![DOI](https://img.shields.io/badge/DOI-10.7910%2FDVN%2FIYNGEV-blue)](https://doi.org/10.7910/DVN/IYNGEV)\n[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit\u0026logoColor=white)](https://github.com/pre-commit/pre-commit)\n[![Website](https://img.shields.io/badge/docs-optimuskg.ai-blue)](https://optimuskg.ai)\n\n## Highlights\n\n- A [modern biomedical knowledge graph](https://optimuskg.ai) with molecular, anatomical, clinical, and environmental modalities.\n- Integrates 65 heterogeneous resources grounded with 18 ontologies and controlled vocabularies using the [BioCypher framework](https://github.com/biocypher/biocypher) and the [Biolink Model](https://github.com/biolink/biolink-model).\n- Contains 190,531 nodes across 10 entity types, 21,813,816 edges across 26 relation types, and 67,249,863 property instances encoding 110,276,843 values across 150 distinct property keys.\n- Independently validated using [PaperQA3](https://github.com/Future-House/paper-qa), a multimodal agent that retrieves and reasons over 
scientific literature.\n- Reproducible, deterministic, and infrastructure-agnostic data pipeline with parallel execution.\n- Distributed as [Apache Parquet](https://parquet.apache.org/) files and downloadable via the [optimuskg]() Python client.\n\nOptimusKG is developed at the [Zitnik Lab](https://zitniklab.hms.harvard.edu/), [Harvard Medical School](https://dbmi.hms.harvard.edu/).\n\n## Using OptimusKG\n\nOptimusKG is available via [Harvard Dataverse](https://doi.org/10.7910/DVN/IYNGEV). The graph can be accessed programmatically with the Python client, available on [PyPI](https://pypi.org/project/optimuskg/):\n\n```bash\n# With pip.\npip install optimuskg\n```\n\n```bash\n# Or pipx.\npipx install optimuskg\n```\n\nThe client fetches files from the gold layer with local caching, and supports loading the graph either as [Polars DataFrames](https://github.com/pola-rs/polars) or as a [NetworkX MultiDiGraph](https://networkx.org/documentation/stable/reference/classes/multidigraph.html):\n\n```python\nimport optimuskg\n\n# Download a specific file and store it locally\nlocal_path = optimuskg.get_file(\"nodes/gene.parquet\")\n\n# Load a single Parquet file as a Polars DataFrame\ndrugs = optimuskg.load_parquet(\"nodes/drug.parquet\")\n\n# Load nodes and edges as Polars DataFrames\n# Set lcc=True to load only the largest connected component\nnodes, edges = optimuskg.load_graph(lcc=True)\n\n# Load the graph as a NetworkX MultiDiGraph with metadata\n# Set lcc=True to load only the largest connected component\nG = optimuskg.load_networkx(lcc=True)\n```\n\n\u003e [!NOTE]\n\u003e Downloads are cached by default in `platformdirs.user_cache_dir(\"optimuskg\")` (`~/Library/Caches/optimuskg` on macOS, `~/.cache/optimuskg` on Linux, and `C:\\Users\\\u003cUser\u003e\\AppData\\Local\\optimuskg\\optimuskg` on Windows). 
The cache location can be overridden via the `$OPTIMUSKG_CACHE_DIR` environment variable or programmatically with `optimuskg.set_cache_dir(path)`.\n\n\u003e [!NOTE]\n\u003e To target a different dataset (_e.g._, a pre-release), set the `$OPTIMUSKG_DOI` environment variable or use `optimuskg.set_doi(\"doi:10.xxxx/XXXX\")`.\n\n## Data pipeline\n\nThe pipeline architecture consists of the following components:\n\n| Component | Description |\n| ---- | --- |\n| [**catalog**](https://optimuskg.ai/the-catalog) | The single source of truth for all datasets, including their schemas, formats, and metadata. |\n| [**dataset**](https://docs.kedro.org/en/unreleased/extend/how_to_create_a_custom_dataset/) | An abstraction that handles file formats, storage locations, and persistence logic. |\n| [**node**](https://docs.kedro.org/en/unreleased/getting-started/kedro_concepts/#node) | A pure Python function whose output value follows solely from its input values. |\n| [**pipeline**](https://docs.kedro.org/en/unreleased/getting-started/kedro_concepts/#pipeline) | A sequence of nodes wired into a DAG-based workflow, organized by the datasets they consume and produce. |\n| [**layer**]() | A logical grouping of the data that follows the medallion architecture design pattern. There are four layers: `landing`, `bronze`, `silver`, and `gold`. |\n| [**parameters**](https://docs.kedro.org/en/unreleased/configure/parameters/) | Constants used to filter the data throughout the construction process. |\n| [**provider**]() | An abstraction that provides versioned, automatic data downloads from different data sources. |\n| [**hook**](https://docs.kedro.org/en/unreleased/extend/hooks/introduction/) | A mechanism that injects custom behavior into the core execution flow, such as before a node runs. |\n| [**conf**]() | A mechanism that separates _code_ from _settings_, defining the catalog, parameters, logging configuration, and ontology harmonization across different environments. 
|\n\n\u003e [!NOTE]\n\u003e We leverage additional features of the [Kedro framework](https://github.com/kedro-org/kedro), such as [namespaces](https://docs.kedro.org/en/latest/build/namespaces/), [kedro-viz](https://docs.kedro.org/projects/kedro-viz/en/latest/), [kedro-datasets](https://docs.kedro.org/projects/kedro-datasets/en/latest/) and catalog injection in [Jupyter notebooks](https://docs.kedro.org/en/latest/integrations-and-plugins/notebooks_and_ipython/kedro_and_notebooks/#exploring-the-kedro-project-in-a-notebook).\n\n## Running the pipeline\n\nThe pipeline is designed to generate the full knowledge graph and all the intermediate datasets used to generate it in one command:\n\n```console\n$ uv run kedro run --to-nodes gold.export_kg --runner=optimuskg.runners.FixedParallelRunner --async\n\n[01/28/25 19:29:07] INFO     Using 'conf/logging.yml' as logging configuration. You can change this by setting the KEDRO_LOGGING_CONFIG environment variable accordingly.\n[01/28/25 19:29:08] INFO     Kedro project optimuskg\n[01/28/25 19:29:09] INFO     Using synchronous mode for loading and saving data. Use the --async flag for potential performance gains.\n```\n\nThis will automatically download all the necessary data, store it in the `landing` layer, and execute the `bronze`, `silver`, and `gold` layers to finally export the graph inside the `data/gold/kg/` directory.\n\n\u003e [!NOTE]\n\u003e It is recommended to use the `optimuskg.runners.FixedParallelRunner`\n\u003e to run the nodes within a pipeline concurrently, and the [async](https://docs.kedro.org/en/latest/build/run_a_pipeline/#load-and-save-asynchronously) flag to reduce load and save time by using asynchronous mode. 
The Kedro default [ParallelRunner](https://docs.kedro.org/en/latest/build/run_a_pipeline/#parallelrunner) contains a bug that prevents it from running any validation checks.\n\n\u003e [!TIP]\n\u003e The location, schema, and format of each dataset are specified in the [catalog](conf/base/catalog/).\n\n\u003e [!TIP]\n\u003e Run `make help` for a list of available Make commands, and `uv run cli --help` for additional CLI utilities.\n\n\u003e [!NOTE]\n\u003e The pipeline automatically downloads public datasets and ingests them into the `landing` layer.\n\u003e\n\u003e Place any private datasets under `data/landing`. If they are absent, the [`Origin Hook`](https://github.com/mims-harvard/optimuskg/blob/main/optimuskg/hooks/origin/origin_hooks.py) will create empty placeholders, allowing dependent nodes to run even if the private data is missing.\n\n## Contributing\n\nWe are passionate about supporting contributors of all levels of experience and would love to see you get involved in the project. See the [contributing guide](CONTRIBUTING.md) to get started.\n\n## Citation\n\nIf you use OptimusKG in your research, please cite:\n\n```bibtex\n@article{vittor2026optimuskg,\n  title={OptimusKG: Unifying biomedical knowledge in a modern multimodal graph},\n  author={Vittor, Lucas and Noori, Ayush and Arango, I{\\~n}aki and Polonuer, Joaqu{\\'\\i}n and Rodriques, Sam and White, Andrew and Clifton, David A. and Zitnik, Marinka},\n  journal={In review},\n  year={2026}\n}\n```\n\n## License\n\nThe OptimusKG codebase is released under the [MIT License](LICENSE). OptimusKG integrates multiple primary data resources, each of which is subject to its own license and terms of use. These terms may impose restrictions on redistribution, commercial use, or downstream applications of the resulting knowledge graph or its subsets. Some resources provide data under academic or noncommercial licenses, while others may impose attribution or usage requirements. 
As a result, use of OptimusKG may be partially restricted depending on the specific data components included in a given instantiation. Users are responsible for reviewing and complying with the license and terms of use of each primary dataset, as specified by the original data providers. OptimusKG does not alter or override these source-specific licensing conditions.\n\n\u003cp align=\"center\"\u003e\n  Made with ❤️ at \u003ca href=\"https://zitniklab.hms.harvard.edu/\"\u003eZitnik Lab\u003c/a\u003e, Harvard Medical School\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmims-harvard%2Foptimuskg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmims-harvard%2Foptimuskg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmims-harvard%2Foptimuskg/lists"}