{"id":49039104,"url":"https://github.com/EBISPOT/GrEBI","last_synced_at":"2026-06-07T23:01:03.916Z","repository":{"id":214825790,"uuid":"737447552","full_name":"EBISPOT/GrEBI","owner":"EBISPOT","description":"HPC aggregation pipeline and API/MCP server for LLM-mediated biomedical data integration","archived":false,"fork":false,"pushed_at":"2026-06-02T14:09:36.000Z","size":41807,"stargazers_count":4,"open_issues_count":16,"forks_count":3,"subscribers_count":3,"default_branch":"dev","last_synced_at":"2026-06-02T14:24:16.699Z","etag":null,"topics":["bioinformatics","data-integration","data-mining","knowledge-graphs","mcp","mcp-server","neo4j","ontologies"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EBISPOT.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2023-12-31T04:12:08.000Z","updated_at":"2026-06-02T14:16:31.000Z","dependencies_parsed_at":"2024-01-05T05:26:07.835Z","dependency_job_id":"2cf117fd-bbda-44d6-bd3c-2c7b572e1650","html_url":"https://github.com/EBISPOT/GrEBI","commit_stats":null,"previous_names":["ebispot/grebi","ebispot/ontograph"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/EBISPOT/GrEBI","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EBISPOT%2FGrEBI","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EBISPOT%2FGrEBI/tags","releases_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/EBISPOT%2FGrEBI/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EBISPOT%2FGrEBI/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EBISPOT","download_url":"https://codeload.github.com/EBISPOT/GrEBI/tar.gz/refs/heads/dev","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EBISPOT%2FGrEBI/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34041089,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-07T02:00:07.652Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","data-integration","data-mining","knowledge-graphs","mcp","mcp-server","neo4j","ontologies"],"created_at":"2026-04-19T14:00:39.449Z","updated_at":"2026-06-07T23:01:03.911Z","avatar_url":"https://github.com/EBISPOT.png","language":"Jupyter Notebook","funding_links":[],"categories":["Biomedical Research \u0026 Genomics"],"sub_categories":[],"readme":"# GrEBI (Graphs@EBI)\n\nHPC pipeline using ontologies and LLM embeddings to aggregate knowledge graphs from [EMBL-EBI resources](https://www.ebi.ac.uk/services/data-resources-and-tools), the [MONARCH Initiative](https://monarch-initiative.github.io/monarch-ingest/Sources/), [DisMech](https://dismech.monarchinitiative.org/), 
[ROBOKOP](https://robokop.renci.org/), [Ubergraph](https://github.com/INCATools/ubergraph), and other sources.\n\nThe aim is to make it easier for humans and machines to perform integrative queries which span multiple biomedical resources, in contrast to existing REST APIs which are typically constrained to one resource.\n\nA development server with the output of this pipeline can be accessed at https://wwwdev.ebi.ac.uk/kg\n\nMCP endpoint: https://wwwdev.ebi.ac.uk/kg/api/v1/mcp (Streamable HTTP)\n\nThe GrEBI pipeline is being applied to a number of projects including the [International Mouse Phenotyping Consortium (IMPC)](https://www.mousephenotype.org/) knowledge graph and the [EMBL Human Ecosystems Transversal Theme (HETT)](https://www.embl.org/about/info/human-ecosystems/) ExposomeKG.\n\n\u003cimg src=\"https://www.embl.org/guidelines/design/wp-content/uploads/2022/02/EMBL_logo_colour_DIGITAL.png\" width=100 /\u003e\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u003cimg src=\"https://monarch-initiative.github.io/monarch-ingest/images/monarch-initiative.png\" width=100 /\u003e\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u003cimg src=\"https://www.mousephenotype.org/wp-content/uploads/2022/08/IMPC_logo.svg\" width=100 /\u003e\n\n## Making the tests pass\n\nGrEBI has a suite of automated E2E tests that run the full pipeline on small synthetic datasets and compare the resulting Neo4j/Solr database contents against committed expected output in `tests/expected_output/`. 
If code changes alter the pipeline output such that it no longer matches the expected snapshots, the CI will fail and you will need to update the expected output.\n\nThere are four test subgraphs, each exercising a different aspect of the pipeline:\n\n| Test subgraph | Purpose |\n| --- | --- |\n| `test_clique_merge` | Verifies equivalent entities are merged into a single clique |\n| `test_edge_linking` | Verifies property values referencing other entities become graph edges |\n| `test_multi_datasource` | Verifies merging data from two separate datasources |\n| `test_type_hierarchy` | Verifies type superclass propagation through `rdfs:subClassOf` |\n\n### Prerequisites\n\nYou need Docker with the `docker compose` plugin and enough disk space to build the image. Build it locally before running the tests:\n\n    docker build -t ghcr.io/ebispot/grebi_combined:dev .\n\n### Running all tests\n\nRun the full E2E test suite across all four test subgraphs:\n\n    bash tests/run_all_e2e.sh\n\nThis will run each test subgraph through the full Nextflow pipeline (ingest → assign IDs → merge → index → link → create Neo4j → run queries → create Solr → integration tests), export DB snapshots, and compare them against `tests/expected_output/`.\n\n### Running a single test\n\nTo run only one test subgraph:\n\n    bash tests/run_e2e.sh test_clique_merge\n\n### Updating expected output\n\nWhen your changes intentionally alter the pipeline output, you need to update the expected snapshots. Run the pipeline for the affected test subgraph, inspect the changes, and commit them:\n\n    export GREBI_SUBGRAPHS=test_clique_merge\n    export GREBI_NF_EXTRA_ARGS=\"--export_snapshots true\"\n    bash dataload/scripts/dataload_local.sh\n\nCopy the new snapshots to expected output:\n\n    cp out/test_clique_merge/test_clique_merge_snapshot_*.jsonl \\\n       tests/expected_output/test_clique_merge/\n\nNow inspect the changes with `git diff` and make sure they are intentional. 
When you are happy, stage and commit the updated expected output:\n\n    git add -A tests/expected_output/\n    git commit -m \"Update expected test output\"\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEBISPOT%2FGrEBI","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FEBISPOT%2FGrEBI","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEBISPOT%2FGrEBI/lists"}