{"id":46430709,"url":"https://github.com/benitomartin/biomedical-graphrag","last_synced_at":"2026-03-05T18:07:39.484Z","repository":{"id":340170988,"uuid":"1074238503","full_name":"benitomartin/biomedical-graphrag","owner":"benitomartin","description":"A comprehensive GraphRAG (Graph Retrieval-Augmented Generation) system designed for biomedical research","archived":false,"fork":false,"pushed_at":"2026-02-23T15:44:26.000Z","size":1760,"stargazers_count":96,"open_issues_count":0,"forks_count":23,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-02-23T23:46:37.180Z","etag":null,"topics":["large-language-models","neo4j","openai","python","qdrant","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"https://aiechoes.substack.com/p/building-a-biomedical-graphrag-when","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/benitomartin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-11T12:10:01.000Z","updated_at":"2026-02-23T15:45:08.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/benitomartin/biomedical-graphrag","commit_stats":null,"previous_names":["benitomartin/biomedical-graphrag"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/benitomartin/biomedical-graphrag","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benitomartin%2Fbiomedical-graphrag","tags_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/benitomartin%2Fbiomedical-graphrag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benitomartin%2Fbiomedical-graphrag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benitomartin%2Fbiomedical-graphrag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/benitomartin","download_url":"https://codeload.github.com/benitomartin/biomedical-graphrag/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benitomartin%2Fbiomedical-graphrag/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30141494,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T16:58:46.102Z","status":"ssl_error","status_checked_at":"2026-03-05T16:58:45.706Z","response_time":93,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-models","neo4j","openai","python","qdrant","retrieval-augmented-generation"],"created_at":"2026-03-05T18:07:34.545Z","updated_at":"2026-03-05T18:07:39.475Z","avatar_url":"https://github.com/benitomartin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Biomedical GraphRAG\n\n![Neo4j UI](static/image.png)\n\n\u003cdiv align=\"center\"\u003e\n\n\u003c!-- Project Status --\u003e\n\n[![License: 
MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python version](https://img.shields.io/badge/python-3.13.8-blue.svg)](https://www.python.org/downloads/)\n[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)\n\n\u003c!-- Providers --\u003e\n\n[![Qdrant](https://img.shields.io/badge/Qdrant-1.15.1-5A31F4?logo=qdrant\u0026logoColor=white)](https://qdrant.tech/)\n[![Neo4j](https://img.shields.io/badge/Neo4j-5.28.2-008CC1?logo=neo4j\u0026logoColor=white)](https://neo4j.com/)\n[![OpenAI](https://img.shields.io/badge/OpenAI-2.3.0-412991?logo=openai\u0026logoColor=white)](https://openai.com/)\n\n\u003c/div\u003e\n\n## Table of Contents\n\n- [Biomedical GraphRAG](#biomedical-graphrag)\n  - [Table of Contents](#table-of-contents)\n  - [Overview](#overview)\n  - [Project Structure](#project-structure)\n  - [Prerequisites](#prerequisites)\n  - [Installation](#installation)\n  - [Usage](#usage)\n    - [Configuration](#configuration)\n    - [Data Collection](#data-collection)\n    - [Infrastructure Setup](#infrastructure-setup)\n      - [Neo4j Graph Database](#neo4j-graph-database)\n      - [Qdrant Vector Database](#qdrant-vector-database)\n    - [Query Commands](#query-commands)\n      - [Qdrant Vector Search](#qdrant-vector-search)\n      - [Hybrid Neo4j + Qdrant Queries](#hybrid-neo4j--qdrant-queries)\n      - [Available Query Types](#available-query-types)\n      - [Sample Queries](#sample-queries)\n    - [Testing](#testing)\n    - [Quality Checks](#quality-checks)\n  - [License](#license)\n\n## Overview\n\nA comprehensive GraphRAG (Graph Retrieval-Augmented Generation) system designed for biomedical research. 
It combines knowledge graphs with vector search to provide intelligent querying and analysis of biomedical literature and genomic data.\n\n**Video:**\n\n\u003cdiv align=\"center\"\u003e\n\n\u003ca href=\"https://www.youtube.com/watch?v=3NWTi90i6C4\u0026t=200s\" target=\"_blank\"\u003e\n  \u003cimg src=\"images/pubmed_navigator.jpeg\" width=\"600\" alt=\"YouTube Video\"\u003e\n\u003c/a\u003e\n\n\u003c/div\u003e\n\n\u0026nbsp;\n\n**Article:**\n\n [Building a Biomedical GraphRAG: When Knowledge Graphs Meet Vector Search](https://aiechoes.substack.com/p/building-a-biomedical-graphrag-when)\n\n**Key Features:**\n\n- **Hybrid Query System**: Combines Neo4j graph database with Qdrant vector search for comprehensive biomedical insights\n- **Data Integration**: Processes PubMed papers, gene data, and research citations\n- **Intelligent Querying**: Uses LLM-powered tool selection for graph enrichment and semantic search\n- **Biomedical Schema**: Specialized graph schema for papers, authors, institutions, genes, and MeSH terms\n- **Async Processing**: High-performance async data collection and processing\n\n## Project Structure\n\n```text\nbiomedical-graphrag-pipeline/\n├── .github/                    # GitHub workflows and templates\n├── data/                       # Dataset storage (PubMed, Gene data)\n├── docs/                       # Documentation\n├── src/\n│   └── biomedical_graphrag/\n│       ├── application/        # Application layer\n│       │   ├── cli/            # Command-line interfaces\n│       │   └── services/       # Business logic services\n│       ├── config.py           # Configuration management\n│       ├── data_sources/       # Data collection modules\n│       ├── domain/             # Domain models and entities\n│       ├── infrastructure/     # Database and external service adapters\n│       └── utils/              # Utility functions\n├── static/                     # Static assets (images, etc.)\n├── tests/                      # Test suite\n├── 
LICENSE                     # MIT License\n├── Makefile                    # Build and development commands\n├── pyproject.toml              # Project configuration and dependencies\n├── README.md                   # This file\n└── uv.lock                     # Dependency lock file\n```\n\n## Prerequisites\n\n| Requirement                                            | Description                             |\n| ------------------------------------------------------ | --------------------------------------- |\n| [Python 3.13+](https://www.python.org/downloads/)      | Programming language                    |\n| [uv](https://docs.astral.sh/uv/)                       | Package and dependency manager          |\n| [Neo4j](https://neo4j.com/)                            | Graph database for knowledge graphs     |\n| [Qdrant](https://qdrant.tech/)                         | Vector database for embeddings          |\n| [OpenAI](https://openai.com/)                          | LLM provider for queries and embeddings |\n| [PubMed](https://www.ncbi.nlm.nih.gov/books/NBK25501/) | Biomedical literature database          |\n\n## Installation\n\n1. Clone the repository:\n\n   ```bash\n   git clone git@github.com:benitomartin/biomedical-graphrag.git\n   cd biomedical-graphrag\n   ```\n\n1. Create a virtual environment:\n\n   ```bash\n   uv venv\n   ```\n\n1. Activate the virtual environment:\n\n   ```bash\n   source .venv/bin/activate\n   ```\n\n1. Install the required packages:\n\n   ```bash\n   uv sync --all-groups --all-extras\n   ```\n\n1. 
Create a `.env` file in the root directory:\n\n   ```bash\n   cp env.example .env\n   ```\n\n## Usage\n\n### Configuration\n\nConfigure API keys, model names, and other settings by editing the `.env` file:\n\n```bash\n# OpenAI Configuration\nOPENAI__API_KEY=your_openai_api_key_here\nOPENAI__MODEL=gpt-4o-mini\nOPENAI__TEMPERATURE=0.0\nOPENAI__MAX_TOKENS=1500\n\n# Neo4j Configuration\nNEO4J__URI=bolt://localhost:7687\nNEO4J__USERNAME=neo4j\nNEO4J__PASSWORD=your_neo4j_password\nNEO4J__DATABASE=neo4j\n\n# Qdrant Configuration\nQDRANT__URL=http://localhost:6333\nQDRANT__API_KEY=your_qdrant_api_key\nQDRANT__COLLECTION_NAME=biomedical_papers\nQDRANT__EMBEDDING_MODEL=text-embedding-3-small\nQDRANT__EMBEDDING_DIMENSION=1536\n\n# PubMed Configuration (optional)\nPUBMED__EMAIL=your_email@example.com\nPUBMED__API_KEY=your_pubmed_api_key\n\n# Data Paths\nJSON_DATA__PUBMED_JSON_PATH=data/pubmed_dataset.json\nJSON_DATA__GENE_JSON_PATH=data/gene_dataset.json\n```\n\n### Data Collection\n\nThe system includes data collectors for biomedical and gene datasets:\n\n```bash\n# Collect PubMed papers and metadata\nmake pubmed-data-collector-run\n```\n\n```bash\n# Collect gene information related to the PubMed dataset\nmake gene-data-collector-run\n```\n\n### Infrastructure Setup\n\n#### Neo4j Graph Database\n\n```bash\n# Create the knowledge graph from datasets\nmake create-graph\n\n# Delete all graph data (clean slate)\nmake delete-graph\n```\n\n#### Qdrant Vector Database\n\n```bash\n# Create vector collection for embeddings\nmake create-qdrant-collection\n\n# Ingest embeddings into Qdrant\nmake ingest-qdrant-data\n\n# Delete vector collection\nmake delete-qdrant-collection\n```\n\n### Query Commands\n\n#### Qdrant Vector Search\n\n```bash\n# Run a custom query on the Qdrant vector store\nmake custom-qdrant-query QUESTION=\"Which institutions have collaborated most frequently on papers about 'Gene Editing' and 'Immunotherapy'?\"\n\n# Or run directly with the CLI\nuv run 
src/biomedical_graphrag/application/cli/query_vectorstore.py --ask \"Which institutions have collaborated most frequently on papers about 'Gene Editing' and 'Immunotherapy'?\"\n```\n\n#### Hybrid Neo4j + Qdrant Queries\n\n```bash\n# Run example queries on the Neo4j graph using GraphRAG\nmake example-graph-query\n\n# Run a custom natural language query using hybrid GraphRAG\nmake custom-graph-query QUESTION=\"What are the latest research trends in cancer immunotherapy?\"\n\n# Or run directly with the CLI\nuv run src/biomedical_graphrag/application/cli/fusion_query.py \"What are the latest research trends in cancer immunotherapy?\"\n```\n\n#### Available Query Types\n\n**Qdrant Queries:**\n\n- Semantic search across paper abstracts and content\n- Similarity-based retrieval using embeddings\n- Direct vector similarity queries\n\n**Hybrid Queries:**\n\n- Combines semantic search (Qdrant) with graph enrichment (Neo4j):\n  - Author collaboration networks\n  - Citation analysis and paper relationships\n  - Gene-paper associations\n  - MeSH term relationships\n  - Institution affiliations\n- LLM-powered automatic tool selection\n\n#### Sample Queries\n\n- Who collaborates with Jennifer Doudna on CRISPR research?\n\n- Which researchers work with Emmanuelle Charpentier on gene editing or genome engineering papers?\n\n- Who are George Church’s collaborators publishing on synthetic biology and genome sequencing?\n\n- List scientists collaborating with Feng Zhang on neuroscience studies\n\n- Which papers are related to PMID 31295471 based on shared MeSH terms?\n\n- Find papers similar to the CRISPR-Cas9 genome editing study with PMID 31295471\n\n- Show other studies linked by MeSH terms to PMID 27562951\n\n- Which genes are mentioned in the same papers as gag?\n\n- What genes appear together with HIF1A in cancer research?\n\n- Which genes are frequently co-mentioned with TP53?\n\n### Testing\n\nRun all tests:\n\n```bash\nmake tests\n```\n\n### Quality Checks\n\nRun all quality 
checks (lint, format, type check, clean):\n\n```bash\nmake all-check\nmake all-fix\n```\n\nIndividual Commands:\n\n- Display all available commands:\n\n  ```bash\n  make help\n  ```\n\n- Check static typing:\n\n  ```bash\n  make mypy\n  ```\n\n- Clean cache and build files:\n\n  ```bash\n  make clean\n  ```\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenitomartin%2Fbiomedical-graphrag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbenitomartin%2Fbiomedical-graphrag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenitomartin%2Fbiomedical-graphrag/lists"}