{"id":13456949,"url":"https://github.com/BaranziniLab/KG_RAG","last_synced_at":"2025-03-24T11:32:13.445Z","repository":{"id":206613881,"uuid":"717297137","full_name":"BaranziniLab/KG_RAG","owner":"BaranziniLab","description":"Empower Large Language Models (LLM) using Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) for knowledge intensive tasks","archived":false,"fork":false,"pushed_at":"2024-09-14T23:13:03.000Z","size":10840,"stargazers_count":650,"open_issues_count":5,"forks_count":88,"subscribers_count":16,"default_branch":"main","last_synced_at":"2024-10-18T22:40:55.178Z","etag":null,"topics":["bert-models","bioinformatics","bioinformatics-algorithms","biomedical-applications","biomedical-informatics","context-aware","gpt","gpt35turbo","gpt4","knowledge-base","knowledge-graph","large-language-models","llama","llama2","llm","prompt-engineering","prompt-tuning","rag","retrieval-augmented-generation","sentence-transformers"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BaranziniLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-11T03:22:52.000Z","updated_at":"2024-10-18T09:14:46.000Z","dependencies_parsed_at":"2023-12-26T00:27:55.615Z","dependency_job_id":"c01502ce-60ac-422c-8929-decf64ea55bb","html_url":"https://github.com/BaranziniLab/KG_RAG","commit_stats":{"total_commits":562,"total_committers":6,"mean_commits":93.66666666666667,"dds":"0.11387900355871883","last_synced_commit":"e9c6fbc010bda822df4a34b3e0a0a28b015b2b5f"},"previous_names":["baranzinilab/kg_rag"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaranziniLab%2FKG_RAG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaranziniLab%2FKG_RAG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaranziniLab%2FKG_RAG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaranziniLab%2FKG_RAG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BaranziniLab","download_url":"https://codeload.github.com/BaranziniLab/KG_RAG/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221962483,"owners_count":16908339,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert-models","bioinformatics","bioinformatics-algorithms","biomedical-applications","biomedical-informatics","context-aware","gpt","gpt35turbo","gpt4","knowledge-base","knowledge-graph","large-language-models","llama","llama2","llm","prompt-engineering","prompt-tuning","rag","retrieval-augmented-generation","sentence-transformers"],"created_at":"2024-07-31T08:01:30.692Z","updated_at":"2024-10-29T00:31:38.849Z","avatar_url":"https://github.com/BaranziniLab.png","language":"Jupyter Notebook","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/BaranziniLab/KG_RAG/assets/42702311/0b2f5b42-761e-4d5b-8d6f-77c8b965f017\" width=\"450\"\u003e\n\u003c/p\u003e\n\n\n\n\n## Table of Contents\n[What is KG-RAG](https://github.com/BaranziniLab/KG_RAG#what-is-kg-rag)\n\n[Example use case of KG-RAG](https://github.com/BaranziniLab/KG_RAG#example-use-case-of-kg-rag)\n - [Prompting GPT without KG-RAG](https://github.com/BaranziniLab/KG_RAG#without-kg-rag)  \n - [Prompting GPT with KG-RAG](https://github.com/BaranziniLab/KG_RAG#with-kg-rag)\n - [Example notebook for KG-RAG with GPT](https://github.com/BaranziniLab/KG_RAG/blob/main/notebooks/kg_rag_based_gpt_prompts.ipynb)\n\n[How to run KG-RAG](https://github.com/BaranziniLab/KG_RAG#how-to-run-kg-rag)\n - [Step 1: Clone the repo](https://github.com/BaranziniLab/KG_RAG#step-1-clone-the-repo)\n - [Step 2: Create a virtual environment](https://github.com/BaranziniLab/KG_RAG#step-2-create-a-virtual-environment)\n - [Step 3: Install dependencies](https://github.com/BaranziniLab/KG_RAG#step-3-install-dependencies)\n - [Step 4: Update config.yaml](https://github.com/BaranziniLab/KG_RAG#step-4-update-configyaml)\n - [Step 5: Run the setup 
script](https://github.com/BaranziniLab/KG_RAG#step-5-run-the-setup-script)\n - [Step 6: Run KG-RAG from your terminal](https://github.com/BaranziniLab/KG_RAG#step-6-run-kg-rag-from-your-terminal)\n    - [Using GPT](https://github.com/BaranziniLab/KG_RAG#using-gpt)\n    - [Using GPT interactive mode](https://github.com/BaranziniLab/KG_RAG/blob/main/README.md#using-gpt-interactive-mode)\n    - [Using Llama](https://github.com/BaranziniLab/KG_RAG#using-llama)\n    - [Using Llama interactive mode](https://github.com/BaranziniLab/KG_RAG/blob/main/README.md#using-llama-interactive-mode)\n  - [Command line arguments for KG-RAG](https://github.com/BaranziniLab/KG_RAG?tab=readme-ov-file#command-line-arguments-for-kg-rag)\n  \n[BiomixQA: Benchmark dataset](https://github.com/BaranziniLab/KG_RAG/tree/main?tab=readme-ov-file#biomixqa-benchmark-dataset)\n\n[Citation](https://github.com/BaranziniLab/KG_RAG/blob/main/README.md#citation)\n\n\n## What is KG-RAG?\n\nKG-RAG stands for Knowledge Graph-based Retrieval Augmented Generation.\n\n### Start by watching the video of KG-RAG\n\n\u003cvideo src=\"https://github.com/BaranziniLab/KG_RAG/assets/42702311/86e5b8a3-eb58-4648-95a4-271e9c69b4ed\" controls=\"controls\" style=\"max-width: 730px;\"\u003e\n\u003c/video\u003e\n\nIt is a task-agnostic framework that combines the explicit knowledge of a Knowledge Graph (KG) with the implicit knowledge of a Large Language Model (LLM). Here is the [arXiv preprint](https://arxiv.org/abs/2311.17330) of the work.\n\nHere, we utilize a massive biomedical KG called [SPOKE](https://spoke.ucsf.edu/) as the provider for the biomedical context. SPOKE has incorporated over 40 biomedical knowledge repositories from diverse domains, each focusing on biomedical concepts such as genes, proteins, drugs, compounds, diseases, and their established connections. 
SPOKE consists of more than 27 million nodes of 21 different types and 53 million edges of 55 types [[Ref](https://doi.org/10.1093/bioinformatics/btad080)].\n\n\nThe main feature of KG-RAG is that it extracts \"prompt-aware context\" from the SPOKE KG, which is defined as: \n\n**the minimal context sufficient to respond to the user prompt.** \n\nHence, this framework empowers a general-purpose LLM by incorporating an optimized domain-specific 'prompt-aware context' from a biomedical KG.\n\n## Example use case of KG-RAG\nThe following snippet shows the news from the FDA [website](https://www.fda.gov/drugs/news-events-human-drugs/fda-approves-treatment-weight-management-patients-bardet-biedl-syndrome-aged-6-or-older) about the drug **\"setmelanotide\"**, approved by the FDA for weight management in patients with *Bardet-Biedl Syndrome*.\n\n\u003cimg src=\"https://github.com/BaranziniLab/KG_RAG/assets/42702311/fc4d0b8d-0edb-461d-86c5-9d0d191bd97d\" width=\"600\" height=\"350\"\u003e\n\n### Ask GPT-4 about the above drug:\n\n### WITHOUT KG-RAG\n\n*Note: This example was run using KG-RAG v0.3.0. We are prompting GPT from the terminal, NOT from the ChatGPT browser. The temperature parameter is set to 0 for all analyses. Refer to [this](https://github.com/BaranziniLab/KG_RAG/blob/main/config.yaml) yaml file for the parameter settings.*\n\n\u003cvideo src=\"https://github.com/BaranziniLab/KG_RAG/assets/42702311/dbabb812-2a8a-48b6-9785-55b983cb61a4\" controls=\"controls\" style=\"max-width: 730px;\"\u003e\n\u003c/video\u003e\n\n### WITH KG-RAG\n\n*Note: This example was run using KG-RAG v0.3.0. The temperature parameter is set to 0 for all analyses. 
Refer to [this](https://github.com/BaranziniLab/KG_RAG/blob/main/config.yaml) yaml file for the parameter settings.*\n\n\u003cvideo src=\"https://github.com/BaranziniLab/KG_RAG/assets/42702311/acd08954-a496-4a61-a3b1-8fc4e647b2aa\" controls=\"controls\" style=\"max-width: 730px;\"\u003e\n\u003c/video\u003e\n\nYou can see that KG-RAG was able to give the correct information about the FDA-approved [drug](https://www.fda.gov/drugs/news-events-human-drugs/fda-approves-treatment-weight-management-patients-bardet-biedl-syndrome-aged-6-or-older).\n\n## How to run KG-RAG\n\n**Note: At the moment, KG-RAG is specifically designed for running prompts related to diseases. We are actively working on improving its versatility.**\n\n### Step 1: Clone the repo\n\nClone this repository. All biomedical data used in the paper are included in this repository, so you don't have to download them separately.\n\n### Step 2: Create a virtual environment\nNote: Scripts in this repository were run using Python 3.10.9\n```\nconda create -n kg_rag python=3.10.9\nconda activate kg_rag\ncd KG_RAG\n```\n\n### Step 3: Install dependencies\n\n```\npip install -r requirements.txt\n```\n\n### Step 4: Update config.yaml \n\n[config.yaml](https://github.com/BaranziniLab/KG_RAG/blob/main/config.yaml) holds all the necessary information required to run the scripts on your machine. Make sure to populate [this](https://github.com/BaranziniLab/KG_RAG/blob/main/config.yaml) yaml file accordingly.\n\nNote: There is another yaml file called [system_prompts.yaml](https://github.com/BaranziniLab/KG_RAG/blob/main/system_prompts.yaml). 
This is already populated and holds all the system prompts used in the KG-RAG framework.\n\n### Step 5: Run the setup script\nNote: Make sure you are in the KG_RAG folder\n\nThe setup script runs in an interactive fashion.\n\nRunning the setup script will: \n\n- create the disease vector database for KG-RAG\n- download the Llama model to your machine (optional; you can skip this and that is totally fine)\n\n```\npython -m kg_rag.run_setup\n```\n\n### Step 6: Run KG-RAG from your terminal\nNote: Make sure you are in the KG_RAG folder\n\nYou can run KG-RAG using GPT or Llama models. \n\n#### Using GPT\n\n```\n# GPT_API_TYPE='azure'\npython -m kg_rag.rag_based_generation.GPT.text_generation -g \u003cyour favorite gpt model - \"gpt-4\" or \"gpt-35-turbo\"\u003e\n# GPT_API_TYPE='openai'\npython -m kg_rag.rag_based_generation.GPT.text_generation -g \u003cyour favorite gpt model - \"gpt-4\" or \"gpt-3.5-turbo\"\u003e\n```\n\nExample:\n\nNote: The following example was run on an AWS p3.8xlarge EC2 instance using KG-RAG v0.3.0.\n\n\u003cvideo src=\"https://github.com/BaranziniLab/KG_RAG/assets/42702311/defcbff7-e777-4db6-b028-10f54c76b234\" controls=\"controls\" style=\"max-width: 730px;\"\u003e\n\u003c/video\u003e\n\n#### Using GPT interactive mode\n\nThis allows the user to go over each step of the process in an interactive fashion.\n\n```\n# GPT_API_TYPE='azure'\npython -m kg_rag.rag_based_generation.GPT.text_generation -i True -g \u003cyour favorite gpt model - \"gpt-4\" or \"gpt-35-turbo\"\u003e\n# GPT_API_TYPE='openai'\npython -m kg_rag.rag_based_generation.GPT.text_generation -i True -g \u003cyour favorite gpt model - \"gpt-4\" or \"gpt-3.5-turbo\"\u003e\n```\n\n#### Using Llama\nNote: If you haven't downloaded Llama during the [setup](https://github.com/BaranziniLab/KG_RAG#step-5-run-the-setup-script) step, then when you run the following, it may take some time since it will download the model first.\n\n```\npython -m kg_rag.rag_based_generation.Llama.text_generation -m \u003cmethod-1 or 
method-2; if nothing is mentioned it defaults to 'method-1'\u003e\n```\n\nExample:\n\nNote: The following example was run on an AWS p3.8xlarge EC2 instance using KG-RAG v0.3.0.\n\n\u003cvideo src=\"https://github.com/BaranziniLab/KG_RAG/assets/42702311/94bda923-dafb-451a-943a-1d7c65f3ffd4\" controls=\"controls\" style=\"max-width: 730px;\"\u003e\n\u003c/video\u003e\n\n#### Using Llama interactive mode\n\nThis allows the user to go over each step of the process in an interactive fashion.\n\n```\npython -m kg_rag.rag_based_generation.Llama.text_generation -i True -m \u003cmethod-1 or method-2; if nothing is mentioned it defaults to 'method-1'\u003e\n```\n\n### Command line arguments for KG-RAG\n\n| Argument | Default Value         | Definition                                               | Allowed Options                     | Notes                                                            |\n|----------|-----------------|----------------------------------------------------------|------------------------------------|------------------------------------------------------------------|\n| -g       | gpt-35-turbo    | GPT model selection                                      | GPT models provided by OpenAI     | Use only for GPT models                                          |\n| -i       | False           | Flag for interactive mode (shows step-by-step)           | True or False                      | Can be used for both GPT and Llama models                        |\n| -e       | False           | Flag for showing evidence of association from the graph | True or False                      | Can be used for both GPT and Llama models                        |\n| -m       | method-1        | Which tokenizer method to use                            | method-1 or method-2. 
method-1 uses 'AutoTokenizer'; method-2 uses 'LlamaTokenizer' with the 'legacy' flag set to False when initializing the tokenizer              | Use only for Llama models|\n\n\n## BiomixQA: Benchmark dataset\n\nBiomixQA is a curated biomedical question-answering dataset used to validate the KG-RAG framework across different LLMs. It consists of:\n\n- Multiple Choice Questions (MCQ)\n- True/False Questions\n\nThe diverse nature of questions in this dataset, spanning multiple choice and true/false formats, along with its coverage of various biomedical concepts, makes it particularly suitable for supporting research and development in biomedical natural language processing, knowledge graph reasoning, and question-answering systems.\n\nThis dataset is currently hosted on Hugging Face and you can find it [here](https://huggingface.co/datasets/kg-rag/BiomixQA).\n\nIt’s easy to get started with BiomixQA: just three lines of Python to load the dataset:\n\n```\nfrom datasets import load_dataset\n\n# For MCQ data\nmcq_data = load_dataset(\"kg-rag/BiomixQA\", \"mcq\")\n\n# For True/False data\ntf_data = load_dataset(\"kg-rag/BiomixQA\", \"true_false\")\n```\n\n\n## Citation\n\n```\n@article{soman2023biomedical,\n  title={Biomedical knowledge graph-enhanced prompt generation for large language models},\n  author={Soman, Karthik and Rose, Peter W and Morris, John H and Akbas, Rabia E and Smith, Brett and Peetoom, Braian and Villouta-Reyes, Catalina and Cerono, Gabriel and Shi, Yongmei and Rizk-Jackson, Angela and others},\n  journal={arXiv preprint arXiv:2311.17330},\n  year={2023}\n}\n```\n","funding_links":[],"categories":["A01_文本生成_文本对话","Jupyter 
Notebook"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBaranziniLab%2FKG_RAG","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBaranziniLab%2FKG_RAG","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBaranziniLab%2FKG_RAG/lists"}