{"id":21441742,"url":"https://github.com/dmis-lab/olaph","last_synced_at":"2025-07-14T17:32:06.413Z","repository":{"id":241032990,"uuid":"799497988","full_name":"dmis-lab/OLAPH","owner":"dmis-lab","description":"OLAPH: Improving Factuality in Biomedical Long-form Question Answering","archived":false,"fork":false,"pushed_at":"2024-09-10T06:52:21.000Z","size":168336,"stargazers_count":38,"open_issues_count":1,"forks_count":4,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-11-06T02:36:17.482Z","etag":null,"topics":["biomedical-research","factuality","hallucination","question-answering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmis-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-12T10:34:26.000Z","updated_at":"2024-11-02T19:04:05.000Z","dependencies_parsed_at":"2024-06-11T04:07:50.755Z","dependency_job_id":null,"html_url":"https://github.com/dmis-lab/OLAPH","commit_stats":null,"previous_names":["dmis-lab/olaph"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FOLAPH","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FOLAPH/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FOLAPH/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmis-lab%2FOLAPH/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dmis-lab","download_ur
l":"https://codeload.github.com/dmis-lab/OLAPH/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225990493,"owners_count":17556152,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biomedical-research","factuality","hallucination","question-answering"],"created_at":"2024-11-23T01:41:23.523Z","updated_at":"2024-11-23T01:41:24.083Z","avatar_url":"https://github.com/dmis-lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OLAPH: Improving Factuality in Biomedical Long-form Question Answering\n\nThis is a repository for [OLAPH: Improving Factuality in Biomedical Long-form Question Answering](https://arxiv.org/abs/2405.12701) by Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, Taewhoo Lee, and Jaewoo Kang.\n\n[MedLFQA](https://huggingface.co/datasets/dmis-lab/MedLFQA) | [Self-BioRAG (OLAPH)](https://huggingface.co/dmis-lab/self-biorag-7b-olaph) | [BioMistral (OLAPH)](https://huggingface.co/dmis-lab/biomistral-7b-olaph) | [Mistral (OLAPH)](https://huggingface.co/dmis-lab/mistral-7b-olaph) | [Summary](https://www.linkedin.com/posts/minbyul-jeong-183000194_introducing-medlfqa-olaph-a-biomedical-activity-7198887412050112512-5eHq?utm_source=share\u0026utm_medium=member_desktop) | [Paper](https://arxiv.org/abs/2405.12701) \n\n![](figures/olaph.png)\n\n1) **MedLFQA** is a reconstructed format of long-form question-answering (LFQA) benchmark datasets in biomedical domain to facilitate automatic evaluation especially factuality (e.g., hallucination \u0026 
comprehensiveness).\n\n![](figures/motivation.png)\n\n2) **OLAPH** is a framework that reduces hallucinations and includes crucial claims by using automatic evaluation to select the best response among sampled predictions and by training the model to answer questions in a preferred manner.\n\n![](figures/model_figure.png)\n\n## Updates\n\\[**June 28, 2024**\\] We've got our first citation today! It targets [conformal prediction](https://arxiv.org/abs/2406.09714) using our MedLFQA dataset. Wonderful work from Stanford! \\\n\\[**June 08, 2024**\\] We provide A/B test results from 3 medical experts using 9 [MedPALM](https://arxiv.org/abs/2212.13138) criteria in `Human-Evaluation`. \\\n\\[**May 31, 2024**\\] Introducing two videos: [OLAPH](https://www.youtube.com/watch?v=IQd39sYOprI) (Korean) \u0026 [OLAPH](https://youtu.be/7wZpIWCEAaY) (English) on YouTube! \\\n\\[**May 30, 2024**\\] Updated the code for training and inference with [Gemma-7b](https://huggingface.co/google/gemma-7b), [Llama-3-8b](https://huggingface.co/meta-llama/Meta-Llama-3-8B), and [Llama-3-8b-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). \\\n\\[**May 23, 2024**\\] **OLAPH** has been released.\n\n## Content\n1. [Installation](#installation)\n2. [Quick Usage](#quick-usage)\n3. [Datasets](#datasets)\n4. [Inference](#inference)\n5. [Training](#training)\n6. [Iterative Learning](#iterative-learning)\n7. [FActScore](#factscore)\n8. [FAQ](#faq)\n9. [Citation](#citation)\n10. [Contact Information](#contact-information)\n\n## Installation\nPlease create a conda environment by running the commands below.\nNote that we use two different environments for training and inference.\nWe plan to integrate everything into a single working environment in the future.\n\nFirst, install the dependencies by following [alignment-handbook](https://github.com/huggingface/alignment-handbook/tree/main).\nWe use PyTorch v2.1.2, which is important for reproducibility! 
\\\nSince this depends on your environment, please install a compatible version of PyTorch from [here](https://pytorch.org/get-started/locally/).\n\nThen, install the remaining package dependencies as follows:\n\n```\nconda create -n olaph python=3.10\nconda activate olaph\ncd ./alignment-handbook/\npython -m pip install .\n```\n\nThis may install the most recent version of torch.\nHowever, we use CUDA 11.8 in our experimental settings.\nThus, we recommend running the command below to reproduce our results.\n\n\n\u003c!-- conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia --\u003e\n```\nconda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia\n```\n\nYou will also need Flash Attention 2 installed:\n\n```\npython -m pip install flash-attn==2.5.6 --no-build-isolation\n```\n\nInstall the remaining requirements for automatic evaluation and vllm, which speeds up inference:\n```\npip install -r requirements.txt --no-build-isolation\npip install git+https://github.com/lucadiliello/bleurt-pytorch.git\n```\n\nAlso, you will need to log into your Hugging Face account (make sure your access token has WRITE permission).\nThen, install Git LFS to upload your models as follows:\n\n```\nhuggingface-cli login\nsudo apt-get install git-lfs\n```\n\u003c!-- \nFor training,\n```\nconda env create -f training.yaml\nconda activate olaph_training\n```\n\nFor inference,\n```\nconda env create -f inference.yaml\nconda activate olaph_inference\n``` --\u003e\n\n\n## Quick Usage\nYou can download 7B models trained with our OLAPH framework from the Hugging Face Hub.\n```py\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\nmodel_name = \"dmis-lab/self-biorag-7b-olaph\" # [\"mistralai/Mistral-7B-v0.1\", \"BioMistral/BioMistral-7B\", \"meta-llama/Llama-2-7b-hf\", \"dmis-lab/selfbiorag_7b\", \"epfl-llm/meditron-7b\"]\ndevice = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\nmodel = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)\ntokenizer = AutoTokenizer.from_pretrained(model_name, padding_side=\"left\")\n\nquery = \"Can a red eye be serious?\"\n\ninput_ids = tokenizer.encode(query, return_tensors=\"pt\").to(device)\noutput = model.generate(input_ids, max_length=512, no_repeat_ngram_size=2, do_sample=False, top_p=1.0).to(device)\nresponse = tokenizer.decode(output[0], skip_special_tokens=True).strip()\n\nprint(\"Model prediction: \", response)\n\n# Example output:\n# Yes, a Red Eye can be a sign of a serious condition or a complication of another underlying illness\n# or injury. hopefully, this short guide has helped you understand the different causes of red eyes\n# and how to properly identify and treat them.If you ever have persistent or severe redness, it is\n# important to seek medical attention from a healthcare professional.\n```\n\n## Datasets\n**MedLFQA** reconstructs long-form question-answering (LFQA) benchmark datasets in the biomedical domain into a format that facilitates automatic evaluation.\nWe construct **MedLFQA** from four biomedical LFQA benchmark datasets: [LiveQA](https://github.com/abachaa/LiveQA_MedicalTask_TREC2017), [MedicationQA](https://github.com/abachaa/Medication_QA_MedInfo2019), [HealthSearchQA](https://huggingface.co/datasets/katielink/healthsearchqa), and [K-QA](https://github.com/Itaymanes/K-QA).\nEach **MedLFQA** instance comprises four components: question (Q), long-form answer (A), Must Have Statements (MH), and Nice to Have Statements (NH).\nWe provide the reconstructed datasets for automatic evaluation of long-form generated responses.\n\n## Inference\n\n* Sampling Predictions (Including Automatic Evaluation)\n\n**Note that you have to generate predictions for all MedLFQA datasets before proceeding to SFT and DPO training.**\n\n```\n# For first sampling predictions
\n\nexport DATA_NAME=live_qa\nexport HUGGINGFACE_MODEL_DIR=dmis-lab/selfbiorag_7b\n\nCUDA_VISIBLE_DEVICES=0 python pdata_collection.py \\\n--model_name_or_path ${HUGGINGFACE_MODEL_DIR} \\\n--eval_data ${DATA_NAME}\n```\n\n```\n# Sampling predictions during iterative learning (i.e., after SFT or DPO)\n\nexport HUGGINGFACE_MODEL_DIR=your_trained_model\n\nCUDA_VISIBLE_DEVICES=0 python pdata_collection.py \\\n--model_name_or_path ${HUGGINGFACE_MODEL_DIR} \\\n--eval_data ${DATA_NAME}\n```\n\n```\n# Make the supervised fine-tuning dataset as follows\n\nexport WODATA_NAME=kqa_golden # it must differ from DATA_NAME\n\npython pred_to_sft.py \\\n--model_name_or_path ${HUGGINGFACE_MODEL_DIR} \\\n--wodata_name ${WODATA_NAME}\n```\n\n\n## Training\n\n* Supervised Fine-Tuning (SFT)\n\nAfter we obtain sampled predictions from the previous step, we use SFT so that the model learns to recognize the question-answering task.\nRather than training on human-annotated answers or pseudo-optimal responses generated by GPT-4, we use a self-generated response as the labeled answer, which removes the dependency on annotated datasets.\nWe use Self-BioRAG as a representative 7B model.\nIf you want to use another model with a different configuration, change the recipe paths accordingly.\n\n```\ncd alignment-handbook\n\nCUDA_VISIBLE_DEVICES=0,1,2,3 ACCELERATE_LOG_LEVEL=info accelerate launch \\\n--config_file recipes/accelerate_configs/deepspeed_zero3.yaml \\\n--num_processes 4 \\\nscripts/run_sft.py \\\nrecipes/selfbiorag_7b/sft/config_full.yaml\n```\n\n* Make synthetic preference set based on sampling predictions\n\n```\nexport HUGGINGFACE_MODEL_DIR=your_trained_model\nexport DATA_NAME=kqa_golden\nexport WODATA_NAME=kqa_golden\n\npython pred_to_preference.py \\\n--model_name ${HUGGINGFACE_MODEL_DIR} \\\n--wodata_name ${WODATA_NAME} \\\n--alpha 1.0 \\\n--beta 1.0 \\\n--gamma 1.0 \\\n--threshold 200\n\npython pred_to_preference.py \\\n--model_name ${HUGGINGFACE_MODEL_DIR} \\\n--wodata_name ${WODATA_NAME} \\\n--data_names ${DATA_NAME} \\\n--alpha 1.0 \\\n--beta 1.0 \\\n--gamma 1.0 \\\n--threshold 200\n```\n\n* Direct Preference Optimization (DPO)\n\n```\ncd alignment-handbook\n\nCUDA_VISIBLE_DEVICES=0,1,2,3 ACCELERATE_LOG_LEVEL=info accelerate launch \\\n--config_file recipes/accelerate_configs/deepspeed_zero3.yaml \\\n--num_processes 4 \\\nscripts/run_dpo.py \\\nrecipes/selfbiorag_7b/sft/config_full.yaml\n```\n\n## Iterative Learning\n**Note that you should change two things as follows:**\n\n**1. Change the without-dataset name in scripts/run_sft.py and scripts/run_dpo.py**\n\n**2. Change model_name_or_path in the config file for iterative training**\n\n**We know that iterative learning is cumbersome to follow, and we will try to fix it as soon as possible.**\n\nWe train and generate sampling predictions with separate scripts and repeat this several times.\nIn the future, we will provide a single bash script that runs the whole process.\n\nOur iterative learning consists of the following steps:\n- Sampling predictions (`pdata_collection.py`) - Make SFT set (`pred_to_sft.py`)\n- SFT (`alignment-handbook/sft.sh`) - Sampling predictions (`pdata_collection.py`) - Make preference set (`pred_to_preference.py`)\n- DPO (`alignment-handbook/dpo.sh`) - Sampling predictions (`pdata_collection.py`) - Make preference set (`pred_to_preference.py`)\n- DPO (`alignment-handbook/dpo.sh`) - Sampling predictions (`pdata_collection.py`) - Make preference set (`pred_to_preference.py`)\n- DPO (`alignment-handbook/dpo.sh`)\n\n## FActScore\nWe provide detailed experimental settings and results in [FActScore](Factscore).\n\n## FAQ\n**1. Do you provide the sampling, SFT, and DPO results for each step?**\n\nA. We provide the sampling results of every 7B model in the folder `alignment-handbook/predictions/`.\n\n**2. [A/B Testing] Is the human evaluation for the K-QA dataset open-sourced?**\n\nA. 
We provide the GPT-4 evaluation and the evaluation by 3 medical experts for A/B testing in the `Human-Evaluation` folder.\n\n**3. When using Wikipedia as the knowledge source, it seems the topics need to be titles of the Wikipedia pages. I wonder what topics you use for datasets like K-QA?**\n\nA. We manually extracted biomedical or medical named entities from the questions in the K-QA dataset, as they were intuitively recognizable. If you want to automate this, you could use a named entity recognition model to extract the entities and then normalize them. By doing this, you can construct a knowledge source from retrieved chunks of entities that have corresponding pages on Wikipedia.\n\n**4. Is it possible to share the biomedical knowledge source that you built for Factscore?**\n\nA. Please refer to [Self-BioRAG](https://github.com/dmis-lab/self-biorag) to use our biomedical knowledge source!\n\n## Citation\n```\n@article{jeong2024olaph,\n  title={OLAPH: Improving Factuality in Biomedical Long-form Question Answering},\n  author={Jeong, Minbyul and Hwang, Hyeon and Yoon, Chanwoong and Lee, Taewhoo and Kang, Jaewoo},\n  journal={arXiv preprint arXiv:2405.12701},\n  year={2024}\n}\n```\n\n## Contact Information\nFor help or issues using **MedLFQA** \u0026 **OLAPH**, please submit a GitHub issue. \\\nPlease contact Minbyul Jeong (`minbyuljeong (at) korea.ac.kr`) or Hyeon Hwang (`hyeon-hwang (at) korea.ac.kr`) for communication related to OLAPH.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmis-lab%2Folaph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmis-lab%2Folaph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmis-lab%2Folaph/lists"}