{"id":23587659,"url":"https://github.com/AkariAsai/ScholarQABench","last_synced_at":"2025-08-30T04:31:23.687Z","repository":{"id":263592280,"uuid":"890883337","full_name":"AkariAsai/ScholarQABench","owner":"AkariAsai","description":"This repository contains ScholarQABench data and evaluation pipeline. ","archived":false,"fork":false,"pushed_at":"2025-08-10T01:35:39.000Z","size":25590,"stargazers_count":77,"open_issues_count":3,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-10T03:21:39.964Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://allenai.org/blog/openscholar","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AkariAsai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-19T10:53:08.000Z","updated_at":"2025-08-10T01:35:43.000Z","dependencies_parsed_at":"2025-08-10T03:13:31.089Z","dependency_job_id":"ecf89d5a-933f-47b6-b9c5-598fe936c314","html_url":"https://github.com/AkariAsai/ScholarQABench","commit_stats":null,"previous_names":["akariasai/scholarbench"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AkariAsai/ScholarQABench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AkariAsai%2FScholarQABench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AkariAsai%2FScholarQABench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AkariAsai%2FScholarQABench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/AkariAsai%2FScholarQABench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AkariAsai","download_url":"https://codeload.github.com/AkariAsai/ScholarQABench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AkariAsai%2FScholarQABench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272805294,"owners_count":24995909,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-30T02:00:09.474Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-27T05:01:27.290Z","updated_at":"2025-08-30T04:31:23.676Z","avatar_url":"https://github.com/AkariAsai.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# ScholarQABench\n\nThis repository contains the **ScholarQABench** evaluation scripts and data, which provide a holistic evaluation platform for testing LLMs' ability to assist researchers with scientific literature synthesis. This work is part of the [OpenScholar](https://github.com/AkariAsai/OpenScholar) project. See [our manuscript](https://openscholar.allen.ai/paper) for details. \n\n![scholar bench overview](scholarqabench.png)\n\n**Table of Contents**\n\n1. [Overview and Installation](#overview-and-installation)\n2. [Evaluations](#evaluations)\n3. 
[License](#license)\n4. [Notes](#notes)\n5. [Citations](#citations)\n\n\n## Overview and Installation\n### Setup\n\n```bash\nconda create -n sb_env python=3.10.0\nconda activate sb_env\npip install -r requirements.txt\npython -m nltk.downloader punkt_tab\n```\n\n### Repository Organization\n\n- [data](data/): provides relevant data files. \n    - [scholar_cs](data/scholarqa_cs): includes **ScholarQA-CS** (Computer Science) data files\n        - `output_snippets.jsonl`: Contains the questions with system responses for eval (and some other metadata used to generate the test cases but not required for subsequent runs). Should not require further modification.\n        - `test_configs_snippets.json`: A collection of test cases in JSON format with associated rubrics for each question. Each question has its own test case and rubrics. Should not require further modification.\n        - `qa_metadata_all.jsonl`: Metadata file that was used to bootstrap this utility. Should not require further modification.\n        - `src_answers`: Directory containing sample system responses from 4 systems.\n\n    - [scholar_multi](scholar_multi): includes **ScholarQA-Multi** (Multi-domain; CS, Bio and Physics) data files\n    - [scholar_bio](scholar_bio): includes **ScholarQA-Bio** (Biomedicine) data files\n    - [scholar_neuro](scholar_neuro): includes **ScholarQA-Neuro** (Neuroscience) data files\n    - [single_paper_tasks](single_paper_tasks): includes **SciFact**, **PubMedQA** and **QASA**. \n        - `scifact_test.jsonl`: a JSON Lines file containing the SciFact task data. Each item consists of `input` (original claim), `answer` (answer label, `true` or `false`), and `gold_ctx` (a list of dictionaries, each with `title` and `text`). `gold_ctx` is not used during evaluation except for oracle gold context evaluation. \n        - `pubmed_test.jsonl`: a JSON Lines file containing the PubMedQA task data. 
Each item consists of `input` (original question), `answer` (answer label, `yes` or `no`), and `gold_ctx` (a list of dictionaries, each with `title` and `text`). `gold_ctx` is not used during evaluation except for oracle gold context evaluation. \n        - `qasa_test.jsonl`: a JSON Lines file containing the QASA task data. Each item consists of `input` (question), `answer` (long-form answer), `ctxs` (a list of dictionaries for the full paper data) and `gold_ctxs` (a list of dictionaries, each with `title` and `text`, for the gold contexts). `gold_ctxs` is not used during evaluation except for oracle gold context evaluation. \n- [scripts](scripts/): provides evaluation scripts for each evaluation aspect\n    - [rubric_eval.py](scripts/rubric_eval.py): a script to run rubric-based evaluations for **ScholarQA-CS**. \n    - [citation_correctness_eval.py](scripts/citation_correctness_eval.py): a script to run citation evaluations as well as string-matching-based correctness evaluations for single-paper tasks. \n    - [prometheus_eval.py](scripts/prometheus_eval.py): a script to evaluate *organization*, *relevance* and *coverage*, based on five-point rubrics in [rubrics](rubrics/prometheus_rubrics_v8.json)\n- [rubrics](rubrics): rubrics for `prometheus_eval`. \n\nThe availability of annotations and an overview of them are summarized below. Note that ScholarQABench does not provide training data. 
\n\n| Dataset    | Input | Output | Label Available | Evaluation Metrics |\n| :-------- | :-------: |:-------: | :-------: | :-------: |\n| `ScholarQA-SciFact`  |  claim   | `true` or `false` |  ✅ | `accuracy`, `citations_short` |\n| `ScholarQA-PubmedQA`  |  question   | `yes` or `no` | ✅  | `accuracy`, `citations_short` |\n| `ScholarQA-QASA`  |  question   | long-form |  ✅ | `rouge-l`, `citations` |\n| `ScholarQA-CS`  |  question   | long-form |  ✅ (rubrics) | `rubrics`, `citations` |\n| `ScholarQA-Multi`  |  question   | long-form |  ✅ | `prometheus`, `citations` |\n| `ScholarQA-Bio`  |  question   | long-form |  | `citations` |\n| `ScholarQA-Neuro`  |  question   | long-form | |  `citations` |\n\n\n\n## Evaluations\nAfter running model inference, run the evaluations for each task and aspect using the following scripts. \n\nYour answer files for all tasks are expected to be JSON files in the following format: \n```\n[\n{\"input\": query (str), \"output\": final_model_output (str), \"ctxs\": citations (list of dict, where each dict has text)}, ...\n]\n```\n\n### Citation Accuracy (All tasks) \n\n#### Short-form generation (SciFact, PubMedQA)\n\n```\npython scripts/citation_correctness_eval.py --f PATH_TO_YOUR_PREDICTION_FILE --citations_short\n```\n\n\n#### Long-form generation (QASA, ScholarQA-*)\n\n```\npython scripts/citation_correctness_eval.py --f PATH_TO_YOUR_PREDICTION_FILE --citations\n```\n\n\n### String-based Correctness (SciFact, PubMedQA, QASA)\nTo run string-matching-based evaluations, run the following commands:\n\n#### SciFact and PubMedQA (accuracy)\n\n```\npython scripts/citation_correctness_eval.py --f PATH_TO_YOUR_PREDICTION_FILE --match\n```\n\n#### QASA (ROUGE-L)\n\n```\npython scripts/citation_correctness_eval.py --f PATH_TO_YOUR_PREDICTION_FILE\n```\n\n\n### Rubric-based Correctness (ScholarQA-CS)\n\n#### Convert the output file\nTo run eval for your system, first set up the prediction file with the system answers to be evaluated as per 
the following requirements:\n\nA JSON Lines (`.jsonl`) file with the fields `case_id` and `answer_text` (see the [example file](https://github.com/allenai/multidoc_qa_eval/blob/main/data/src_answers/gpt.jsonl)):\n\n- `case_id` is the identifier of the question for which the response is to be evaluated. Map each question's text to its `case_id` using `test_configs_snippets.json`.\n- `answer_text` is the system answer (along with citations and excerpts, if applicable) in plain text\n\n\n#### Run evaluation \nOnce the prediction file is ready, save it in a new directory (you can save as many system response files as you like under one directory; they will all be picked up for evaluation) and run the eval script as follows:\n\n```bash\nexport OPENAI_API_KEY=\u003copenai key\u003e\npython scripts/rubric_eval.py \\\n    --qa-dir data/scholarqa_cs/src_answers \\\n    --test-config data/scholarqa_cs/test_configs_snippets.json \\\n    --rubrics --snippets \\\n    --src-names \u003coptional comma-separated src names prefixes of prediction files with .jsonl, if not given all the files will be picked\u003e\n```\n**Note:** To evaluate using only rubrics, remove the `--snippets` flag; to evaluate using only snippets, remove `--rubrics` instead. 
\n\n**Acknowledgements:** The original code of ScholarQA-CS is available at [allenai/multidoc_qa_eval](https://github.com/allenai/multidoc_qa_eval).\n\n### Prometheus for Coverage, Relevance and Organization (ScholarQA-Multi)\n\n#### Coverage and Organization\n\n```\npython scripts/prometheus_eval.py \\\n    --batch_process_dir YOUR_PREDICTION_FILE_PATH \\\n    --output_path OUTPUT_DIR_NAME \\\n    --rubric_path rubrics/prometheus_rubrics_v8.json \\\n    --instruction \"Answer the question related to the most recent scientific literature.\" \\\n    --model prometheus-eval/prometheus-bgb-8x7b-v2.0 \\\n    --load_vllm \\\n    --top_n 10 \\\n    -f data/scholarqa_multi/human_answers.json \\\n    --aspects organization coverage relevance\n```\n\n#### Relevance \nWe use `prometheus-eval/prometheus-8x7b-v2.0` for relevance because `prometheus-eval/prometheus-bgb-8x7b-v2.0` performs unstably on this aspect. \n\n```\npython scripts/prometheus_eval.py \\\n    --batch_process_dir YOUR_PREDICTION_FILE_PATH \\\n    --output_path OUTPUT_DIR_NAME \\\n    --rubric_path rubrics/prometheus_rubrics_v8.json \\\n    --instruction \"Answer the question related to the most recent scientific literature.\" \\\n    --model prometheus-eval/prometheus-8x7b-v2.0 \\\n    --load_vllm \\\n    --top_n 10 \\\n    -f data/scholar_multi/human_answers.json \\\n    --aspects relevance\n```\n\n## License\nThe aggregate test cases, sample system answers under `data/src_answers` and other files under the `data` directory are released under the [ODC-BY](https://opendatacommons.org/licenses/by/1.0/) license. \nBy downloading this data you acknowledge that you have read and agreed to all the terms in this license.\nFor constituent datasets, also review the individual licensing requirements, as applicable. \n\n## Notes \n- The reference rubrics and expert-written answers were created by domain experts between July and September 2024. 
While we do not discourage the use of retrieval systems that surface papers published after this period, please note that such systems may access information unavailable to the original annotators, potentially affecting comparability with the provided references.\n\n## Citations\n```\n@article{openscholar,\n  title={{OpenScholar}: Synthesizing Scientific Literature with Retrieval-Augmented Language Models},\n  author={Asai, Akari and He*, Jacqueline and Shao*, Rulin and Shi, Weijia and Singh, Amanpreet and Chang, Joseph Chee and Lo, Kyle and Soldaini, Luca and Feldman, Sergey and D’Arcy, Mike and Wadden, David and Latzke, Matt and Tian, Minyang and Ji, Pan and Liu, Shengyan and Tong, Hao and Wu, Bohao and Xiong, Yanyu and Zettlemoyer, Luke and Weld, Dan and Neubig, Graham and Downey, Doug and Yih, Wen-tau and Koh, Pang Wei and Hajishirzi, Hannaneh},\n  journal={Arxiv},\n  year={2024},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAkariAsai%2FScholarQABench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAkariAsai%2FScholarQABench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAkariAsai%2FScholarQABench/lists"}