{"id":37080056,"url":"https://github.com/zli12321/qa_metrics","last_synced_at":"2026-01-14T09:42:53.274Z","repository":{"id":218378450,"uuid":"746280574","full_name":"zli12321/qa_metrics","owner":"zli12321","description":"An easy python package to run quick basic QA evaluations. This package includes standardized QA evaluation metrics and semantic evaluation metrics: Black-box and Open-Source large language model prompting and evaluation, exact match, F1 Score, PEDANT semantic match, transformer match. Our package also supports prompting OPENAI and Anthropic API.","archived":false,"fork":false,"pushed_at":"2025-07-18T22:42:40.000Z","size":16971,"stargazers_count":59,"open_issues_count":0,"forks_count":6,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-01-06T18:58:59.570Z","etag":null,"topics":["exact-matching","llm","llm-evaluation","llm-evaluation-framework","llm-evaluation-toolkit","qa-automation-test","reward-modeling","rl-training"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zli12321.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-01-21T15:56:42.000Z","updated_at":"2025-12-12T16:27:35.000Z","dependencies_parsed_at":"2024-01-21T16:31:15.701Z","dependency_job_id":"61cd06e3-3750-4697-a3d8-a1bd46f5117a","html_url":"https://github.com/zli12321/qa_metrics","commit_stats":null,"previous_names":["zli12321/qa_metrics"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zli12321/qa_metrics","repository_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zli12321%2Fqa_metrics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zli12321%2Fqa_metrics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zli12321%2Fqa_metrics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zli12321%2Fqa_metrics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zli12321","download_url":"https://codeload.github.com/zli12321/qa_metrics/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zli12321%2Fqa_metrics/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28416120,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:38:59.149Z","status":"ssl_error","status_checked_at":"2026-01-14T08:38:43.588Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["exact-matching","llm","llm-evaluation","llm-evaluation-framework","llm-evaluation-toolkit","qa-automation-test","reward-modeling","rl-training"],"created_at":"2026-01-14T09:42:52.676Z","updated_at":"2026-01-14T09:42:53.268Z","avatar_url":"https://github.com/zli12321.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# QA-Evaluation-Metrics 
📊\r\n\r\n[![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/) \r\n[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ke23KIeHFdPWad0BModmcWKZ6jSbF5nI?usp=sharing)\r\n\r\n\u003e A fast and lightweight Python package for evaluating question-answering models and prompting black-box and open-source large language models.\r\n\r\n\u003e `pip install qa-metrics` is all you need!\r\n\r\n\u003e 🤗 Huggingface [Model](https://huggingface.co/zli12321/roberta-large-qa-evaluator) and [Dataset](https://huggingface.co/datasets/zli12321/pedants_qa_evaluation_bench)\r\n\r\n## Release Updates\r\n- **Version 0.2.42 Released! (06/20/2025)**\r\n  - RewardBert (ModernBERT base) supports batch score prediction to speed up scoring for RL training.\r\n\r\n- **Version 0.2.35 Released! (06/18/2025)**\r\n  - RewardBert (ModernBERT base) trained to evaluate both short-form and long-form generations.\r\n  - RewardBert outputs a Likert-scale score from 1 to 5 or a normalized score from 0 to 1.\r\n  - Turned off verbose NLTK download logs.\r\n\r\n- **Version 0.2.30 Released!**\r\n  - Enhanced PEDANTS with multi-pipeline support and improved edge case handling\r\n  - Introduced trained tiny-bert for QA evaluation (18MB model size)\r\n  - Added direct Huggingface model download support for TransformerMatcher\r\n\r\n## 🚀 Quick Start\r\n\r\n## Table of Contents\r\n* 1. [RewardBert](#BERT)\r\n* 2. [Normalized Exact Match](#em)\r\n* 3. [Token F1 Score](#f1)\r\n* 4. [PEDANTS](#pedants)\r\n* 5. [Finetuned Neural Matching](#neural)\r\n* 6. 
[Prompting LLM](#llm)\r\n\r\n### Prerequisites\r\n- Python \u003e= 3.6\r\n- openai \u003e= 1.0\r\n\r\n### Installation\r\n```bash\r\npip install qa-metrics\r\n```\r\n\r\n## 💡 Features\r\n\r\nOur package offers six QA evaluation methods with varying strengths:\r\n\r\n| Method | Best For | Cost | Correlation with Human Judgment |\r\n|--------|----------|------|--------------------------------|\r\n| RewardBert | General Text Generations | Free | Very High |\r\n| Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |\r\n| PEDANTS | Both short \u0026 medium-form QA | Free | Very High |\r\n| [Neural Evaluation](https://huggingface.co/zli12321/answer_equivalence_tiny_bert) | Both short \u0026 long-form QA | Free | High |\r\n| [Open Source LLM Evaluation](https://huggingface.co/zli12321/prometheus2-2B) | All QA types | Free | High |\r\n| Black-box LLM Evaluation | All QA types | Paid | Highest |\r\n\r\n\r\n\r\n## 📖 Documentation\r\n\r\n### 1. \u003ca name='BERT'\u003e\u003c/a\u003eRewardBert\r\n\r\n#### Method: `compute_score`\r\n**Parameters**\r\n- `reference_answer` (str): gold (correct) answer to the question\r\n- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated\r\n\r\n**Returns**\r\n- `tuple`: A tuple of normalized and raw scores.\r\n\r\n```python\r\nfrom qa_metrics.RewardBert import RewardBert\r\n\r\nrb = RewardBert(device='cuda')\r\nreference_answer = \"The Frog Prince\"\r\ncandidate_answer = \"The movie \\\"The Princess and the Frog\\\" is loosely based off the Brother Grimm's \\\"Iron Henry\\\"\"\r\nrb.compute_score(reference_answer, candidate_answer)\r\n# (0.29113227128982544, 2.1645290851593018)\r\n```\r\n\r\n\r\n#### Method: `compute_batch_scores`\r\n**Parameters**\r\n- `reference_answers` (list of str): A list of gold (correct) answers to the question\r\n- `candidate_answer` (list of str): A list of answers provided by a candidate that needs to be evaluated\r\n- `batch_size` (int): batch size to 
predict (default 1)\r\n\r\n**Returns**\r\n- `tuple`: A tuple of a list of normalized and raw scores.\r\n\r\n```python\r\nfrom qa_metrics.RewardBert import RewardBert\r\n\r\nrb = RewardBert(device='cuda')\r\nreference_answer = [\"The Frog Prince\"]\r\ncandidate_answer = [\"The movie \\\"The Princess and the Frog\\\" is loosely based off the Brother Grimm's \\\"Iron Henry\\\"\"]\r\nrb.compute_batch_scores(reference_answer, candidate_answer, batch_size=1)\r\n# ([0.29113227128982544], [2.1645290851593018])\r\n```\r\n\r\n\r\n### 2. \u003ca name='em'\u003e\u003c/a\u003eNormalized Exact Match\r\n\r\n#### Method: `em_match`\r\n**Parameters**\r\n- `reference_answer` (list of str): A list of gold (correct) answers to the question\r\n- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated\r\n\r\n**Returns**\r\n- `boolean`: True if there are any exact normalized matches between gold and candidate answers\r\n\r\n```python\r\nfrom qa_metrics.em import em_match\r\n\r\nreference_answer = [\"The Frog Prince\", \"The Princess and the Frog\"]\r\ncandidate_answer = \"The movie \\\"The Princess and the Frog\\\" is loosely based off the Brother Grimm's \\\"Iron Henry\\\"\"\r\nmatch_result = em_match(reference_answer, candidate_answer)\r\n```\r\n\r\n### 3. 
\u003ca name='f1'\u003e\u003c/a\u003eF1 Score\r\n\r\n#### Method: `f1_score_with_precision_recall`\r\n**Parameters**\r\n- `reference_answer` (str): A gold (correct) answer to the question\r\n- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated\r\n\r\n**Returns**\r\n- `dictionary`: Contains the F1 score, precision, and recall between a gold and candidate answer\r\n\r\n#### Method: `f1_match`\r\n**Parameters**\r\n- `reference_answer` (list of str): List of gold answers\r\n- `candidate_answer` (str): Candidate answer to evaluate\r\n- `threshold` (float): F1 score threshold for considering a match (default: 0.5)\r\n\r\n**Returns**\r\n- `boolean`: True if F1 score exceeds threshold for any gold answer\r\n\r\n```python\r\nfrom qa_metrics.f1 import f1_match, f1_score_with_precision_recall\r\n\r\nf1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)\r\nmatch_result = f1_match(reference_answer, candidate_answer, threshold=0.5)\r\n```\r\n\r\n### 4. 
\u003ca name='pedants'\u003e\u003c/a\u003ePEDANTS\r\n\r\n#### Method: `get_score`\r\n**Parameters**\r\n- `reference_answer` (str): A gold answer\r\n- `candidate_answer` (str): Candidate answer to evaluate\r\n- `question` (str): The question being evaluated\r\n\r\n**Returns**\r\n- `float`: The similarity score between two strings (0 to 1)\r\n\r\n#### Method: `get_highest_score`\r\n**Parameters**\r\n- `reference_answer` (list of str): List of gold answers\r\n- `candidate_answer` (str): Candidate answer to evaluate\r\n- `question` (str): The question being evaluated\r\n\r\n**Returns**\r\n- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score\r\n\r\n#### Method: `get_scores`\r\n**Parameters**\r\n- `reference_answer` (list of str): List of gold answers\r\n- `candidate_answer` (str): Candidate answer to evaluate\r\n- `question` (str): The question being evaluated\r\n\r\n**Returns**\r\n- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs\r\n\r\n#### Method: `evaluate`\r\n**Parameters**\r\n- `reference_answer` (list of str): List of gold answers\r\n- `candidate_answer` (str): Candidate answer to evaluate\r\n- `question` (str): The question being evaluated\r\n\r\n**Returns**\r\n- `boolean`: True if candidate answer matches any gold answer\r\n\r\n#### Method: `get_question_type`\r\n**Parameters**\r\n- `reference_answer` (list of str): List of gold answers\r\n- `question` (str): The question being evaluated\r\n\r\n**Returns**\r\n- `list`: The type of the question (what, who, when, how, why, which, where)\r\n\r\n#### Method: `get_judgement_type`\r\n**Parameters**\r\n- `reference_answer` (list of str): List of gold answers\r\n- `candidate_answer` (str): Candidate answer to evaluate\r\n- `question` (str): The question being evaluated\r\n\r\n**Returns**\r\n- `list`: A list of the revised rules applicable to judging answer correctness\r\n\r\n```python\r\nfrom qa_metrics.pedant import PEDANT\r\n\r\npedant = 
PEDANT()\r\nscores = pedant.get_scores(reference_answer, candidate_answer, question)\r\nmatch_result = pedant.evaluate(reference_answer, candidate_answer, question)\r\n```\r\n\r\n### 5. \u003ca name='neural'\u003e\u003c/a\u003eTransformer Neural Evaluation\r\n\r\n#### Method: `get_score`\r\n**Parameters**\r\n- `reference_answer` (str): A gold answer\r\n- `candidate_answer` (str): Candidate answer to evaluate\r\n- `question` (str): The question being evaluated\r\n\r\n**Returns**\r\n- `float`: The similarity score between two strings (0 to 1)\r\n\r\n#### Method: `get_highest_score`\r\n**Parameters**\r\n- `reference_answer` (list of str): List of gold answers\r\n- `candidate_answer` (str): Candidate answer to evaluate\r\n- `question` (str): The question being evaluated\r\n\r\n**Returns**\r\n- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score\r\n\r\n#### Method: `get_scores`\r\n**Parameters**\r\n- `reference_answer` (list of str): List of gold answers\r\n- `candidate_answer` (str): Candidate answer to evaluate\r\n- `question` (str): The question being evaluated\r\n\r\n**Returns**\r\n- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs\r\n\r\n#### Method: `transformer_match`\r\n**Parameters**\r\n- `reference_answer` (list of str): List of gold answers\r\n- `candidate_answer` (str): Candidate answer to evaluate\r\n- `question` (str): The question being evaluated\r\n\r\n**Returns**\r\n- `boolean`: True if the transformer model considers the candidate answer equivalent to any gold answer\r\n\r\n```python\r\nfrom qa_metrics.transformerMatcher import TransformerMatcher\r\n\r\n# Also supports `zli12321/roberta-large-qa-evaluator`, `zli12321/answer_equivalence_bert`, `zli12321/answer_equivalence_distilbert`, `zli12321/answer_equivalence_roberta`, `zli12321/answer_equivalence_distilroberta`\r\ntm = TransformerMatcher(\"zli12321/answer_equivalence_tiny_bert\")\r\nmatch_result = 
tm.transformer_match(reference_answer, candidate_answer, question)\r\n```\r\n\r\n### 6. \u003ca name='llm'\u003e\u003c/a\u003eLLM Integration\r\n\r\n#### Method: `prompt_gpt`\r\n**Parameters**\r\n- `prompt` (str): The input prompt text\r\n- `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')\r\n- `temperature` (float): Controls randomness (0-1)\r\n- `max_tokens` (int): Maximum tokens in response\r\n\r\n```python\r\nfrom qa_metrics.prompt_llm import CloseLLM\r\n\r\nmodel = CloseLLM()\r\nmodel.set_openai_api_key(YOUR_OPENAI_KEY)\r\nresult = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')\r\n```\r\n\r\n#### Method: `prompt_claude`\r\n**Parameters**\r\n- `prompt` (str): The input prompt text\r\n- `model_engine` (str): Claude model to use\r\n- `anthropic_version` (str): API version\r\n- `max_tokens_to_sample` (int): Maximum tokens in response\r\n- `temperature` (float): Controls randomness (0-1)\r\n\r\n```python\r\nmodel = CloseLLM()\r\nmodel.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)\r\nresult = model.prompt_claude(prompt=prompt, model_engine='claude-v1')\r\n```\r\n\r\n#### Method: `prompt`\r\n**Parameters**\r\n- `message` (str): The input message text\r\n- `model_engine` (str): Model to use\r\n- `temperature` (float): Controls randomness (0-1)\r\n- `max_tokens` (int): Maximum tokens in response\r\n\r\n```python\r\nfrom qa_metrics.prompt_open_llm import OpenLLM\r\n\r\nmodel = OpenLLM()\r\nmodel.set_deepinfra_key(YOUR_DEEPINFRA_KEY)\r\nresult = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')\r\n```\r\n\r\n## 🤗 Model Hub\r\n\r\nOur fine-tuned models are available on Huggingface:\r\n- [BERT](https://huggingface.co/Zongxia/answer_equivalence_bert)\r\n- [DistilRoBERTa](https://huggingface.co/Zongxia/answer_equivalence_distilroberta)\r\n- [DistilBERT](https://huggingface.co/Zongxia/answer_equivalence_distilbert)\r\n- [RoBERTa](https://huggingface.co/Zongxia/answer_equivalence_roberta)\r\n- 
[Tiny-BERT](https://huggingface.co/Zongxia/answer_equivalence_tiny_bert)\r\n- [RoBERTa-Large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large)\r\n\r\n## 📚 Resources\r\n\r\n- [Full Paper](https://arxiv.org/abs/2402.11161)\r\n- [Dataset Repository](https://github.com/zli12321/Answer_Equivalence_Dataset.git)\r\n- [Supported Models on Deepinfra](https://deepinfra.com/models)\r\n\r\n## 📄 Citation\r\n\r\n```bibtex\r\n@misc{li2024pedantscheapeffectiveinterpretable,\r\n      title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence}, \r\n      author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},\r\n      year={2024},\r\n      eprint={2402.11161},\r\n      archivePrefix={arXiv},\r\n      primaryClass={cs.CL},\r\n      url={https://arxiv.org/abs/2402.11161}, \r\n}\r\n```\r\n\r\n## 📝 License\r\n\r\nThis project is licensed under the [MIT License](LICENSE.md).\r\n\r\n## 📬 Contact\r\n\r\nFor questions or comments, please contact: zli12321@umd.edu\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzli12321%2Fqa_metrics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzli12321%2Fqa_metrics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzli12321%2Fqa_metrics/lists"}