{"id":22843478,"url":"https://github.com/vectara/mirage-bench","last_synced_at":"2026-02-27T19:43:44.963Z","repository":{"id":258644135,"uuid":"858754382","full_name":"vectara/mirage-bench","owner":"vectara","description":"Repository for Multililngual Generation, RAG evaluations, and surrogate judge training for Arena RAG leaderboard (NAACL'25)","archived":false,"fork":false,"pushed_at":"2025-04-10T15:32:13.000Z","size":2940,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-28T14:21:50.547Z","etag":null,"topics":["anyscale-endpoint","arena","azure-api","claude-api","cohere-api","evaluation-framework","gemini-api","llm-inference","openai-api","rag","retrieval-augmented-generation","vllm"],"latest_commit_sha":null,"homepage":"https://mirage-bench.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vectara.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-09-17T13:30:59.000Z","updated_at":"2025-05-09T02:44:11.000Z","dependencies_parsed_at":"2025-04-10T16:22:07.406Z","dependency_job_id":"9eb54afc-56da-4a27-b956-42250b3b3c08","html_url":"https://github.com/vectara/mirage-bench","commit_stats":null,"previous_names":["vectara/mirage-bench"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/vectara/mirage-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectara%2Fmirage-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/v
ectara%2Fmirage-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectara%2Fmirage-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectara%2Fmirage-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vectara","download_url":"https://codeload.github.com/vectara/mirage-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectara%2Fmirage-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29911067,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-27T19:37:42.220Z","status":"ssl_error","status_checked_at":"2026-02-27T19:37:41.463Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anyscale-endpoint","arena","azure-api","claude-api","cohere-api","evaluation-framework","gemini-api","llm-inference","openai-api","rag","retrieval-augmented-generation","vllm"],"created_at":"2024-12-13T02:15:00.120Z","updated_at":"2026-02-27T19:43:44.946Z","avatar_url":"https://github.com/vectara.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!--- BADGES: START ---\u003e\n[![Website](https://img.shields.io/website?url=https%3A%2F%2Fmirage-bench.github.io%2F)][#website]\n[![HF 
Datasets](https://img.shields.io/badge/%F0%9F%A4%97-datasets-yellow)][#huggingface]\n[![GitHub - License](https://img.shields.io/github/license/vectara/mirage-bench?logo=github\u0026style=flat\u0026color=green)][#github-license]\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mirage-bench?logo=pypi\u0026style=flat\u0026color=blue)][#pypi-package]\n[![PyPI - Package Version](https://img.shields.io/pypi/v/mirage-bench?logo=pypi\u0026style=flat\u0026color=orange)][#pypi-package]\n[![YouTube Video](https://img.shields.io/youtube/views/usvu6Sk1ynk?logo=youtube\u0026style=flat\u0026color=red)][#youtube]\n\n\n[#github-license]: https://github.com/vectara/mirage-bench/blob/master/LICENSE\n[#pypi-package]: https://pypi.org/project/mirage-bench/\n[#youtube]: https://www.youtube.com/watch?v=usvu6Sk1ynk\u0026t=2655s\n[#huggingface]: https://huggingface.co/collections/nthakur/mirage-bench-naacl25-67ddb6166a7938a37436a455\n[#website]: https://mirage-bench.github.io/\n\u003c!--- BADGES: END ---\u003e\n\n# Benchmarking LLM Generation in Multilingual RAG \n\n\u003ca href=\"http://www.youtube.com/watch?feature=player_embedded\u0026v=jV8Mkx5zjaM\n\" target=\"_blank\"\u003e\u003cimg src=\"./images/mirage-bench-teaser.png\" \nalt=\"IMAGE ALT TEXT HERE\" width=\"600\" height=\"270\" border=\"5\" /\u003e\u003c/a\u003e\n\nThis repository provides an easy way to achieve the following four objectives:\n\n1. Generate RAG-based answers to multilingual questions, with support for many open-source LLMs integrated via [vLLM](https://docs.vllm.ai/en/latest/getting_started/installation.html), as well as closed-source LLMs through APIs such as Azure OpenAI, Cohere, Anthropic, etc.\n2. Evaluate multilingual RAG answers based on a variety of heuristic features (e.g., support, fluency) or automatic evaluations using open-source LLMs supported in vLLM.\n3. 
Conduct an LLM-as-a-Judge design to compare pairwise multilingual RAG answers and train a Bradley-Terry model (with bootstrapping) to build an offline multilingual RAG arena.\n4. Train a surrogate judge (linear regression) to learn from and bootstrap the expensive LLM-as-a-Judge approach using heuristic features.\n\nFor more information, check out our publication:\n- [MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems](https://arxiv.org/abs/2410.13716) (Accepted at NAACL 2025 Main Conference :star:)\n\n## Installation\n\nWe recommend **Python 3.9+** and installing the latest version of **[vLLM](https://docs.vllm.ai/en/latest/getting_started/installation.html)**.\n\n**Install with pip:**\n\n```bash\npip install -U mirage-bench\n```\n\n**Install from sources**\n\nAlternatively, you can also clone the latest version from the [repository](https://github.com/vectara/mirage-bench) and install it directly from the source code:\n\n```bash\npip install -e .\n```\n\n## Datasets\n\n| Resource | Description |\n|:---------|:------------|\n| :hugs: [mirage-bench](https://huggingface.co/datasets/nthakur/mirage-bench) | All queries \u0026 input prompts available in MIRAGE-Bench |\n| :hugs: [mirage-bench-output](https://huggingface.co/datasets/nthakur/mirage-bench-output) | Pre-computed RAG answers and all feature scores for 21 models |\n| :hugs: [mirage-bench-pairwise-judgments](https://huggingface.co/datasets/nthakur/mirage-bench-pairwise-judgments) | Pairwise judgments using GPT-4o LLM judge across all 19 models |\n\n## Getting Started\n\nMake sure you have the latest **[vLLM](https://docs.vllm.ai/en/latest/getting_started/installation.html)** installed correctly.\n\n### 1. 
Multilingual RAG Answer Generation\n\nGenerate the RAG answer for given multilingual queries in mirage-bench using an LLM model.\n\u003e Similarly, you can even generate answers with HF models on single/multiple GPU instances with [vLLM](https://github.com/vectara/mirage-bench/blob/main/examples/generation/vllm_generation.py).\n\n```python\n# export AZURE_OPENAI_ENDPOINT=\"xxxxx\"\n# export AZURE_OPENAI_API_KEY=\"xxxx\"\n\nfrom mirage_bench import util\nfrom mirage_bench.generate import AzureOpenAIClient\n\n# Many other clients also available, e.g., Cohere or Anthropic\nclient = AzureOpenAIClient(model_name_or_path=\"gpt-4o-mini\")\n\n### Prompts_dict contains query_id as key and prompt as value\nprompts_dict = util.load_prompts(\n    dataset_name=\"nthakur/mirage-bench\", \n    language_code=\"en\", # or \"ar\", \"bn\" ... 18 languages supported\n    split=\"dev\" # only dev split is available in mirage-bench\n) \nquery_ids = list(prompts_dict.keys())\noutputs = client.batch_call(\n    prompts=list(prompts_dict.values()),\n    temperature=0.1,\n    max_new_tokens=2048,\n)\n#### output contains the List of RAG outputs\n# [\"##Reason: Passage [] provides reasoning ... ##Answer: Therefore answer is X\"]\n```\n\n### 2. 
Heuristic \\\u0026 Automatic RAG Evaluation\n\nAfter generating RAG answers, we evaluate the quality of the response using heuristic features:\n\n```python\nfrom mirage_bench import util\nfrom mirage_bench.evaluate import RougeBleuEvaluator\n\nevaluator = RougeBleuEvaluator(language_code=\"en\")\n\n# Load the documents (relevant \u0026 non-relevant)\ndocuments = util.load_documents(\n    dataset_name=\"nthakur/mirage-bench\", \n    language_code=\"en\", \n    split=\"dev\"\n)\n\n# Load the multilingual RAG predictions available for 20+ models.\n# In this example, we are evaluating: meta-llama/Meta-Llama-3-8B-Instruct\npredictions = util.load_predictions(\n    dataset_name=\"nthakur/mirage-bench-output\",\n    language_code=\"en\",\n    split=\"dev\",\n    model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n)\n\n# Need to load the reference model, i.e., ground_truth predictions\n# This step is not necessary in all heuristic features\nreference_predictions = util.load_predictions(\n    dataset_name=\"nthakur/mirage-bench-output\",\n    language_code=\"en\",\n    split=\"dev\",\n    model_name=\"gpt-4-azure\",\n)\n\n# Evaluate the predictions\nscores = evaluator.evaluate(\n    predictions=predictions, \n    reference_predictions=reference_predictions, \n    documents=documents\n)\n# =\u003e query_id: {\"answer_bleu\": 0.9, \"answer_rougeL\": 0.75}\n```\n\n### 3. 
LLM-as-a-Judge Pairwise Evaluation\n\nAfter generating RAG answers, we can also use an LLM as a judge to compare two RAG outputs and decide which output is better.\n\n```python\nfrom mirage_bench import util\nfrom mirage_bench.evaluate import PairwiseLLMJudgeEvaluator\n\nevaluator = PairwiseLLMJudgeEvaluator(\n    client=\"azure_openai\",\n    model_name_or_path=\"gpt-4o-mini\"\n)\n\n# Load the documents (relevant \u0026 non-relevant)\ndocuments = util.load_documents(\n    dataset_name=\"nthakur/mirage-bench\", \n    language_code=\"en\", \n    split=\"dev\"\n)\nqueries = util.load_queries(\n    dataset_name=\"nthakur/mirage-bench\", \n    language_code=\"en\", \n    split=\"dev\"\n)\n\n# In this example we will evaluate two models:\nmodels = [\n    \"meta-llama/Meta-Llama-3-8B-Instruct\",\n    \"meta-llama/Meta-Llama-3-70B-Instruct\"\n]\n\n# Map each model name to its pre-computed predictions\npredictions = {}\nfor model_name in models:\n    predictions[model_name] = util.load_predictions(\n        dataset_name=\"nthakur/mirage-bench-output\",\n        language_code=\"en\",\n        split=\"dev\",\n        model_name=model_name,\n    )\n\nscores = evaluator.evaluate(\n    predictions=predictions,\n    all_model_names=models, # provide all model names\n    documents=documents,\n    queries=queries\n)\n# IMP: model_A and model_B are randomly switched\n# =\u003e [{\"query_id\": 1, \n#      \"judge\": \"gpt-4o-mini\", \n#      \"model_A\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \n#      \"model_B\": \"meta-llama/Meta-Llama-3-70B-Instruct\", \n#      \"output\": \u003cjudge_output\u003e,\n#      \"verdict\": A/B/Tie.\n#    }]\n```\n\n## Application Examples\n\nYou can use this framework for:\n\n- [Multilingual RAG Generation](https://github.com/vectara/mirage-bench/tree/main/examples/generation)\n- [Heuristic RAG Evaluations](https://github.com/vectara/mirage-bench/tree/main/examples/heuristic_evals)\n- [Arena RAG Evaluations](https://github.com/vectara/mirage-bench/tree/main/examples/arena_evals)\n- [Surrogate Judge Training \\\u0026 
Inference](https://github.com/vectara/mirage-bench/tree/main/examples/surrogate_judge)\n\n## Citing \u0026 Authors\n\nThis work was done in a collaboration between Vectara and University of Waterloo.\n\nIf you find this repository helpful, feel free to cite our publication [MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems](https://arxiv.org/abs/2410.13716):\n\n```bibtex \n@article{thakur-mirage-bench:2024,\n  author       = {Nandan Thakur and\n                  Suleman Kazi and\n                  Ge Luo and\n                  Jimmy Lin and\n                  Amin Ahmad},\n  title        = {MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented\n                  Generation Systems},\n  journal      = {CoRR},\n  volume       = {abs/2410.13716},\n  year         = {2024},\n  url          = {https://doi.org/10.48550/arXiv.2410.13716},\n  doi          = {10.48550/ARXIV.2410.13716},\n  eprinttype    = {arXiv},\n  eprint       = {2410.13716},\n  timestamp    = {Wed, 27 Nov 2024 09:01:16 +0100},\n  biburl       = {https://dblp.org/rec/journals/corr/abs-2410-13716.bib},\n  bibsource    = {dblp computer science bibliography, https://dblp.org}\n}\n```\n\nMaintainer: [Nandan Thakur](https://github.com/thakur-nandan), PhD Student @ University of Waterloo\n\nDon't hesitate to open an issue if something is broken (and it shouldn't be) or if you have further questions.\n\n\u003e This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvectara%2Fmirage-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvectara%2Fmirage-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvectara%2Fmirage-bench/lists"}