{"id":34683711,"url":"https://github.com/nvidia-nemo/evaluator","last_synced_at":"2026-04-16T11:03:46.144Z","repository":{"id":309062885,"uuid":"1008648645","full_name":"NVIDIA-NeMo/Evaluator","owner":"NVIDIA-NeMo","description":"Open-source library for scalable, reproducible evaluation of AI models and benchmarks.","archived":false,"fork":false,"pushed_at":"2026-03-02T01:38:25.000Z","size":13184,"stargazers_count":207,"open_issues_count":36,"forks_count":26,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-03-02T05:17:09.899Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://docs.nvidia.com/nemo/evaluator/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA-NeMo.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-25T21:58:37.000Z","updated_at":"2026-03-02T01:37:31.000Z","dependencies_parsed_at":"2025-10-06T03:13:35.279Z","dependency_job_id":"d0de2c26-1726-4a8b-b446-d122e81ca74b","html_url":"https://github.com/NVIDIA-NeMo/Evaluator","commit_stats":null,"previous_names":["nvidia-nemo/eval","nvidia-nemo/evaluator"],"tags_count":197,"template":false,"template_full_name":null,"purl":"pkg:github/NVIDIA-NeMo/Evaluator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FEvaluator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FEvaluator/tags","re
leases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FEvaluator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FEvaluator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA-NeMo","download_url":"https://codeload.github.com/NVIDIA-NeMo/Evaluator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-NeMo%2FEvaluator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30107651,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T01:39:18.192Z","status":"online","status_checked_at":"2026-03-05T02:00:06.710Z","response_time":93,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-24T21:37:02.244Z","updated_at":"2026-04-16T11:03:46.136Z","avatar_url":"https://github.com/NVIDIA-NeMo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NeMo Evaluator SDK\n\n\u003e [!NOTE]\n\u003e **Preview: NeMo Evaluator 0.3.0** — A ground-up rewrite with a unified `nel` CLI, pluggable environment architecture, and built-in agentic eval support is available on the [`dev/0.3.0`](https://github.com/NVIDIA-NeMo/Evaluator/tree/dev/0.3.0) branch. 
Feedback welcome via [Issues](https://github.com/NVIDIA-NeMo/Evaluator/issues).\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/LICENSE)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-green)](https://www.python.org/downloads/)\n[![Tests](https://github.com/NVIDIA-NeMo/Evaluator/actions/workflows/cicd-main.yml/badge.svg)](https://github.com/NVIDIA-NeMo/Evaluator/actions/workflows/cicd-main.yml)\n[![codecov](https://codecov.io/github/NVIDIA-NeMo/Evaluator/graph/badge.svg)](https://codecov.io/github/NVIDIA-NeMo/Evaluator)\n[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)\n[![nemo-evaluator PyPI version](https://img.shields.io/pypi/v/nemo-evaluator.svg)](https://pypi.org/project/nemo-evaluator/)\n[![nemo-evaluator PyPI downloads](https://img.shields.io/pypi/dm/nemo-evaluator.svg)](https://pypi.org/project/nemo-evaluator/)\n[![nemo-evaluator-launcher PyPI version](https://img.shields.io/pypi/v/nemo-evaluator-launcher.svg)](https://pypi.org/project/nemo-evaluator-launcher/)\n[![nemo-evaluator-launcher PyPI downloads](https://img.shields.io/pypi/dm/nemo-evaluator-launcher.svg)](https://pypi.org/project/nemo-evaluator-launcher/)\n[![Project Status](https://img.shields.io/badge/Status-Production%20Ready-green)](#)\n\n## 🆕 What's New in 26.01 Release\n\n### Telemetry\n- Anonymous telemetry to help improve the project. 
See [Telemetry](#-telemetry) for details and opt-out options.\n\n### New Evaluation Harnesses\n- **TAU2-Bench** (`tau2-bench`): Conversational agents in dual-control environments (telecom, airline, retail)\n- **RULER** (`long-context-eval`): Long-context evaluation with configurable sequence lengths (4K to 1M tokens)\n- **CoDec** (`contamination-detection`): Contamination detection - practical and accurate method to detect and quantify training data contamination in large language models\n- **MTEB** (`mteb`): Massive Text Embedding Benchmark\n\n---\n\n## [📖 Documentation](https://docs.nvidia.com/nemo/evaluator/latest/)\n\nNeMo Evaluator SDK is an open-source platform for robust, reproducible, and scalable evaluation of Large Language Models. It enables you to run hundreds of benchmarks across popular evaluation harnesses against any OpenAI-compatible model API. Evaluations execute in open-source Docker containers for auditable and trustworthy results. The platform's containerized architecture allows for the rapid integration of public benchmarks and private datasets.\n\nNeMo Evaluator SDK is built on four core principles to provide a reliable and versatile evaluation experience:\n\n- **Reproducibility by Default**: All configurations, random seeds, and software provenance are captured automatically for auditable and repeatable evaluations.\n- **Scale Anywhere**: Run evaluations from a local machine to a Slurm cluster or cloud-native backends like Lepton AI without changing your workflow.\n- **State-of-the-Art Benchmarking**: Access a comprehensive suite of over 100 benchmarks from 18 popular open-source evaluation harnesses. 
See the full list of [Supported benchmarks and evaluation harnesses](#-supported-benchmarks-and-evaluation-harnesses).\n- **Extensible and Customizable**: Integrate new evaluation harnesses, add custom benchmarks with proprietary data, and define custom result exporters for existing MLOps tooling.\n\n## ⚙️ How It Works: Launcher and Core Engine\n\nThe platform consists of two main components:\n\n- **`nemo-evaluator` ([The Evaluation Core Engine](https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator/index.html))**: A Python library that manages the interaction between an evaluation harness and the model being tested.\n- **`nemo-evaluator-launcher` ([The CLI and Orchestration](https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator-launcher/index.html))**: The primary user interface and orchestration layer. It handles configuration, selects the execution environment, and launches the appropriate container to run the evaluation.\n\nMost users typically interact with `nemo-evaluator-launcher`, which serves as a universal gateway to different benchmarks and harnesses. However, it is also possible to interact directly with `nemo-evaluator` by following this [guide](https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator/workflows/cli.html).\n\n\n## 📊 Supported Benchmarks and Evaluation Harnesses\n\nNeMo Evaluator Launcher provides pre-built evaluation containers for different evaluation harnesses through the NVIDIA NGC catalog. Each harness supports a variety of benchmarks, which can then be called via `nemo-evaluator`. This table provides a list of benchmark names per harness. 
A more detailed list of task names can be found in the [list of NGC containers](https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator/containers/index.html).\n\n| Container | Description | NGC Catalog | Latest Tag | Supported benchmarks |\n|-----------|-------------|-------------|------------| ------------|\n| **bfcl** | Function calling | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) | `26.03` | BFCL v2 and v3 |\n| **bigcode-evaluation-harness** | Code generation evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) | `26.03` | MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) |\n| **compute-eval** | CUDA code evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/compute-eval) | `26.03` | CCCL, Combined Problems, CUDA |\n| **CoDec** | Contamination detection | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/contamination-detection) | `26.03` | CoDec |\n| **garak** | Safety and vulnerability testing | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) | `26.03` | Garak |\n| **genai-perf** | GenAI performance benchmarking | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/genai-perf) | `26.03` | GenAI Perf Generation \u0026 Summarization |\n| **helm** | Holistic evaluation framework | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) | `26.03` | MedHelm |\n| **hle** | Academic knowledge and problem solving | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) | `26.03` | HLE |\n| **ifbench** | Instruction following | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) | `26.03` | IFBench |\n| **livecodebench** | Coding | 
[Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) | `26.03` | LiveCodeBench (v1-v6, 0724_0125, 0824_0225) |\n| **lm-evaluation-harness** | Language model benchmarks | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) | `26.03` | ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MBPP+, MINERVA Math, RACE, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-Pro, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande |\n| **long-context-eval** | Long context evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/long-context-eval) | `26.03` | Ruler |\n| **mmath** | Multilingual math reasoning | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) | `26.03` | EN, ZH, AR, ES, FR, JA, KO, PT, TH, VI |\n| **mtbench** | Multi-turn conversation evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) | `26.03` | MT-Bench |\n| **MTEB** | Multimodal toolbox for evaluating embeddings and retrieval systems | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mteb) | `26.03` | fiqa, miracl, ViDoRe |\n| **nemo-skills** | Language model benchmarks (science, math, agentic)  | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo-skills) | `26.03` | AIME 24 \u0026 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro |\n| **profbench** | Professional domains in Business and Scientific Research | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench) | `26.03` | ProfBench |\n| **safety-harness** | Safety and bias evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) | `26.03` | Aegis v2, 
WildGuard |\n| **scicode** | Coding for scientific research | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) | `26.03` | SciCode |\n| **simple-evals** | Common evaluation tasks | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) | `26.03` | GPQA-D, MATH-500, AIME 24 \u0026 25, HumanEval, HumanEval+, MGSM, MMLU (also multilingual), MMLU-Pro, MMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA, BrowseComp, HealthBench |\n| **tau2-bench** | TAU2 benchmark evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tau2-bench) | `26.03` | TAU2-Bench telecom, airline, retail |\n| **tooltalk** | Tool usage evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) | `26.03` | ToolTalk |\n| **vlmevalkit** | Vision-language model evaluation | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) | `26.03` | AI2D, ChartQA, MMMU, MathVista-MINI, OCRBench, SlideVQA |\n\n\u003c!-- BEGIN AUTOGENERATION --\u003e\n\u003c!-- mapping toml checksum: sha256:684a594af1f5dbd089d2eb04366579a6ecd43a02cdd09770006badc1aa2325d7 --\u003e\n\u003c!--\n| Container | Description | NGC Catalog | Latest Tag | Arch | Supported benchmarks |\n|-----------|-------------|-------------|------------|------|----------------------|\n| **AA-LCR** | A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer). | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/aa-lcr?version=26.03) | `26.03` | `multiarch` | aa_lcr |\n| **bfcl** | The Berkeley Function Calling Leaderboard V3 (also called Berkeley Tool Calling Leaderboard V3) evaluates the LLM's ability to call functions (aka tools) accurately. 
| [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl?version=26.03) | `26.03` | `multiarch` | bfclv3, bfclv3_ast, bfclv3_ast_prompting, bfclv2, bfclv2_ast, bfclv2_ast_prompting |\n| **bigcode-evaluation-harness** | A framework for the evaluation of autoregressive code generation language models. | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness?version=26.03) | `26.03` | `multiarch` | humaneval, humaneval_instruct, humanevalplus, mbpp-chat, mbpp-completions, mbppplus-chat, mbppplus-completions, mbppplus_nemo, multiple-py, multiple-sh, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-js, multiple-jl, multiple-lua, multiple-pl, multiple-php, multiple-r, multiple-rkt, multiple-rb, multiple-rs, multiple-scala, multiple-swift, multiple-ts |\n| **codec** | Contamination detection framework for evaluating language models | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/contamination-detection?version=26.03) | `26.03` | `amd` | mmlu_test, gpqa_diamond, gsm8k_train, gsm8k_test, ifeval, mmlu_pro_test, openai_humaneval, frames, hellaswag_test, hellaswag_train, aime_2025, aime_2024, livecodebench_v1, livecodebench_v5, bfcl_v3, bbq, reward_bench_v1, reward_bench_v2, math_500_problem, math_500_solution, swebench_test, swebench_train, hle, ifbench, scicode, terminalbench, taubench |\n| **garak** | Garak is an LLM vulnerability scanner. | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak?version=26.03) | `26.03` | `multiarch` | garak, garak-completions |\n| **genai_perf_eval** | GenAI Perf is a tool to evaluate the performance of LLM endpoints, based on GenAI Perf. 
| [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/genai-perf?version=26.03) | `26.03` | `amd` | genai_perf_summarization, genai_perf_generation, genai_perf_summarization_completions, genai_perf_generation_completions |\n| **helm** | A framework for evaluating large language models in medical applications across various healthcare tasks | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm?version=26.03) | `26.03` | `amd` | medcalc_bench, medec, head_qa, medbullets, pubmed_qa, ehr_sql, race_based_med, medhallu, mtsamples_replicate, aci_bench, mtsamples_procedures, medication_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medi_qa |\n| **hle** | Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle?version=26.03) | `26.03` | `multiarch` | hle, hle_aa_v2 |\n| **ifbench** | IFBench is a new, challenging benchmark for precise instruction following. | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench?version=26.03) | `26.03` | `multiarch` | ifbench, ifbench_aa_v2 |\n| **livecodebench** | Holistic and Contamination Free Evaluation of Large Language Models for Code. 
| [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench?version=26.03) | `26.03` | `multiarch` | codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, codegeneration_notfast, testoutputprediction, codeexecution_v2, codeexecution_v2_cot, livecodebench_0724_0125, livecodebench_aa_v2, livecodebench_0824_0225 |\n| **lm-evaluation-harness** | This project provides a unified framework to test generative language models on a large number of different evaluation tasks. | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness?version=26.03) | `26.03` | `multiarch` | mmlu, mmlu_instruct, mmlu_instruct_completions, mmlu_cot_0_shot_chat, ifeval, mmlu_pro, mmlu_pro_instruct, mmlu_redux, mmlu_redux_instruct, m_mmlu_id_str_chat, m_mmlu_id_str_completions, gsm8k, gsm8k_cot_instruct, gsm8k_cot_zeroshot, gsm8k_cot_llama, gsm8k_cot_zeroshot_llama, humaneval_instruct, mbpp_plus_chat, mbpp_plus_completions, mgsm, mgsm_cot_chat, mgsm_cot_completions, wikilingua, winogrande, arc_challenge, arc_challenge_chat, hellaswag, truthfulqa, bbh, bbh_instruct, musr, gpqa, gpqa_diamond_cot, commonsense_qa, openbookqa, mmlu_logits, piqa, social_iqa, adlr_agieval_en_cot, adlr_math_500_4_shot_sampled, adlr_race, adlr_truthfulqa_mc2, adlr_arc_challenge_llama_25_shot, adlr_gpqa_diamond_cot_5_shot, adlr_mmlu, adlr_mmlu_pro_5_shot_base, adlr_minerva_math_nemo_4_shot, adlr_gsm8k_cot_8_shot, adlr_humaneval_greedy, adlr_humaneval_sampled, adlr_mbpp_sanitized_3_shot_greedy, adlr_mbpp_sanitized_3_shot_sampled, adlr_global_mmlu_lite_5_shot, adlr_mgsm_native_cot_8_shot, adlr_commonsense_qa_7_shot, adlr_winogrande_5_shot, bbq_chat, bbq_completions, arc_multilingual, hellaswag_multilingual, mmlu_prox_chat, mmlu_prox_completions, mmlu_prox_fr_chat, mmlu_prox_fr_completions, mmlu_prox_de_chat, 
mmlu_prox_de_completions, mmlu_prox_it_chat, mmlu_prox_it_completions, mmlu_prox_ja_chat, mmlu_prox_ja_completions, mmlu_prox_es_chat, mmlu_prox_es_completions, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, agieval, wikitext |\n| **mmath** | MMATH is a new benchmark specifically designed for multilingual complex reasoning. It comprises 374 carefully selected math problems from high-quality sources, including AIME, CNMO, and MATH-500, and covers ten typologically and geographically diverse languages. Each problem is translated and validated through a rigorous pipeline that combines frontier LLMs with human verification, ensuring semantic consistency. 
| [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath?version=26.03) | `26.03` | `multiarch` | mmath_en, mmath_zh, mmath_ar, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi |\n| **mtbench** | MT-bench is designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions to differentiate models. | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench?version=26.03) | `26.03` | `multiarch` | mtbench, mtbench-cor1 |\n| **nemo_skills** | NeMo Skills - a project to improve skills of LLMs | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/nemo-skills?version=26.03) | `26.03` | `multiarch` | ns_aime2024, ns_aime2025, ns_aime2026, ns_hmmt_feb2025, ns_gpqa, ns_bfcl_v3, ns_bfcl_v4, ns_livecodebench, ns_livecodebench_v5, ns_livecodebench_aa, ns_hle, ns_hle_aa, ns_ruler, ns_mmlu, ns_mmlu_pro, ns_arena_hard_v2, ns_scicode, ns_aa_lcr, ns_ifbench, ns_wmt24pp, ns_wmt24pp_comet, ns_ifeval, ns_mmlu_prox, ns_mmmu_pro, ns_critpt, ns_omniscience |\n| **profbench** | Professional domain benchmark for evaluating LLMs on Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA tasks | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/profbench?version=26.03) | `26.03` | `multiarch` | report_generation, llm_judge |\n| **ruler** | RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. 
| [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/long-context-eval?version=26.03) | `26.03` | `multiarch` | ruler-chat, ruler-completions, ruler-4k-chat, ruler-4k-completions, ruler-8k-chat, ruler-8k-completions, ruler-16k-chat, ruler-16k-completions, ruler-32k-chat, ruler-32k-completions, ruler-64k-chat, ruler-64k-completions, ruler-128k-chat, ruler-128k-completions, ruler-256k-chat, ruler-256k-completions, ruler-512k-chat, ruler-512k-completions, ruler-1m-chat, ruler-1m-completions |\n| **safety_eval** | Harness for Safety evaluations | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness?version=26.03) | `26.03` | `multiarch` | aegis_v2, aegis_v2_completions, aegis_v2_reasoning, wildguard, wildguard_completions, compliance |\n| **scicode** | SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode?version=26.03) | `26.03` | `multiarch` | scicode, scicode_background, scicode_aa_v2 |\n| **simple_evals** | simple-evals - a lightweight library for evaluating language models. 
| [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals?version=26.03) | `26.03` | `multiarch` | AIME_2025, AIME_2024, AA_AIME_2024, AA_math_test_500, math_test_500, mgsm, humaneval, humanevalplus, mmlu_pro, mmlu_am, mmlu_ar, mmlu_bn, mmlu_cs, mmlu_de, mmlu_el, mmlu_en, mmlu_es, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_ha, mmlu_he, mmlu_hi, mmlu_id, mmlu_ig, mmlu_it, mmlu_ja, mmlu_ko, mmlu_ky, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pt, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_ar-lite, mmlu_bn-lite, mmlu_de-lite, mmlu_en-lite, mmlu_es-lite, mmlu_fr-lite, mmlu_hi-lite, mmlu_id-lite, mmlu_it-lite, mmlu_ja-lite, mmlu_ko-lite, mmlu_my-lite, mmlu_pt-lite, mmlu_sw-lite, mmlu_yo-lite, mmlu_zh-lite, mmlu, gpqa_diamond, gpqa_extended, gpqa_main, simpleqa, aime_2025_nemo, aime_2024_nemo, math_test_500_nemo, gpqa_diamond_nemo, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_aa_v2, AIME_2025_aa_v2, mgsm_aa_v2, mmlu_pro_aa_v2, mmlu_llama_4, mmlu_pro_llama_4, healthbench, healthbench_consensus, healthbench_hard, browsecomp, gpqa_diamond_aa_v3, mmlu_pro_aa_v3 |\n| **tau2_bench** | Evaluating Conversational Agents in a Dual-Control Environment | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tau2-bench?version=26.03) | `26.03` | `multiarch` | tau2_bench_telecom, tau2_bench_airline, tau2_bench_retail |\n| **tooltalk** | ToolTalk is designed to evaluate tool-augmented LLMs as a chatbot. ToolTalk contains a handcrafted dataset of 28 easy conversations and 50 hard conversations. | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk?version=26.03) | `26.03` | `multiarch` | tooltalk |\n| **vlmevalkit** | VLMEvalKit is an open-source evaluation toolkit of large vision-language models (LVLMs). 
It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation under multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs, and provide the evaluation results obtained with both exact matching and LLM-based answer extraction. | [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit?version=26.03) | `26.03` | `amd` | ai2d_judge, chartqa, mathvista-mini, mmmu_judge, ocrbench, ocr_reasoning, slidevqa |\n--\u003e\n\u003c!-- END AUTOGENERATION --\u003e\n\n## 🚀 Quickstart\n\nGet your first evaluation result in minutes. This guide uses your local machine to run a small benchmark against an OpenAI API-compatible endpoint.\n\n### 1. Install the Launcher\n\nThe launcher is the only package required to get started.\n\n```bash\npip install nemo-evaluator-launcher\n```\n\n### 2. Set Up Your Model Endpoint\n\nNeMo Evaluator works with any model that exposes an OpenAI-compatible endpoint. For this quickstart, we will use the OpenAI API.\n\n**What is an OpenAI-compatible endpoint?** A server that exposes /v1/chat/completions and /v1/completions endpoints, matching the OpenAI API specification.\n\n**Options for model endpoints:**\n\n- **Hosted endpoints** (fastest): Use ready-to-use hosted models from providers like [build.nvidia.com](https://build.nvidia.com) that expose OpenAI-compatible APIs with no hosting required.\n- **Self-hosted options**: Host your own models using tools like NVIDIA NIM, vLLM, or TensorRT-LLM for full control over your evaluation environment.\n- **Models trained with NeMo framework**: Host your models trained with NeMo framework by deploying them as OpenAI-compatible endpoints using [NeMo Export-Deploy](https://github.com/nvidia-nemo/export-deploy/tree/main). 
A more detailed user guide is available [here](https://github.com/nvidia-nemo/evaluator/tree/main/docs/nemo-fw).\n\n\u003c!-- TODO(martas): uncomment once publish --\u003e\n\u003c!-- For detailed setup instructions including self-hosted configurations, see the [tutorials](https://docs.nvidia.com/nemo/evaluator/latest/tutorials/). --\u003e\n\n**Getting an NGC API Key for build.nvidia.com:**\n\nTo use out-of-the-box build.nvidia.com APIs, you need an API key:\n\n1. Register an account at [build.nvidia.com](https://build.nvidia.com).\n2. In the Setup menu under Keys/Secrets, generate an API key.\n3. Set the environment variable by executing `export NGC_API_KEY=\u003cYOUR_API_KEY\u003e`.\n\n### 3. Run Your First Evaluation\n\nRun a small evaluation on your local machine. The launcher automatically pulls the correct container and executes the benchmark. The benchmarks to run are listed directly in the YAML configuration file.\n\n**Configuration Examples**: Explore ready-to-use configuration files in [`packages/nemo-evaluator-launcher/examples/`](./packages/nemo-evaluator-launcher/examples/) for local, Lepton, and Slurm deployments with various model hosting options (vLLM, NIM, hosted endpoints).\n\nOnce you have the example configuration file, either by cloning this repository or downloading one directly such as `local_basic.yaml`, you can run the following command:\n\n\n```bash\nnemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_basic.yaml -o execution.output_dir=\u003cYOUR_OUTPUT_LOCAL_DIR\u003e\n```\n\nAfter running this command, you will see a `job_id`, which can be used to track the job and its results. All logs will be available in your `\u003cYOUR_OUTPUT_LOCAL_DIR\u003e`.\n\n### 4. Check Your Results\n\nResults, logs, and run configurations are saved locally. 
Inspect the status of the evaluation job by using the corresponding `job_id`:\n\n```bash\nnemo-evaluator-launcher status \u003cjob_id_or_invocation_id\u003e\n```\n\n## Agentic Skills\n\nNeMo Evaluator provides [Agent Skills](https://agentskills.io/) for interactive assistance.\n\n| Skill | Description |\n|-------|-------------|\n| `nel-assistant` | Interactive config wizard for creating and modifying evaluation configs |\n| `launching-evals` | Run, monitor, debug, and analyze evaluations |\n| `accessing-mlflow` | Query and browse evaluation results stored in MLflow |\n| `nemo-evaluator-byob` | Create custom LLM evaluation benchmarks using the BYOB decorator framework |\n\nWe recommend using the skills with **Claude Sonnet or better** for the best experience.\n\n### Install via `nel` CLI\n\n```bash\nnel skills add [--claude] [--cursor] [--codex] [--opencode]\n```\n\nUse `--project` to install into the current project directory instead of your home directory. See `nel skills add --help` for all options.\n\n### Claude Code Marketplace\n\n```bash\n/plugin marketplace add NVIDIA-NeMo/Evaluator\n/plugin install nel-assistant@NVIDIA-NeMo/Evaluator\n```\n\n## 🤝 Contribution Guide\n\nWe welcome community contributions. Please see our [Contribution Guide](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/CONTRIBUTING.md) for instructions on submitting pull requests, reporting issues, and suggesting features.\n\n\n## 📡 Telemetry\n\nNeMo Evaluator collects telemetry to help improve the project.\n\n**All telemetry events are collected anonymously.**\n\n### Event: `EvaluationTaskEvent` (from `nemo-evaluator`)\n\n| Field | Description |\n|---|---|\n| `task` | Evaluated task/benchmark name. |\n| `frameworkName` | Evaluation framework name (for example `lm-eval`, `helm`). |\n| `model` | Model name used for evaluation (redacted at level 1). |\n| `executionDurationSeconds` | Evaluation duration in seconds. |\n| `status` | Task status: `started`, `success`, or `failure`. 
|\n\n### Event: `LauncherJobEvent` (from `nemo-evaluator-launcher`)\n\n| Field | Description |\n|---|---|\n| `executorType` | Launcher executor backend (`local`, `slurm`, `lepton`, etc.). |\n| `deploymentType` | Deployment type (`none`, `vllm`, `sglang`, `nim`, etc.). |\n| `model` | Model name for the launched run (redacted at level 1). |\n| `tasks` | List of requested evaluation tasks. |\n| `exporters` | List of configured exporters. |\n| `status` | Job status: `started`, `success`, or `failure`. |\n\n### Telemetry Controls\n\n| Control | Effect |\n|---|---|\n| `NEMO_EVALUATOR_TELEMETRY_LEVEL=0` | Disables telemetry. |\n| `NEMO_EVALUATOR_TELEMETRY_LEVEL=1` | Usage data only. |\n| `NEMO_EVALUATOR_TELEMETRY_LEVEL=2` | Usage data + model ID (default). |\n| `nemo-evaluator-launcher config set telemetry.level \u003c0\\|1\\|2\u003e` | Persists telemetry level to config file. |\n\n### Aggregate Reporting\n\nWe may share aggregated telemetry trends with the community (for example, popularity of models, tasks, and execution backends). Aggregates are anonymous and are not used to track individual users.\n\n\n## 📄 License\n\nThis project is licensed under the Apache License 2.0. 
See the [LICENSE](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/LICENSE) file for details.\n\n\n## 📞 Support\n\n- **Issues**: [GitHub Issues](https://github.com/NVIDIA-NeMo/Evaluator/issues)\n- **Discussions**: [GitHub Discussions](https://github.com/NVIDIA-NeMo/Evaluator/discussions)\n- **Documentation**: [NeMo Evaluator Documentation](https://docs.nvidia.com/nemo/evaluator/latest/)\n\n\n## 🐛 Known issues\n\n- `nel ls` might require Docker authentication and currently does not support fetching credentials from known password management systems such as macOS's Keychain or GNOME Keyring.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia-nemo%2Fevaluator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvidia-nemo%2Fevaluator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia-nemo%2Fevaluator/lists"}