# GPassK: Are Your LLMs Capable of Stable Reasoning?

<div align="center">

[📄[ArXiv Paper](http://arxiv.org/abs/2412.13147)]
[📚[LeaderBoard](https://open-compass.github.io/GPassK/)]
</div>

<div align="center">
  <img src="assets/pass-at-k-v-s-greedy-g-pass-at-k.png" width="800"/>
</div>

## 🚀 News
- **[2025.2.28]** 🔥 We provide a **[Python Implementation](#use_in_your_pro)** and an **[Evaluation Framework](#use_in_lighteval)** using **[Lighteval](https://github.com/huggingface/lighteval)**.
- **[2025.2.13]** 🔥 We release new results on LiveMathBench, MATH, and AIME24/25.
- **[2025.1.10]** 🔥 We release a small-scale judge model, **[LiveMath-Judge](https://huggingface.co/jnanliu/LiveMath-Judge)**.
- **[2025.1.6]** 🔥 **[LiveMathBench](https://huggingface.co/datasets/opencompass/LiveMathBench)** can now be accessed through Hugging Face, and you can evaluate your LLMs on it using G-Pass@k in OpenCompass. We have addressed potential errors in LiveMathBench and inconsistencies in the sampling parameters. Please also refer to the updated version of the **[Paper](http://arxiv.org/abs/2412.13147)** for further details.
- **[2024.12.18]** 🎉 We release the **[ArXiv Paper](http://arxiv.org/abs/2412.13147)** of G-Pass@k.


## ☀️ Introduction

**G-Pass@k** is a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both a model's peak performance potential and its stability. It comes with **LiveMathBench**, a dynamic benchmark of challenging, contemporary mathematical problems designed to minimize the risk of data leakage during evaluation. To track the latest performance and stability of LLMs, we will keep updating the benchmark with new competition-level mathematical problems and publish up-to-date G-Pass@k results of models on it.


## 🌲 Definition of G-Pass@k
$$ \text{G-Pass@}k = \mathbb{E}_{\text{Questions}} \left[ \frac{{c \choose k}}{{n \choose k}} \right] $$

where $n$ represents the total number of generations per question, and $c$ denotes the number of generations resulting in correct solutions.

$$ \text{G-Pass@}k_{\tau} = \mathbb{E}_{\text{Questions}} \left[ \sum_{j = \lceil \tau \cdot k \rceil}^{c} \frac{\binom{c}{j} \cdot \binom{n - c}{k - j}}{\binom{n}{k}} \right] $$

where $\lceil \tau \cdot k \rceil$ denotes the smallest integer greater than or equal to $\tau \cdot k$.

$$ \text{mG-Pass@}k = 2\int_{0.5}^{1.0} \text{G-Pass@}k_{\tau} \, d\tau = \frac{2}{k} \sum_{i= \lceil 0.5 \cdot k \rceil + 1}^{k} \text{G-Pass@}k_{\frac{i}{k}} $$

Intuitively, $\text{mG-Pass@}k$ provides an interpolated estimate of the area under the $\text{G-Pass@}k_{\tau}$ curve, serving as a comprehensive metric that integrates all $\text{G-Pass@}k_{\tau}$ values for $\tau \in [0.5, 1.0]$.
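As a quick numeric check of the definitions above, $\text{G-Pass@}k_{\tau}$ for a single question can be computed with exact binomial coefficients. This is a minimal self-contained sketch, not the repository's implementation:

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Probability that at least ceil(tau * k) of k samples drawn without
    replacement from n generations (c of which are correct) are correct."""
    m = max(ceil(tau * k), 1)
    if m > min(c, k):
        return 0.0
    return sum(
        comb(c, j) * comb(n - c, k - j) for j in range(m, min(c, k) + 1)
    ) / comb(n, k)


# n = 8 generations, c = 4 correct, k = 4 drawn:
print(g_pass_at_k(8, 4, 4, 1.0))  # C(4,4)/C(8,4) = 1/70, ~0.0143
print(g_pass_at_k(8, 4, 4, 0.5))  # 53/70, ~0.757
```

Note how demanding $\tau = 1.0$ (all drawn samples correct) is compared to $\tau = 0.5$ for the same model behavior; this gap is exactly what the stability tables below measure.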
## 📚 Main Results

* ⚽: General Models
* 🏐: Math Models
* 🏀: o1-like Models

### *LiveMathBench-202412*

|LLMs|Greedy|G-Pass@16_0.5|G-Pass@16_0.75|G-Pass@16_1.0|mG-Pass@16|
|--|--|--|--|--|--|
|Llama-3.1-8B-Instruct ⚽|24.0|18.2|11.3|4.5|10.4|
|Qwen2.5-7B-Instruct ⚽|37.0|36.5|27.2|16.0|25.8|
|Llama-3.3-70B-Instruct ⚽|40.3|36.2|28.9|19.1|27.5|
|InternLM3-8B-Instruct ⚽|44.5|43.0|35.4|23.0|33.6|
|Claude-3.5-Sonnet ⚽|46.7|44.1|36.2|26.6|35.3|
|Mistral-Large-Instruct-2411 ⚽|41.6|39.4|37.1|32.9|36.4|
|Qwen2.5-Math-7B-Instruct 🏐|68.4|44.1|38.3|28.1|36.6|
|Qwen2.5-32B-Instruct ⚽|50.8|47.3|39.6|29.0|37.8|
|Qwen2.5-Max ⚽|52.9|52.7|44.3|31.1|42.2|
|Qwen2.5-Math-72B-Instruct 🏐|57.6|52.7|45.4|27.9|42.3|
|DeepSeek-Distill-Llama-8B 🏀|58.4|67.8|56.8|31.9|52.2|
|QwQ-32B-Preview 🏀|72.7|74.9|65.8|40.1|61.2|
|DeepSeek-Distill-Qwen-7B 🏀|65.6|73.0|66.4|48.4|63.1|
|OpenAI-o1-mini 🏀|74.1|76.3|67.3|48.3|64.8|
|DeepSeek-Distill-Qwen-32B 🏀|67.7|81.2|72.3|54.5|69.7|
|DeepSeek-Distill-Llama-70B 🏀|74.8|80.8|73.0|53.0|69.7|
|OpenAI-o3-mini 🏀|84.7|85.7|78.8|65.3|76.8|
|DeepSeek-R1 🏀|81.1|83.6|79.1|69.5|77.6|


### *LiveMathBench-Hard-202412*

|LLMs|Greedy|G-Pass@16_0.5|G-Pass@16_0.75|G-Pass@16_1.0|mG-Pass@16|
|--|--|--|--|--|--|
|Llama-3.1-8B-Instruct ⚽|2.2|0.8|0.0|0.0|0.0|
|Qwen2.5-7B-Instruct ⚽|13.3|6.2|3.2|2.2|3.3|
|Qwen2.5-Math-7B-Instruct 🏐|15.6|8.2|3.3|2.2|3.8|
|QwQ-32B-Preview 🏀|15.6|5.9|4.4|2.4|4.0|
|Llama-3.3-70B-Instruct ⚽|4.4|7.8|4.8|2.4|4.6|
|DeepSeek-Distill-Llama-8B 🏀|8.9|16.1|5.6|2.4|6.2|
|Llama-3.1-70B-Instruct ⚽|4.4|12.3|7.4|2.7|6.9|
|InternLM3-8B-Instruct ⚽|11.1|10.7|8.2|2.7|7.0|
|Qwen2.5-Math-72B-Instruct 🏐|11.1|11.8|7.9|5.9|7.9|
|DeepSeek-Distill-Qwen-7B 🏀|17.8|13.9|8.8|3.3|8.1|
|OpenAI-o1-mini 🏀|18.4|21.0|10.1|0.5|8.5|
|Qwen2.5-32B-Instruct ⚽|13.3|14.1|10.5|3.5|9.1|
|Qwen2.5-72B-Instruct ⚽|17.8|15.3|11.3|5.4|10.5|
|DeepSeek-Distill-Qwen-32B 🏀|22.2|29.9|16.9|3.3|15.1|
|DeepSeek-Distill-Llama-70B 🏀|35.6|33.1|19.0|5.8|17.3|
|OpenAI-o3-mini 🏀|43.3|47.4|32.5|7.7|28.6|
|DeepSeek-R1 🏀|42.2|46.6|33.6|9.8|29.6|

### *MATH500-L5*

|LLMs|Greedy|G-Pass@16_0.5|G-Pass@16_0.75|G-Pass@16_1.0|mG-Pass@16|
|--|--|--|--|--|--|
|Llama-3.1-8B-Instruct ⚽|26.1|17.8|10.7|3.5|9.7|
|Llama-3.1-70B-Instruct ⚽|39.6|41.8|32.1|16.1|29.3|
|InternLM3-8B-Instruct ⚽|51.5|49.9|40.3|26.9|38.3|
|Qwen2.5-7B-Instruct ⚽|56.0|54.9|43.3|28.0|41.5|
|Llama-3.3-70B-Instruct ⚽|54.5|55.4|49.5|35.0|47.3|
|Qwen2.5-72B-Instruct ⚽|63.4|62.5|54.4|44.9|53.1|
|Qwen2.5-Max ⚽|63.4|65.8|57.3|38.9|54.5|
|Qwen2.5-32B-Instruct ⚽|64.2|66.6|59.4|41.0|55.6|
|Qwen2.5-Math-72B-Instruct 🏐|71.6|64.9|59.4|46.0|57.4|
|Qwen2.5-Math-7B-Instruct 🏐|65.7|65.0|62.2|57.6|61.5|
|DeepSeek-Distill-Llama-8B 🏀|65.7|79.5|70.0|39.5|64.5|
|QwQ-32B-Preview 🏀|82.8|87.2|78.8|57.4|75.6|
|DeepSeek-Distill-Qwen-7B 🏀|78.4|87.9|80.5|62.6|77.6|
|DeepSeek-Distill-Qwen-32B 🏀|83.6|89.9|83.8|70.4|81.9|
|DeepSeek-Distill-Llama-70B 🏀|87.3|89.6|85.5|66.8|81.9|

### *AIME2024-45*

|LLMs|Greedy|G-Pass@16_0.5|G-Pass@16_0.75|G-Pass@16_1.0|mG-Pass@16|
|--|--|--|--|--|--|
|Llama-3.1-8B-Instruct ⚽|4.4|2.2|1.6|0.0|1.2|
|Qwen2.5-Math-7B-Instruct 🏐|11.1|4.6|2.6|2.2|3.7|
|Qwen2.5-32B-Instruct ⚽|11.1|7.1|3.4|2.2|3.7|
|InternLM3-8B-Instruct ⚽|11.1|7.2|4.3|1.0|3.7|
|Qwen2.5-7B-Instruct ⚽|11.1|8.9|8.1|4.7|7.5|
|Llama-3.1-70B-Instruct ⚽|15.6|15.0|8.1|3.0|8.0|
|Qwen2.5-Max ⚽|22.2|15.5|9.9|5.3|9.8|
|Qwen2.5-72B-Instruct ⚽|13.3|13.7|12.9|7.5|11.7|
|Qwen2.5-Math-72B-Instruct 🏐|20.0|18.7|16.2|6.7|14.1|
|Llama-3.3-70B-Instruct ⚽|22.2|25.3|18.2|6.9|16.4|
|QwQ-32B-Preview 🏀|44.4|41.0|28.6|8.1|24.7|
|DeepSeek-Distill-Llama-8B 🏀|44.4|53.9|30.4|9.0|28.0|
|DeepSeek-Distill-Qwen-7B 🏀|44.4|56.3|35.4|17.5|33.8|
|OpenAI-o1-mini 🏀|60.3|62.2|53.3|15.6|43.1|
|DeepSeek-Distill-Llama-70B 🏀|62.2|72.9|63.4|32.2|57.6|
|DeepSeek-Distill-Qwen-32B 🏀|62.2|77.0|66.5|31.3|59.3|

### *AIME2025*

|LLMs|Greedy|G-Pass@16_0.5|G-Pass@16_0.75|G-Pass@16_1.0|mG-Pass@16|
|--|--|--|--|--|--|
|Llama-3.1-8B-Instruct ⚽|0.0|0.0|0.0|0.0|0.0|
|Llama-3.1-70B-Instruct ⚽|6.7|4.6|0.2|0.0|0.7|
|InternLM3-8B-Instruct ⚽|13.3|6.7|0.1|0.0|0.8|
|Qwen2.5-32B-Instruct ⚽|20.0|11.5|0.2|0.0|1.4|
|Qwen2.5-7B-Instruct ⚽|6.7|9.7|6.2|0.2|4.7|
|Qwen2.5-72B-Instruct ⚽|20.0|12.2|5.8|0.1|4.9|
|Llama-3.3-70B-Instruct ⚽|6.7|6.7|6.6|0.5|5.0|
|Qwen2.5-Math-7B-Instruct 🏐|20.0|8.7|6.7|6.7|6.8|
|Qwen2.5-Max ⚽|13.3|11.9|6.8|2.9|6.8|
|Qwen2.5-Math-72B-Instruct 🏐|13.3|13.3|13.3|13.3|13.3|
|Gemini-2.0-Flash-Exp ⚽|26.7|26.5|21.5|14.0|21.2|
|QwQ-32B-Preview 🏀|26.7|34.5|32.4|15.6|28.1|
|OpenAI-o1-mini 🏀|46.7|39.9|32.5|14.0|28.4|
|DeepSeek-Distill-Llama-8B 🏀|40.0|40.4|21.2|7.9|21.0|
|DeepSeek-Distill-Qwen-7B 🏀|46.7|46.6|38.3|22.7|36.1|
|DeepSeek-Distill-Llama-70B 🏀|46.7|52.5|38.6|26.8|37.4|
|DeepSeek-R1 🏀|66.7|52.6|46.8|24.3|42.5|
|OpenAI-o3-mini 🏀|53.3|59.0|46.5|29.4|43.6|
|DeepSeek-Distill-Qwen-32B 🏀|46.7|59.7|50.2|29.5|47.3|

## 🖋<span id="use_in_your_pro">Use G-Pass@k in Your Project</span>

You can use the following class in your own project. You need to set the G-Pass@k parameters, such as `k`, `n`, and `thresholds`. Additionally, you must define a function that scores each sample pair, returning a binary (0 or 1) label for each pair of prediction and corresponding gold. The `compute` method then returns a dictionary containing the metrics for a single gold and its corresponding predictions. You can aggregate these metrics across your dataset as needed.
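For instance, a custom scoring function for math answers might compare only the final `\boxed{...}` expression. This `boxed_match` helper is purely illustrative and not part of the repository:

```python
import re


def boxed_match(pred: str, gold: str) -> int:
    """Illustrative scorer: compare the last \\boxed{...} in the prediction
    against the gold answer, falling back to whole-string comparison."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", pred)
    answer = matches[-1].strip() if matches else pred.strip()
    return 1 if answer == gold.strip() else 0
```

Any callable with this `(prediction, gold) -> 0/1` signature can be passed as `sample_scoring_function` to the class below.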
```python
import logging
from typing import Callable, List, Union

import numpy as np
from scipy.stats import hypergeom

logger = logging.getLogger(__name__)


def as_list(item) -> list:
    """Wrap a scalar in a list; leave lists unchanged."""
    return item if isinstance(item, list) else [item]


class GPassAtK:
    def __init__(
        self,
        k: Union[int, List[int]],
        n: int = None,
        thresholds: List[float] = [0.0, 0.25, 0.5, 0.75, 1.0],
        sample_scoring_function: Union[Callable[[str, str], float], str] = None,
    ):
        """Computes G-Pass@k from http://arxiv.org/abs/2412.13147

        Args:
            k (int, list): Number(s) of samples drawn per evaluation.
            n (int): Total number of samples generated per question.
            thresholds (list): Thresholds tau on the fraction of the k drawn
                samples that must be correct for an attempt to count as successful.
            sample_scoring_function (callable or str, optional): Function used to score each sample.
                Either pass a callable (taking a string prediction and a string gold,
                returning a score between 0 and 1), a string (one of `prefix`, `suffix`,
                or `full`) to select the type of exact match, or nothing to default to "full".
                    `prefix` checks whether the prediction starts with the gold,
                    `suffix` whether the prediction ends with the gold,
                    `full` whether prediction and gold are equal.
        """
        self.k = as_list(k)
        self.n = n
        self.thresholds = thresholds

        # Resolve the per-sample scoring function
        if callable(sample_scoring_function):
            self.score_sample = sample_scoring_function
            self.type_exact_match = None
        else:
            if isinstance(sample_scoring_function, str):
                if sample_scoring_function not in ["prefix", "suffix", "full"]:
                    raise ValueError(
                        f"sample_scoring_function must be one of prefix, suffix, or full. "
                        f"Was {sample_scoring_function} instead."
                    )
                self.type_exact_match = sample_scoring_function
            else:
                self.type_exact_match = "full"
            self.score_sample = self.default_sample_scoring

    def compute(self, predictions: List[str], golds: List[str], **kwargs) -> dict[str, float]:
        """Computes the metric for one single item with possibly many sampled predictions.
        It scores each prediction against the gold, then aggregates the scores into G-Pass@k.

        Args:
            predictions (list[str]): Sampled predictions for this item.
            golds (list[str]): Reference targets (exactly one is expected).

        Returns:
            dict[str, float]: Metric name mapped to its value for this item.
        """
        if len(golds) > 1:
            raise Exception("Cannot compute G-Pass@k with several golds")
        gold = golds[0]

        if self.n is None:
            self.n = len(predictions)
            logger.warning(
                "n undefined in G-Pass@k. Assuming it equals the sample's number of predictions."
            )
        elif len(predictions) < self.n:
            logger.warning(f"Number of predictions is less than {self.n} for G-Pass@k.")

        all_scores = [self.score_sample(pred, gold) for pred in predictions[: self.n]]
        return self.g_pass_at_k(all_scores)

    def default_sample_scoring(self, pred: str, gold: str) -> int:
        if self.type_exact_match == "prefix":
            return 1 if pred.startswith(gold) else 0
        if self.type_exact_match == "suffix":
            return 1 if pred.endswith(gold) else 0
        return 1 if gold == pred else 0

    def g_pass_at_k(self, all_scores: list[int]) -> dict[str, float]:
        """Computation of G-Pass@k, details in http://arxiv.org/abs/2412.13147"""
        c: int = sum(all_scores)
        n: int = self.n
        ks: List[int] = self.k
        thresholds: List[float] = self.thresholds

        def _compute_g_pass_at_k(n, c, k, m):
            # Hypergeometric tail: P(at least m of the k drawn samples are correct)
            if m > min(c, k) or k > n or c < 0 or n <= 0 or m < 0:
                return 0.0
            return hypergeom.sf(m - 1, n, c, k)

        def compute_g_pass_at_k(n, c, k, t):
            m = max(int(np.ceil(k * t)), 1)
            return _compute_g_pass_at_k(n, c, k, m)

        def compute_mg_pass_at_k(n, c, k):
            low, high = int(np.ceil(k * 0.5)), k

            mg_pass_at_k = 0.0
            for i in range(low + 1, high + 1):
                mg_pass_at_k += _compute_g_pass_at_k(n, c, k, i)
            mg_pass_at_k = 2 * mg_pass_at_k / k

            return mg_pass_at_k

        metrics = {}
        for k in ks:
            for t in thresholds:
                metrics[f"G-Pass@{k}_{t}"] = compute_g_pass_at_k(n, c, k, t)
            metrics[f"mG-Pass@{k}"] = compute_mg_pass_at_k(n, c, k)

        return metrics

    @property
    def all_metrics(self):
        ks: List[int] = self.k
        thresholds: List[float] = self.thresholds

        metrics = []
        for k in ks:
            for t in thresholds:
                metrics.append(f"G-Pass@{k}_{t}")
            metrics.append(f"mG-Pass@{k}")

        return metrics
```
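To aggregate per-item metric dictionaries across a dataset, a plain average per metric name is usually enough. A minimal sketch, where the `per_item` values are made-up placeholders standing in for `compute` outputs:

```python
from collections import defaultdict

# Made-up per-item outputs, one dict per question (as returned by compute)
per_item = [
    {"G-Pass@4_0.5": 0.75, "mG-Pass@4": 0.5},
    {"G-Pass@4_0.5": 0.25, "mG-Pass@4": 0.5},
]

# Sum each metric over items, then divide by the item count
totals = defaultdict(float)
for item in per_item:
    for name, value in item.items():
        totals[name] += value

dataset_metrics = {name: total / len(per_item) for name, total in totals.items()}
print(dataset_metrics)  # {'G-Pass@4_0.5': 0.5, 'mG-Pass@4': 0.5}
```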
## 🖋Use G-Pass@k in OpenCompass
[OpenCompass](https://github.com/open-compass/opencompass) is a toolkit for evaluating the performance of large language models (LLMs). To use G-Pass@k in OpenCompass, follow the steps below:

### 1. Prepare Environment
Follow these steps to ensure your environment is ready:

```bash
# Clone the main repository
git clone https://github.com/open-compass/GPassK.git
cd GPassK/opencompass

# Create and activate a conda environment with specific Python and PyTorch versions
conda create -n livemathbench-eval python=3.10 pytorch torchvision torchaudio pytorch-cuda -c nvidia -c pytorch -y
conda activate livemathbench-eval

# Install additional required packages
pip install loguru

# Clone and install OpenCompass for extended functionality
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .
```


### 2. Prepare Dataset
The LiveMathBench dataset is hosted on [Hugging Face](https://huggingface.co/datasets/opencompass/LiveMathBench). First, request and obtain access to the dataset via that link. Then, refer to [security-tokens](https://huggingface.co/docs/hub/security-tokens) to set up your HF token.


### 3. Deploy Judge Models
We use Qwen2.5-72B-Instruct as the judge model to assess the correctness of generated answers. We recommend deploying it as a service with tools such as [vllm](https://github.com/vllm-project/vllm) or [lmdeploy](https://github.com/InternLM/lmdeploy), so that multiple evaluation tasks can share one endpoint.

Below is an example of deploying the judge model with `lmdeploy`:
```bash
# --tp 4: at least 4 A100 or equivalent GPUs are required
lmdeploy serve api_server Qwen/Qwen2.5-72B-Instruct --server-port 8000 \
    --tp 4 \
    --cache-max-entry-count 0.9 \
    --log-level INFO
```
After setting up the judge model, define the URLs in `eval_urls` and `eval_model_name` within `opencompass_config_templates/*.py`. Adjust other parameters, such as `k`, `temperatures`, and `llm_infos`, according to your needs.

> [!NOTE]
> Omitting `eval_urls` falls back to an internal rule-based judge, which may only be suitable for datasets with numerical answers.

> [!TIP]
> 💡You can now use [LiveMath-Judge](https://huggingface.co/jnanliu/LiveMath-Judge) for judging, which greatly reduces deployment and inference costs.
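Before launching a full evaluation, it can help to send one chat-completion request to the judge endpoint. This is a hypothetical smoke test: it assumes the server above exposes an OpenAI-compatible `/v1/chat/completions` route on port 8000, and the URL and model name must match your own deployment:

```python
import json
import urllib.request

# Build a single judge-style request (values here are illustrative)
payload = {
    "model": "Qwen/Qwen2.5-72B-Instruct",
    "messages": [{"role": "user", "content": "Is 1 + 1 = 2? Answer yes or no."}],
    "temperature": 0.0,
}
request = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is up:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```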
### 4. Evaluation

To begin the evaluation, first generate the necessary configuration files by running the following script:
```bash
cd opencompass
python dump_opencompass_configs.py --config_template_file {config_templates/nono1.py|config_templates/o1.py|config_templates/close.py}
```

Upon execution, verify the generated configuration files located in `opencompass_configs/`:

```
.
├── deepseek-math-7b-rl_t0-3_p0-8_k50_rp1-0_rs42_l8192@LiveMathBench-v202412-k4_8_16-r3.py
├── deepseek-math-7b-rl_t0-5_p0-8_k50_rp1-0_rs42_l8192@LiveMathBench-v202412-k4_8_16-r3.py
├── deepseek-math-7b-rl_t0-7_p0-8_k50_rp1-0_rs42_l8192@LiveMathBench-v202412-k4_8_16-r3.py
├── deepseek-math-7b-rl_t1-0_p0-8_k50_rp1-0_rs42_l8192@LiveMathBench-v202412-k4_8_16-r3.py
```

These files follow a naming convention that reflects the model settings and dataset used:
```
[MODEL_ABBR]_t[TEMPERATURE]_p[TOP_P]_k[TOP_K]_rp[REPETITION_PENALTY]_rs[RANDOM_SEED]_l[MAX_OUT_LEN]@[DATASET_ABBR]-k[LIST_OF_K]-r[REPLICATION].py
```

With the configurations prepared, initiate the evaluation process with the commands below:

```bash
cd GPassK
conda activate livemathbench-eval
python opencompass/run.py {path/to/config_file} \
      -w ./opencompass_outputs/ \
      --dump-eval-details
```
Refer to the OpenCompass documentation for additional arguments that may enhance your evaluation experience.

## 🖋<span id="use_in_lighteval">Use G-Pass@k in Lighteval</span>

[Lighteval](https://github.com/huggingface/lighteval) is an all-in-one toolkit for evaluating LLMs across multiple backends (transformers, tgi, vllm, or nanotron) with ease.


### 1. Prepare Environment
Follow these steps to ensure your environment is ready:

```bash
# Clone the main repository
git clone https://github.com/open-compass/GPassK.git
cd GPassK/lighteval

# Create and activate a conda environment with specific Python and PyTorch versions
conda create -n lighteval-eval python=3.10 pytorch torchvision torchaudio pytorch-cuda -c nvidia -c pytorch -y
conda activate lighteval-eval

# Clone and install Lighteval
git clone https://github.com/huggingface/lighteval
cd lighteval
pip install -e .

# Install additional required packages
pip install opencompass vllm
```

### 2. Prepare Dataset
The LiveMathBench dataset is hosted on [Hugging Face](https://huggingface.co/datasets/opencompass/LiveMathBench). First, request and obtain access to the dataset via that link. Then, refer to [security-tokens](https://huggingface.co/docs/hub/security-tokens) to set up your HF token.


### 3. Deploy Judge Models
We use Qwen2.5-72B-Instruct as the judge model to assess the correctness of generated answers. We recommend deploying it as a service with tools such as [vllm](https://github.com/vllm-project/vllm) or [lmdeploy](https://github.com/InternLM/lmdeploy), so that multiple evaluation tasks can share one endpoint.

Below is an example of deploying the judge model with `lmdeploy`:
```bash
# --tp 4: at least 4 A100 or equivalent GPUs are required
lmdeploy serve api_server Qwen/Qwen2.5-72B-Instruct --server-port 8000 \
    --tp 4 \
    --cache-max-entry-count 0.9 \
    --log-level INFO
```
After setting up the judge model, define the URLs in `eval_urls` and `eval_model` within `lighteval/configs/eval_cfg.yaml`. Adjust other parameters, such as `k`, `n`, and `model_name_or_path`, according to your needs.

### 4. Evaluation

To begin the evaluation, run the following script:
```bash
cd lighteval
python lighteval_run.py
```


## 📄 Citation and Tech Report
If you use G-Pass@k in your research, please cite the following paper:
```
@article{liu2024your,
  title={Are Your LLMs Capable of Stable Reasoning?},
  author={Liu, Junnan and Liu, Hongwei and Xiao, Linchen and Wang, Ziyi and Liu, Kuikun and Gao, Songyang and Zhang, Wenwei and Zhang, Songyang and Chen, Kai},
  journal={arXiv preprint arXiv:2412.13147},
  year={2024}
}
```