{"id":26123326,"url":"https://github.com/gair-nlp/scaleeval","last_synced_at":"2025-06-23T17:06:03.993Z","repository":{"id":220360904,"uuid":"749941580","full_name":"GAIR-NLP/scaleeval","owner":"GAIR-NLP","description":"Scalable Meta-Evaluation of LLMs as Evaluators","archived":false,"fork":false,"pushed_at":"2024-02-15T13:54:40.000Z","size":4082,"stargazers_count":42,"open_issues_count":1,"forks_count":3,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-27T05:34:29.277Z","etag":null,"topics":["evaluation-framework","generative-ai","llm","nlp"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GAIR-NLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-29T17:36:51.000Z","updated_at":"2025-03-07T06:53:29.000Z","dependencies_parsed_at":"2024-02-01T18:08:54.346Z","dependency_job_id":"9e326a9e-604d-4394-b798-591e6eec0707","html_url":"https://github.com/GAIR-NLP/scaleeval","commit_stats":null,"previous_names":["gair-nlp/scaleeval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2Fscaleeval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2Fscaleeval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2Fscaleeval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2Fscaleeval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GAI
R-NLP","download_url":"https://codeload.github.com/GAIR-NLP/scaleeval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248728696,"owners_count":21152277,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation-framework","generative-ai","llm","nlp"],"created_at":"2025-03-10T15:52:55.974Z","updated_at":"2025-04-13T14:34:40.462Z","avatar_url":"https://github.com/GAIR-NLP.png","language":"Python","readme":"# ScaleEval: Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate\nThis repository contains the source code and link to our [paper](https://arxiv.org/pdf/2401.16788.pdf).\n\nScaleEval is an agent-debate-assisted meta-evaluation framework that leverages the capabilities of multiple communicative LLM agents. This framework supports multi-round discussions to assist humans in discerning the most capable LLM-based evaluators. 
Users can supply their LLM submissions, criteria, and scenarios with our framework to conduct meta-evaluation.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"figs/scale_framework.png\" width=\"600\" height=\"400\" alt=\"ScaleEval Framework\"\u003e\n\u003c/p\u003e\n\n\n## Get Started\n\n```shell\npip install scaleeval\nexport OPENAI_API_KEY=XXXX.YYYY.ZZZ\nexport ANTHROPIC_API_KEY=XXXX.YYYY.ZZZ\n```\npython 3.9+ is required.\n\n## For Developers\n\n### Install as developer\n\n```bash\ngit clone git@github.com:GAIR-NLP/scaleeval.git\ncd scaleeval\npip install -e .\n\n# install pre-commit hooks\npip install pre-commit\npre-commit install\n```\n\n### Run formating\n```shell\n# this is necessary before you commit\ngit init\ngit add .\npre-commit run\n```\n\n## Run Meta-Evaluation\n\n```bash\ncd scaleeval\npython evaluation.py\n```\n\n### Sample Criteria \n* `criteria/metaeval_creativity`: Scoring of 1 to 5 for each LLM submission based on the creativity criteria, and decide which submission is better.\n* `criteria/metaeval_helpfulness`: Scoring of 1 to 5 for each LLM submission based on the helpfulness criteria, and decide which submission is better.\n* `criteria/metaeval_interpretability`: Scoring of 1 to 5 for each LLM submission based on the interpretability, and decide which submission is better.\n* `criteria/metaeval_reasoning`: Scoring of 1 to 5 for each LLM submission based on the reasoning criteria, and decide which submission is better.\n\n### Sample LLM Question Prompts and Submissions\n* `scaleeval/responses/gpt3.5/brainstorming.jsonl`: Responses generated from gpt3.5-turbo based on the prompts in the brainstorming scenario.\n* `scaleeval/responses/claude-instant/brainstorming.jsonl`: Responses generated from claude-instant based on the prompts in the brainstorming scenario.\n\n### Supply your own criteria and scenario!\n```bash\ncd criteria\n```\n\nAdd your own criteria descriptions to a new .yaml file! 
Follow the format below:\n```python\ncriteria:\n    [your defined criteria]:\n        \"1\": ...\n        \"2\": ...\n        \"3\": ...\n        \"4\": ...\n        \"5\": ...\n```\n\n### Supply your own LLM submissions!\nCollect your LLM submissions and save them in a .jsonl file with the following format:\n```python\n{\"prompt\": ..., \"category\": ..., \"response\": ...}\n```\n\nHere is one example:\n```python\n{\"prompt\": \"Can you tell me how to make chocolate chip cookies?\", \"category\": \"brainstorming\", \"response\": \"Certainly! Here's a simple recipe for making chocolate chip cookies:\\n\\nIngredients:\\n- 1 cup (2 sticks) unsalted butter, softened\\n- 1 cup granulated sugar\\n- 1 cup packed brown sugar\\n- 2 large eggs\\n- 1 teaspoon vanilla extract\\n- 3 cups all-purpose flour\\n- 1 teaspoon baking soda\\n- 1/2 teaspoon salt\\n- 2 cups chocolate chips\\n\\nInstructions:\\n1. Preheat your oven to 350\\u00b0F (175\\u00b0C) and line a baking sheet with parchment paper.\\n2. In a large mixing bowl, cream together the softened butter, granulated sugar, and brown sugar until well combined and creamy.\\n3. Add the eggs one at a time, beating well after each addition. Then, stir in the vanilla extract.\\n4. In a separate bowl, whisk together the flour, baking soda, and salt. Gradually add this dry mixture to the wet ingredients, mixing until just combined.\\n5. Fold in the chocolate chips until evenly distributed throughout the dough.\\n6. Using a cookie scoop or tablespoon, drop rounded portions of dough onto the prepared baking sheet, spacing them about 2 inches apart.\\n7. Bake in the preheated oven for 10-12 minutes, or until the edges are golden brown. The centers may still appear slightly soft, but they will firm up as they cool.\\n8. Remove the baking sheet from the oven and let the cookies cool on the sheet for a few minutes before transferring them to a wire rack to cool completely.\\n9. 
Repeat the process with the remaining dough until all the cookies are baked.\\n\\nEnjoy your homemade chocolate chip cookies!\"}\n```\n\n## Citation\nPlease consider citing our [paper](https://arxiv.org/pdf/2401.16788.pdf) if you find our repo and results useful!\n```\n@article{chern2024scaleeval,\n  title={Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate},\n  author={Chern, Steffi and Chern, Ethan and Neubig, Graham and Liu, Pengfei},\n  journal={arXiv preprint arXiv:2401.16788},\n  year={2024}\n}\n```\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgair-nlp%2Fscaleeval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgair-nlp%2Fscaleeval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgair-nlp%2Fscaleeval/lists"}
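## Appendix: Sketches for Custom Criteria and Submissions

To sanity-check a custom criteria file against the YAML format described above, a minimal sketch using PyYAML is shown below. The criterion name (`conciseness`), its level descriptions, and the `validate_criteria` helper are hypothetical illustrations, not part of ScaleEval's API:

```python
# Validate a criteria document against the documented shape:
# criteria -> {criterion_name -> {"1".."5": description}}.
# The criterion and descriptions are made-up placeholders.
import yaml

sample = """
criteria:
    conciseness:
        "1": "Rambling; mostly filler."
        "2": "Noticeably verbose."
        "3": "Somewhat concise."
        "4": "Concise with minor padding."
        "5": "Every sentence earns its place."
"""

def validate_criteria(text: str) -> dict:
    data = yaml.safe_load(text)
    assert "criteria" in data, "top-level 'criteria' key missing"
    for name, levels in data["criteria"].items():
        missing = {"1", "2", "3", "4", "5"} - set(levels)
        assert not missing, f"{name} is missing score levels: {missing}"
    return data

validate_criteria(sample)
```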
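The submission format can be produced and round-trip-checked with the standard library's `json` module. This is a sketch of serializing records to JSONL and validating the required keys; the record contents and helper names are hypothetical:

```python
# Write and validate submissions in the documented JSONL format:
# one JSON object per line with keys "prompt", "category", "response".
import json

REQUIRED_KEYS = {"prompt", "category", "response"}

records = [
    {"prompt": "Name three uses for a paperclip.",
     "category": "brainstorming",
     "response": "Bookmark, zipper pull, reset-pin tool."},
]

def to_jsonl(recs) -> str:
    # Refuse records that are missing any required key.
    for rec in recs:
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record missing keys: {missing}")
    return "\n".join(json.dumps(r) for r in recs)

def from_jsonl(text: str) -> list:
    return [json.loads(line) for line in text.splitlines() if line.strip()]

assert from_jsonl(to_jsonl(records)) == records
```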
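Because meta-evaluation asks the evaluator to decide which of two submissions is better (e.g. the gpt3.5 vs. claude-instant samples for the same scenario), responses from two submission files must be matched up by prompt. The sketch below illustrates one way to do that pairing; `pair_by_prompt` is a hypothetical helper, not ScaleEval's own pairing logic, which may differ:

```python
# Pair two submission lists by shared prompt so an evaluator can
# judge which response is better. Purely illustrative.
def pair_by_prompt(subs_a: list, subs_b: list) -> list:
    by_prompt = {s["prompt"]: s for s in subs_b}
    pairs = []
    for sa in subs_a:
        sb = by_prompt.get(sa["prompt"])
        if sb is not None:  # skip prompts missing from the second file
            pairs.append((sa["response"], sb["response"]))
    return pairs

a = [{"prompt": "p1", "category": "brainstorming", "response": "A1"}]
b = [{"prompt": "p1", "category": "brainstorming", "response": "B1"},
     {"prompt": "p2", "category": "brainstorming", "response": "B2"}]
assert pair_by_prompt(a, b) == [("A1", "B1")]
```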