{"id":17457778,"url":"https://github.com/kolenaIO/autoarena","last_synced_at":"2025-03-02T12:31:16.064Z","repository":{"id":256399946,"uuid":"849004809","full_name":"kolenaIO/autoarena","owner":"kolenaIO","description":"Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation","archived":false,"fork":false,"pushed_at":"2024-12-16T12:25:44.000Z","size":2643,"stargazers_count":102,"open_issues_count":4,"forks_count":8,"subscribers_count":7,"default_branch":"trunk","last_synced_at":"2025-02-25T14:59:39.451Z","etag":null,"topics":["ai","evaluation","hacktoberfest","llm","llm-evaluation","rag","testing"],"latest_commit_sha":null,"homepage":"https://www.kolena.com/autoarena/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kolenaIO.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-28T19:55:28.000Z","updated_at":"2025-02-01T19:02:33.000Z","dependencies_parsed_at":"2024-09-10T14:29:38.915Z","dependency_job_id":"2eccc182-f167-4a76-b4b3-87464793242d","html_url":"https://github.com/kolenaIO/autoarena","commit_stats":null,"previous_names":["kolenaio/autoarena"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kolenaIO%2Fautoarena","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kolenaIO%2Fautoarena/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kolenaIO%2Fautoarena/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/kolenaIO%2Fautoarena/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kolenaIO","download_url":"https://codeload.github.com/kolenaIO/autoarena/tar.gz/refs/heads/trunk","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241507745,"owners_count":19973880,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","evaluation","hacktoberfest","llm","llm-evaluation","rag","testing"],"created_at":"2024-10-18T03:01:39.272Z","updated_at":"2025-03-02T12:31:16.057Z","avatar_url":"https://github.com/kolenaIO.png","language":"TypeScript","funding_links":[],"categories":["TypeScript"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# AutoArena\n\n**Create leaderboards ranking LLM outputs against one another using automated judge evaluation**\n\n[![Apache-2.0 License](https://img.shields.io/pypi/l/autoarena?style=flat-square)](https://www.apache.org/licenses/LICENSE-2.0)\n[![CI](https://img.shields.io/github/actions/workflow/status/kolenaIO/autoarena/ci.yml?logo=github\u0026style=flat-square)](https://github.com/kolenaIO/autoarena/actions)\n[![Test Coverage](https://img.shields.io/codecov/c/github/kolenaIO/autoarena?logo=codecov\u0026style=flat-square\u0026logoColor=white)](https://app.codecov.io/gh/kolenaIO/autoarena)\n[![PyPI Version](https://img.shields.io/pypi/v/autoarena?logo=python\u0026logoColor=white\u0026style=flat-square)](https://pypi.python.org/pypi/autoarena)\n[![Supported Python 
Versions](https://img.shields.io/pypi/pyversions/autoarena.svg?style=flat-square)](https://pypi.org/project/autoarena)\n[![Slack](https://img.shields.io/badge/Slack-4A154B?logo=slack\u0026logoColor=white\u0026style=flat-square)](https://kolena-autoarena.slack.com)\n\n\u003c/div\u003e\n\n---\n\n- 🏆 Rank outputs from different LLMs, RAG setups, and prompts to find the best configuration of your system\n- ⚔️ Perform automated head-to-head evaluation using judges from OpenAI, Anthropic, Cohere, and more\n- 🤖 Define and run your own custom judges, connecting to internal services or implementing bespoke logic\n- 💻 Run application locally, getting full control over your environment and data\n\n[![AutoArena user interface](https://raw.githubusercontent.com/kolenaIO/autoarena/trunk/assets/autoarena.jpg)](https://www.youtube.com/watch?v=GMuQPwo-JdU)\n\n## 🤔 Why Head-to-Head Evaluation?\n\n- LLMs are better at judging responses head-to-head than they are in isolation\n  ([arXiv:2408.08688](https://www.arxiv.org/abs/2408.08688v3)) — leaderboard rankings computed using Elo scores from\n  many automated side-by-side comparisons should be more trustworthy than leaderboards using metrics computed on each\n  model's responses independently!\n- The [LMSYS Chatbot Arena](https://lmarena.ai/) has replaced benchmarks for many people as the trusted true leaderboard\n  for foundation model performance ([arXiv:2403.04132](https://arxiv.org/abs/2403.04132)). Why not apply this approach\n  to your own foundation model selection, RAG system setup, or prompt engineering efforts?\n- Using a \"jury\" of multiple smaller models from different model families like `gpt-4o-mini`, `command-r`, and\n  `claude-3-haiku` generally yields better accuracy than a single frontier judge like `gpt-4o` — while being faster and\n  _much_ cheaper to run. 
AutoArena is built around this technique, called PoLL: **P**anel **o**f **LL**M evaluators\n  ([arXiv:2404.18796](https://arxiv.org/abs/2404.18796)).\n- Automated side-by-side comparison of model outputs is one of the most prevalent evaluation practices\n  ([arXiv:2402.10524](https://arxiv.org/abs/2402.10524)) — AutoArena makes this process easier than ever to get up\n  and running.\n\n## 🔥 Getting Started\n\nInstall from [PyPI](https://pypi.org/project/autoarena/):\n\n```shell\npip install autoarena\n```\n\nRun as a module and visit [localhost:8899](http://localhost:8899/) in your browser:\n\n```shell\npython -m autoarena\n```\n\nWith the application running, getting started is simple:\n\n1. Create a project via the UI.\n1. Add responses from a model by selecting a CSV file with `prompt` and `response` columns.\n1. Configure an automated judge via the UI. Note that most judges require credentials, e.g. `X_API_KEY` in the\n   environment where you're running AutoArena.\n1. Add responses from a second model to kick off an automated judging task using the judges you configured in the\n   previous step to decide which of the models you've uploaded provided a better `response` to a given `prompt`.\n\nThat's it! After these steps you're fully set up for automated evaluation on AutoArena.\n\n### 📄 Formatting Your Data\n\nAutoArena requires two pieces of information to test a model: the input `prompt` and corresponding model `response`.\n\n- `prompt`: the inputs to your model. When uploading responses, any other models that have been run on the same prompts\n   are matched and evaluated using the automated judges you have configured.\n- `response`: the output from your model. Judges decide which of two models produced a better response, given the same\n   prompt.\n\n### 📂 Data Storage\n\nData is stored in `./data/\u003cproject\u003e.sqlite` files in the directory where you invoked AutoArena. 
See\n[`data/README.md`](./data/README.md) for more details on data storage in AutoArena.\n\n## 🦾 Development\n\nAutoArena uses [uv](https://github.com/astral-sh/uv) to manage dependencies. To set up this repository for development,\nrun:\n\n```shell\nuv venv \u0026\u0026 source .venv/bin/activate\nuv pip install --all-extras -r pyproject.toml\nuv tool run pre-commit install\nuv run python3 -m autoarena serve --dev\n```\n\nTo run AutoArena for development, you will need to run both the backend and frontend services:\n\n- Backend: `uv run python3 -m autoarena serve --dev` (the `--dev`/`-d` flag enables automatic service reloading when\n    source files change)\n- Frontend: see [`ui/README.md`](./ui/README.md)\n\nTo build a release tarball in the `./dist` directory:\n\n```shell\n./scripts/build.sh\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FkolenaIO%2Fautoarena","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FkolenaIO%2Fautoarena","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FkolenaIO%2Fautoarena/lists"}