{"id":13450956,"url":"https://github.com/openai/evals","last_synced_at":"2025-05-13T10:49:39.001Z","repository":{"id":142698019,"uuid":"592489166","full_name":"openai/evals","owner":"openai","description":"Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.","archived":false,"fork":false,"pushed_at":"2024-12-18T22:09:47.000Z","size":6656,"stargazers_count":16057,"open_issues_count":145,"forks_count":2697,"subscribers_count":268,"default_branch":"main","last_synced_at":"2025-05-05T20:21:32.742Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-23T20:51:04.000Z","updated_at":"2025-05-05T16:48:57.000Z","dependencies_parsed_at":"2023-04-24T12:02:24.649Z","dependency_job_id":"6b13775f-e835-4fee-bcb1-8d19b7759c72","html_url":"https://github.com/openai/evals","commit_stats":{"total_commits":669,"total_committers":447,"mean_commits":"1.4966442953020134","dds":0.9551569506726457,"last_synced_commit":"234bcde34b5951233681455faeb92baaaef97573"},"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fevals","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fevals/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fevals/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fevals/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openai","download_url":"https://codeload.github.com/openai/evals/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253928430,"owners_count":21985793,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T07:00:40.798Z","updated_at":"2025-05-13T10:49:38.952Z","avatar_url":"https://github.com/openai.png","language":"Python","funding_links":[],"categories":["Prompts \u0026 Datasets","Evaluation","📜 Regulatory Compliance Testing","Papers","Open-source LLM-backed app evaluation products","Python","Prompting libraries \u0026 tools","Uncategorized","9. Evaluation \u0026 Observability","Materials","Openai","Tools","Evaluation \u0026 Benchmarks","Large Language Models (LLMs)","A01_文本生成_文本对话","Evaluation \u0026 Benchmarking","LLM Applications","Evaluation and Monitoring","Repos","Community and Official Guidance Resources","others","Benchmarks and Evaluation","Tools \u0026 Platforms","Evaluation Frameworks","Tools \u0026 Frameworks","Related Concepts","🛠 Build","5 · Evaluation infrastructure (the eval stack: datasets, scorers, online/offline, tracing, CI)","LLM and Agent Observability","Frameworks","🔧 Data \u0026 Infrastructure","RLHF","LLM-as-Judge Evaluation","Evaluation, Observability \u0026 Safety","Agent Observability and Testing","🛠️ Developer Infrastructure","9. Evaluation, Benchmarks \u0026 Datasets","3）参考实现与开源工具（GitHub）","LLM Evaluation Framework","Evaluation Basics"],"sub_categories":["Frameworks \u0026 Tools","Papers/Methods",":earth_americas:Evaluation Organization","Uncategorized","GitHub repositories","LLM Evaluation","大语言对话模型及数据","2023","Community Frameworks and Guidance","Evaluation Frameworks","Open Source Frameworks","Prompt Testing \u0026 Optimization","Evaluation tooling (adjacent)","Evals","5a · Eval frameworks \u0026 harnesses (code-first test-runners)","Eval \u0026 Testing","Evaluators and Test Harnesses","AI Governance \u0026 Compliance","LLMOps vs MLOps","Evaluation \u0026 Observability","Benchmark Reality Check (real-world tool use)","Evaluation \u0026 Testing","评测框架与 Agent Benchmarks"],"readme":"# OpenAI Evals\n\n\u003e You can now configure and run Evals directly in the OpenAI Dashboard. [Get started →](https://platform.openai.com/docs/guides/evals)\n\nEvals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. We offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals which represent the common LLMs patterns in your workflow without exposing any of that data publicly.\n\nIf you are building with LLMs, creating high quality evals is one of the most impactful things you can do. Without evals, it can be very difficult and time intensive to understand how different model versions might affect your use case. In the words of [OpenAI's President Greg Brockman](https://twitter.com/gdb/status/1733553161884127435):\n\n\u003cimg width=\"596\" alt=\"https://x.com/gdb/status/1733553161884127435?s=20\" src=\"https://github.com/openai/evals/assets/35577566/ce7840ff-43a8-4d88-bb2f-6b207410333b\"\u003e\n\n## Setup\n\nTo run evals, you will need to set up and specify your [OpenAI API key](https://platform.openai.com/account/api-keys). After you obtain an API key, specify it using the [`OPENAI_API_KEY` environment variable](https://platform.openai.com/docs/quickstart/step-2-setup-your-api-key). Please be aware of the [costs](https://openai.com/pricing) associated with using the API when running evals. You can also run and create evals using [Weights \u0026 Biases](https://wandb.ai/wandb_fc/openai-evals/reports/OpenAI-Evals-Demo-Using-W-B-Prompts-to-Run-Evaluations--Vmlldzo0MTI4ODA3).\n\n**Minimum Required Version: Python 3.9**\n\n### Downloading evals\n\nOur evals registry is stored using [Git-LFS](https://git-lfs.com/). Once you have downloaded and installed LFS, you can fetch the evals (from within your local copy of the evals repo) with:\n```sh\ncd evals\ngit lfs fetch --all\ngit lfs pull\n```\n\nThis will populate all the pointer files under `evals/registry/data`.\n\nYou may just want to fetch data for a select eval. You can achieve this via:\n```sh\ngit lfs fetch --include=evals/registry/data/${your eval}\ngit lfs pull\n```\n\n### Making evals\n\nIf you are going to be creating evals, we suggest cloning this repo directly from GitHub and installing the requirements using the following command:\n\n```sh\npip install -e .\n```\n\nUsing `-e`, changes you make to your eval will be reflected immediately without having to reinstall.\n\nOptionally, you can install the formatters for pre-committing with:\n\n```sh\npip install -e .[formatters]\n```\n\nThen run `pre-commit install` to install pre-commit into your git hooks. pre-commit will now run on every commit.\n\nIf you want to manually run all pre-commit hooks on a repository, run `pre-commit run --all-files`. To run individual hooks use `pre-commit run \u003chook_id\u003e`.\n\n## Running evals\n\nIf you don't want to contribute new evals, but simply want to run them locally, you can install the evals package via pip:\n\n```sh\npip install evals\n```\n\nYou can find the full instructions to run existing evals in [`run-evals.md`](docs/run-evals.md) and our existing eval templates in [`eval-templates.md`](docs/eval-templates.md). For more advanced use cases like prompt chains or tool-using agents, you can use our [Completion Function Protocol](docs/completion-fns.md).\n\nWe provide the option for you to log your eval results to a Snowflake database, if you have one or wish to set one up. For this option, you will further have to specify the `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_DATABASE`, `SNOWFLAKE_USERNAME`, and `SNOWFLAKE_PASSWORD` environment variables.\n\n## Writing evals\n\nWe suggest getting starting by: \n\n- Walking through the process for building an eval: [`build-eval.md`](docs/build-eval.md)\n- Exploring an example of implementing custom eval logic: [`custom-eval.md`](docs/custom-eval.md)\n- Writing your own completion functions: [`completion-fns.md`](docs/completion-fns.md)\n- Review our starter guide for writing evals: [Getting Started with OpenAI Evals](https://cookbook.openai.com/examples/evaluation/getting_started_with_openai_evals)\n\nPlease note that we are currently not accepting evals with custom code! While we ask you to not submit such evals at the moment, you can still submit model-graded evals with custom model-graded YAML files.\n\nIf you think you have an interesting eval, please open a pull request with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.\n\n## FAQ\n\nDo you have any examples of how to build an eval from start to finish?\n\n- Yes! These are in the `examples` folder. We recommend that you also read through [`build-eval.md`](docs/build-eval.md) in order to gain a deeper understanding of what is happening in these examples.\n\nDo you have any examples of evals implemented in multiple different ways?\n\n- Yes! In particular, see `evals/registry/evals/coqa.yaml`. We have implemented small subsets of the [CoQA](https://stanfordnlp.github.io/coqa/) dataset for various eval templates to help illustrate the differences.\n\nWhen I run an eval, it sometimes hangs at the very end (after the final report). What's going on?\n\n- This is a known issue, but you should be able to interrupt it safely and the eval should finish immediately after.\n\nThere's a lot of code, and I just want to spin up a quick eval. Help? OR,\n\nI am a world-class prompt engineer. I choose not to code. How can I contribute my wisdom?\n\n- If you follow an existing [eval template](docs/eval-templates.md) to build a basic or model-graded eval, you don't need to write any evaluation code at all! Just provide your data in JSON format and specify your eval parameters in YAML. [build-eval.md](docs/build-eval.md) walks you through these steps, and you can supplement these instructions with the Jupyter notebooks in the `examples` folder to help you get started quickly. Keep in mind, though, that a good eval will inevitably require careful thought and rigorous experimentation!\n\n## Disclaimer\n\nBy contributing to evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenai%2Fevals","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenai%2Fevals","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenai%2Fevals/lists"}