{"id":11031554,"url":"https://github.com/braintrustdata/autoevals","last_synced_at":"2025-08-28T08:06:40.711Z","repository":{"id":180330877,"uuid":"664963846","full_name":"braintrustdata/autoevals","owner":"braintrustdata","description":"AutoEvals is a tool for quickly and easily evaluating AI model outputs using best practices.","archived":false,"fork":false,"pushed_at":"2025-08-19T23:12:13.000Z","size":942,"stargazers_count":585,"open_issues_count":15,"forks_count":38,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-08-20T01:12:20.387Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/braintrustdata.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-07-11T06:25:52.000Z","updated_at":"2025-08-19T22:08:53.000Z","dependencies_parsed_at":"2023-12-06T20:45:07.562Z","dependency_job_id":"08d884fe-7bb7-4835-b598-01b851d0da1f","html_url":"https://github.com/braintrustdata/autoevals","commit_stats":null,"previous_names":["braintrustdata/autoevals"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/braintrustdata/autoevals","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/braintrustdata%2Fautoevals","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/braintrustdata%2Fautoevals/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/braintrustdata%2Fautoevals/releases","manifests_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/braintrustdata%2Fautoevals/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/braintrustdata","download_url":"https://codeload.github.com/braintrustdata/autoevals/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/braintrustdata%2Fautoevals/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272467011,"owners_count":24939492,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-28T02:00:10.768Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-06-11T17:33:55.648Z","updated_at":"2025-08-28T08:06:40.683Z","avatar_url":"https://github.com/braintrustdata.png","language":"Python","readme":"# Autoevals\n\nAutoevals is a tool to quickly and easily evaluate AI model outputs.\n\nIt bundles together a variety of automatic evaluation methods including:\n\n- LLM-as-a-judge\n- Heuristic (e.g. Levenshtein distance)\n- Statistical (e.g. BLEU)\n\nAutoevals is developed by the team at [Braintrust](https://braintrust.dev/).\n\nAutoevals uses model-graded evaluation for a variety of subjective tasks including fact checking,\nsafety, and more. 
Many of these evaluations are adapted from OpenAI's excellent [evals](https://github.com/openai/evals)\nproject but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug\ntheir outputs.\n\nYou can also create your own model-graded evaluations with Autoevals. It's easy to add custom prompts, parse outputs,\nand manage exceptions.\n\n\u003cdiv className=\"hidden\"\u003e\n\n### Requirements\n\n- Python 3.9 or higher\n- Compatible with both OpenAI Python SDK v0.x and v1.x\n\n\u003c/div\u003e\n\n## Installation\n\n\u003cdiv className=\"tabs\"\u003e\n\n### TypeScript\n\n```bash\nnpm install autoevals\n```\n\n### Python\n\n```bash\npip install autoevals\n```\n\n\u003c/div\u003e\n\n## Getting started\n\nUse Autoevals to model-grade an example LLM completion using the [Factuality prompt](templates/factuality.yaml).\nBy default, Autoevals uses your `OPENAI_API_KEY` environment variable to authenticate with OpenAI's API.\n\n\u003cdiv className=\"tabs\"\u003e\n\n### Python\n\n```python\nfrom autoevals.llm import *\nimport asyncio\n\n# Create a new LLM-based evaluator\nevaluator = Factuality()\n\n# Synchronous evaluation\ninput = \"Which country has the highest population?\"\noutput = \"People's Republic of China\"\nexpected = \"China\"\n\n# Using the synchronous API\nresult = evaluator(output, expected, input=input)\nprint(f\"Factuality score (sync): {result.score}\")\nprint(f\"Factuality metadata (sync): {result.metadata['rationale']}\")\n\n# Using the asynchronous API\nasync def main():\n    result = await evaluator.eval_async(output, expected, input=input)\n    print(f\"Factuality score (async): {result.score}\")\n    print(f\"Factuality metadata (async): {result.metadata['rationale']}\")\n\n# Run the async example\nasyncio.run(main())\n```\n\n### TypeScript\n\n```typescript\nimport { Factuality } from \"autoevals\";\n\n(async () =\u003e {\n  const input = \"Which country has the highest population?\";\n  const output = \"People's 
Republic of China\";\n  const expected = \"China\";\n\n  const result = await Factuality({ output, expected, input });\n  console.log(`Factuality score: ${result.score}`);\n  console.log(`Factuality metadata: ${result.metadata?.rationale}`);\n})();\n```\n\n\u003c/div\u003e\n\n## Using other AI providers\n\nWhen you use Autoevals, it will look for an `OPENAI_BASE_URL` environment variable to use as the base for requests to an OpenAI compatible API. If `OPENAI_BASE_URL` is not set, it will default to the [AI proxy](https://www.braintrust.dev/docs/guides/proxy).\n\nIf you choose to use the proxy, you'll also get:\n\n- Simplified access to many AI providers\n- Reduced costs with automatic request caching\n- Increased observability when you enable logging to Braintrust\n\nThe proxy is free to use, even if you don't have a Braintrust account.\n\nIf you have a Braintrust account, you can optionally set the `BRAINTRUST_API_KEY` environment variable instead of `OPENAI_API_KEY` to unlock additional features like logging and monitoring. 
You can also route requests to [supported AI providers and models](https://www.braintrust.dev/docs/guides/proxy#supported-models) or custom models you have configured in Braintrust.\n\n\u003cdiv className=\"tabs\"\u003e\n\n### Python\n\n```python\n# NOTE: ensure BRAINTRUST_API_KEY is set in your environment and OPENAI_API_KEY is not set\nfrom autoevals.llm import *\n\n# Create an LLM-based evaluator using the Claude 3.5 Sonnet model from Anthropic\nevaluator = Factuality(model=\"claude-3-5-sonnet-latest\")\n\n# Evaluate an example LLM completion\ninput = \"Which country has the highest population?\"\noutput = \"People's Republic of China\"\nexpected = \"China\"\n\nresult = evaluator(output, expected, input=input)\n\n# The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator\nprint(f\"Factuality score: {result.score}\")\nprint(f\"Factuality metadata: {result.metadata['rationale']}\")\n```\n\n### TypeScript\n\n```typescript\n// NOTE: ensure BRAINTRUST_API_KEY is set in your environment and OPENAI_API_KEY is not set\nimport { Factuality } from \"autoevals\";\n\n(async () =\u003e {\n  const input = \"Which country has the highest population?\";\n  const output = \"People's Republic of China\";\n  const expected = \"China\";\n\n  // Run an LLM-based evaluator using the Claude 3.5 Sonnet model from Anthropic\n  const result = await Factuality({\n    model: \"claude-3-5-sonnet-latest\",\n    output,\n    expected,\n    input,\n  });\n\n  // The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator\n  console.log(`Factuality score: ${result.score}`);\n  console.log(`Factuality metadata: ${result.metadata?.rationale}`);\n})();\n```\n\n\u003c/div\u003e\n\n## Custom client configuration\n\nThere are two ways you can configure a custom client when you need to use a different OpenAI compatible API:\n\n1. **Global configuration**: Initialize a client that will be used by all evaluators\n2. 
**Instance configuration**: Configure a client for a specific evaluator\n\n### Global configuration\n\nSet up a client that all your evaluators will use:\n\n\u003cdiv className=\"tabs\"\u003e\n\n#### Python\n\n```python\nimport openai\nimport asyncio\nfrom autoevals import init\nfrom autoevals.llm import Factuality\n\nclient = init(openai.AsyncOpenAI(base_url=\"https://api.openai.com/v1/\"))\n\nasync def main():\n    evaluator = Factuality()\n    result = await evaluator.eval_async(\n        input=\"What is the speed of light in a vacuum?\",\n        output=\"The speed of light in a vacuum is 299,792,458 meters per second.\",\n        expected=\"The speed of light in a vacuum is approximately 300,000 kilometers per second.\"\n    )\n    print(f\"Factuality score: {result.score}\")\n\nasyncio.run(main())\n```\n\n#### TypeScript\n\n```typescript\nimport OpenAI from \"openai\";\nimport { init, Factuality } from \"autoevals\";\n\nconst client = new OpenAI({\n  baseURL: \"https://api.openai.com/v1/\",\n});\n\ninit({ client });\n\n(async () =\u003e {\n  const result = await Factuality({\n    input: \"What is the speed of light in a vacuum?\",\n    output: \"The speed of light in a vacuum is 299,792,458 meters per second.\",\n    expected:\n      \"The speed of light in a vacuum is approximately 300,000 kilometers per second (or precisely 299,792,458 meters per second).\",\n  });\n\n  console.log(\"Factuality Score:\", result);\n})();\n```\n\n\u003c/div\u003e\n\n### Instance configuration\n\nConfigure a client for a specific evaluator instance:\n\n\u003cdiv className=\"tabs\"\u003e\n\n#### Python\n\n```python\nimport openai\nfrom autoevals.llm import Factuality\n\ncustom_client = openai.OpenAI(base_url=\"https://custom-api.example.com/v1/\")\nevaluator = Factuality(client=custom_client)\n```\n\n#### TypeScript\n\n```typescript\nimport OpenAI from \"openai\";\nimport { Factuality } from \"autoevals\";\n\n(async () =\u003e {\n  const customClient = new OpenAI({\n    
baseURL: \"https://custom-api.example.com/v1/\",\n  });\n\n  const result = await Factuality({\n    client: customClient,\n    output: \"Paris is the capital of France\",\n    expected:\n      \"Paris is the capital of France and has a population of over 2 million\",\n    input: \"Tell me about Paris\",\n  });\n  console.log(result);\n})();\n```\n\n\u003c/div\u003e\n\n## Using Braintrust with Autoevals (optional)\n\nOnce you grade an output using Autoevals, you can optionally use [Braintrust](https://www.braintrust.dev/docs/libs/python) to log and compare your evaluation results. This integration is completely optional and not required for using Autoevals.\n\n\u003cdiv className=\"tabs\"\u003e\n\n### TypeScript\n\nCreate a file named `example.eval.js` (it must take the form `*.eval.[ts|tsx|js|jsx]`):\n\n```typescript\nimport { Eval } from \"braintrust\";\nimport { Factuality } from \"autoevals\";\n\nEval(\"Autoevals\", {\n  data: () =\u003e [\n    {\n      input: \"Which country has the highest population?\",\n      expected: \"China\",\n    },\n  ],\n  task: () =\u003e \"People's Republic of China\",\n  scores: [Factuality],\n});\n```\n\nThen, run\n\n```bash\nnpx braintrust run example.eval.js\n```\n\n### Python\n\nCreate a file named `eval_example.py` (it must take the form `eval_*.py`):\n\n```python\nfrom braintrust import Eval\nfrom autoevals.llm import Factuality\n\nEval(\n    \"Autoevals\",\n    data=lambda: [\n        dict(\n            input=\"Which country has the highest population?\",\n            expected=\"China\",\n        ),\n    ],\n    task=lambda *args: \"People's Republic of China\",\n    scores=[Factuality],\n)\n```\n\n\u003c/div\u003e\n\n## Supported evaluation methods\n\n### LLM-as-a-judge evaluations\n\n- Battle\n- Closed QA\n- Humor\n- Factuality\n- Moderation\n- Security\n- Summarization\n- SQL\n- Translation\n- Fine-tuned binary classifiers\n\n### RAG evaluations\n\n- Context precision\n- Context relevancy\n- Context recall\n- Context entity 
recall\n- Faithfulness\n- Answer relevancy\n- Answer similarity\n- Answer correctness\n\n### Composite evaluations\n\n- Semantic list contains\n- JSON validity\n\n### Embedding evaluations\n\n- Embedding similarity\n\n### Heuristic evaluations\n\n- Levenshtein distance\n- Exact match\n- Numeric difference\n- JSON diff\n\n## Custom evaluation prompts\n\nAutoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:\n\n\u003cdiv className=\"tabs\"\u003e\n\n### Python\n\n```python\nfrom autoevals import LLMClassifier\n\n# Define a prompt prefix for an LLMClassifier (returns just one answer)\nprompt_prefix = \"\"\"\nYou are a technical project manager who helps software engineers generate better titles for their GitHub issues.\nYou will look at the issue description, and pick which of two titles better describes it.\n\nI'm going to provide you with the issue description, and two possible titles.\n\nIssue Description: {{input}}\n\n1: {{output}}\n2: {{expected}}\n\"\"\"\n\n# Define the scoring mechanism\n# 1 if the generated answer is better than the expected answer\n# 0 otherwise\noutput_scores = {\"1\": 1, \"2\": 0}\n\nevaluator = LLMClassifier(\n    name=\"TitleQuality\",\n    prompt_template=prompt_prefix,\n    choice_scores=output_scores,\n    use_cot=True,\n)\n\n# Evaluate an example LLM completion\npage_content = \"\"\"\nAs suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,\nWe can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?\nNicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification\"\"\"\noutput = \"Standardize error responses from GoTrue, 
Postgres, and Realtime APIs for better DX\"\nexpected = \"Standardize Error Responses across APIs\"\n\nresponse = evaluator(output, expected, input=page_content)\n\nprint(f\"Score: {response.score}\")\nprint(f\"Metadata: {response.metadata}\")\n```\n\n### TypeScript\n\n```typescript\nimport { LLMClassifierFromTemplate } from \"autoevals\";\n\n(async () =\u003e {\n  const promptTemplate = `You are a technical project manager who helps software engineers generate better titles for their GitHub issues.\nYou will look at the issue description, and pick which of two titles better describes it.\n\nI'm going to provide you with the issue description, and two possible titles.\n\nIssue Description: {{input}}\n\n1: {{output}}\n2: {{expected}}`;\n\n  const choiceScores = { 1: 1, 2: 0 };\n\n  const evaluator = LLMClassifierFromTemplate\u003c{ input: string }\u003e({\n    name: \"TitleQuality\",\n    promptTemplate,\n    choiceScores,\n    useCoT: true,\n  });\n\n  const input = `As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,\nWe can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?\nNicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification`;\n  const output = `Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX`;\n  const expected = `Standardize Error Responses across APIs`;\n\n  const response = await evaluator({ input, output, expected });\n\n  console.log(\"Score\", response.score);\n  console.log(\"Metadata\", response.metadata);\n})();\n```\n\n\u003c/div\u003e\n\n## Creating custom scorers\n\nYou can also create your own scoring functions that do not use LLMs. 
For example, to test whether the word `'banana'`\nis in the output, you can use the following:\n\n\u003cdiv className=\"tabs\"\u003e\n\n### Python\n\n```python\nfrom autoevals import Score\n\ndef banana_scorer(output, expected, input):\n    return Score(name=\"banana_scorer\", score=1 if \"banana\" in output else 0)\n\ninput = \"What is 1 banana + 2 bananas?\"\noutput = \"3\"\nexpected = \"3 bananas\"\n\nresult = banana_scorer(output, expected, input)\n\nprint(f\"Banana score: {result.score}\")\n```\n\n### TypeScript\n\n```typescript\nimport { Score } from \"autoevals\";\n\nconst bananaScorer = ({\n  output,\n  expected,\n  input,\n}: {\n  output: string;\n  expected: string;\n  input: string;\n}): Score =\u003e {\n  return { name: \"banana_scorer\", score: output.includes(\"banana\") ? 1 : 0 };\n};\n\n(async () =\u003e {\n  const input = \"What is 1 banana + 2 bananas?\";\n  const output = \"3\";\n  const expected = \"3 bananas\";\n\n  const result = bananaScorer({ output, expected, input });\n  console.log(`Banana score: ${result.score}`);\n})();\n```\n\n\u003c/div\u003e\n\n## Why does this library exist?\n\nThere is nothing particularly novel about the evaluation methods in this library. They are all well-known and well-documented. However, there are a few things that are particularly difficult when evaluating in practice:\n\n- Normalizing metrics between 0 and 1 is tough. For example, check out the calculation in [number.py](/py/autoevals/number.py) to see how it's done for numeric differences.\n- Parsing the outputs on model-graded evaluations is also challenging. There are frameworks that do this, but it's hard to\n  debug one output at a time, propagate errors, and tweak the prompts. Autoevals makes these tasks easy.\n- Collecting metrics behind a uniform interface makes it easy to swap out evaluation methods and compare them. 
Prior to Autoevals, we couldn't find an open source library where you can simply pass in `input`, `output`, and `expected` values through a bunch of different evaluation methods.\n\n\u003cdiv className=\"hidden\"\u003e\n\n## Documentation\n\nThe full docs are available [for your reference](https://www.braintrust.dev/docs/reference/autoevals).\n\n## Contributing\n\nWe welcome contributions!\n\nTo install the development dependencies, run `make develop`, and run `source env.sh` to activate the environment. Make a `.env` file from the `.env.example` file and set the environment variables. Run `direnv allow` to load the environment variables.\n\nTo run the tests, run `pytest` from the root directory.\n\nSend a PR and we'll review it! We'll take care of versioning and releasing.\n\n\u003c/div\u003e\n","funding_links":[],"categories":["Python","Tools","📊 Evaluation \u0026 Benchmarking"],"sub_categories":["LLM-as-Judge Evaluation"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbraintrustdata%2Fautoevals","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbraintrustdata%2Fautoevals","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbraintrustdata%2Fautoevals/lists"}