https://github.com/ash-project/evals
Tools for evaluating models against Elixir code, helping us find what works and what doesn't
- Host: GitHub
- URL: https://github.com/ash-project/evals
- Owner: ash-project
- Created: 2025-06-23T23:42:08.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-08-21T01:43:24.000Z (3 months ago)
- Last Synced: 2025-08-22T19:50:31.935Z (3 months ago)
- Language: Elixir
- Size: 125 KB
- Stars: 37
- Watchers: 6
- Forks: 3
- Open Issues: 6
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-elixir-ai - Evals - Tool for evaluating AI language models on Elixir code generation with side-by-side model comparisons and automated testing. (Observability, Evaluation & Guardrails / How to Join)
- awesome-ml-gen-ai-elixir - Evals - Tool for evaluating AI language models on Elixir code generation with side-by-side model comparisons and automated testing. (Generative AI / Development Tools)
README
# Evals
An evaluation tool for testing and comparing AI language models on various coding tasks. It lets you run structured evaluations, compare model performance with and without usage rules, and generate detailed reports.
## Features
- **Multiple Model Support**: Evaluate and compare different language models side-by-side
- **Usage Rules Integration**: Test how well models follow specific package usage rules and guidelines
- **Code Generation & Validation**: Evaluate models on code writing tasks with automated assertion testing
- **Flexible Evaluation Options**: Control iterations, debug output, and evaluation scope
- **Rich Reporting**: Generate summary or detailed reports with performance breakdowns
- **YAML-Based Test Definitions**: Define evaluations in simple YAML files organized by category
## Roadmap
- For the `write_code_and_assert` type, more complex setup tasks where the LLM only needs to generate a subset of the response, not all of the code.
- Different types of evals, like `response_contains` and `response_doesnt_contain`, as well as `llm_judge`, where a separate judge LLM is asked whether a certain property holds for the output.
- The ability to experiment with different system prompts, e.g. does "you are an expert Elixir developer" matter?
- The ability to benchmark fully agentic flows, like multi-turn sessions with hex docs search, plan files, custom context, etc.
## Report
We only have a few evals here, but eventually this will be expensive for me to
operate, so it's not running in CI. I will run it when I feel it's worth
running again, e.g. when there are more evals. Others are encouraged to run this
locally with their own keys if they want to throw a few coins in the machine to
help out.
See the [reports folder](reports/) for more.
For example:
[reports/flagship](reports/flagship.md?plain=1)
## Quick Start
```elixir
# Define your models
models = [
  {"gpt-4", %LangChain.ChatModels.ChatOpenAI{model: "gpt-4"}},
  {"claude-3-sonnet", %LangChain.ChatModels.ChatAnthropic{model: "claude-3-sonnet-20240229"}}
]
# Run evaluations and get a report
{results, report} = Evals.report(models,
  usage_rules: :compare,
  title: "Model Comparison",
  format: "summary"
)
IO.puts(report)
```
## Common Model Comparisons
The `Evals.Common` module provides convenient functions for testing common model combinations:
### Flagship Models
Compare the latest flagship models from OpenAI and Anthropic:
```elixir
# Quick flagship comparison
report = Evals.Common.flagship(usage_rules: :compare, format: "summary")
IO.puts(report)
# Full detailed report
report = Evals.Common.flagship(usage_rules: :compare, format: :full)
IO.puts(report)
```
This compares:
- GPT-4.1
- GPT-4o
- Claude Sonnet 4
- Claude Sonnet 3.7
### GPT Models Only
Compare different GPT model variants:
```elixir
report = Evals.Common.gpt(usage_rules: :compare)
IO.puts(report)
```
This compares:
- GPT-4.1
- GPT-4o
All `Evals.Common` functions accept the same options as `Evals.report/2` and return the formatted report string directly.
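For example, a summary-only comparison with a custom title should look something like this (a minimal sketch; the option values are purely illustrative):

```elixir
# Evals.Common functions take the same options as Evals.report/2 (per the note
# above), so report-level options such as :title and :format apply here too.
report =
  Evals.Common.gpt(
    usage_rules: :compare,
    title: "GPT variants",
    format: :summary
  )

IO.puts(report)
```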
## Contributing Evaluations
We welcome contributions of new evaluation cases! Here's how to add your own:
### Creating a New Evaluation
1. **Choose a category** or create a new one in the `evals/` directory
2. **Create a YAML file** with a descriptive name (e.g., `async_genserver.yml`)
3. **Follow the evaluation format** shown below
### Evaluation Guidelines
- **Be specific**: Test one clear concept or skill per evaluation
- **Include context**: Provide enough background in the user message
- **Write clear assertions**: Make sure your test validates the intended behavior
- **Test edge cases**: Consider boundary conditions and common mistakes
- **Add realistic scenarios**: Use examples that mirror real-world usage
### Example Contribution
```yaml
# evals/genserver/async_operations.yml
type: write_code_and_assert
messages:
  - type: user
    text: |
      Write a function called `add` that adds two numbers. Return just the function, not wrapped in a module
eval:
  assert:
    # wrap the answer in a module
    wrap_in_module: true
    assertion: "<%= @module_name %>.add(2, 3) == 5"
```
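The `assertion` value looks like an EEx template, with `<%= @module_name %>` standing in for the module the framework wraps the model's answer in. Purely as an illustration of that idea (not the project's actual pipeline), the check amounts to something like:

```elixir
# Hypothetical illustration only: render the assertion template against a
# wrapper module, then evaluate the resulting expression.
defmodule Answer do
  def add(a, b), do: a + b
end

template = "<%= @module_name %>.add(2, 3) == 5"
expression = EEx.eval_string(template, assigns: [module_name: "Answer"])
{passed?, _bindings} = Code.eval_string(expression)
passed? # => true
```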
### Testing Your Evaluation
Before submitting, test your evaluation locally:
```elixir
# Test only your new evaluation
{results, report} = Evals.report(models, only: "evals/your_category/your_eval.yml")
IO.puts(report)
```
## Evaluation Structure
Evaluations are organized in the `evals/` directory by category:
```
evals/
├── basic_elixir/
│   ├── pattern_matching.yml
│   └── list_operations.yml
├── ash_framework/
│   ├── resource_definition.yml
│   └── changeset_usage.yml
└── phoenix/
    ├── controller_actions.yml
    └── live_view_basics.yml
```
Each YAML file defines a test case with:
- **Type**: Currently supports `write_code_and_assert`
- **Messages**: Conversation history leading to the code generation request
- **Code**: Optional existing code context
- **Install**: Package dependencies to install
- **Eval**: Assertion criteria for validating the generated code
### Example Evaluation File
```yaml
type: write_code_and_assert
install:
  - package: ash
    version: "~> 3.0"
messages:
  - type: user
    text: "Create a basic Ash resource for a User with name and email fields"
eval:
  assert:
    wrap_in_module: true
    assertion: |
      Code.ensure_loaded(<%= assigns.module_name %>)
      function_exported?(<%= assigns.module_name %>, :__resource__, 0)
```
## API Reference
### Core Functions
#### `Evals.evaluate(models, opts \\ [])`
Runs evaluations and returns raw results.
**Options:**
- `:iterations` - Number of runs per test (default: 1). Higher iterations will cause much longer evaluation times due to rate limits
- `:usage_rules` - `:compare`, `true`, or `false` (default: `false`)
- `:only` - Limit to specific file pattern
- `:debug` - Enable debug output
- `:system_prompt` - Override system prompt
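A minimal sketch combining several of these options (the `:only` path is just an example from the directory structure above):

```elixir
# Illustrative combination of the options above; `models` is the list of
# {name, chat_model} tuples from the Quick Start example.
results =
  Evals.evaluate(models,
    iterations: 2,
    usage_rules: :compare,
    only: "evals/basic_elixir/pattern_matching.yml",
    debug: true
  )
```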
#### `Evals.report(models, opts \\ [])`
Runs evaluations and returns formatted report.
**Additional Report Options:**
- `:title` - Custom report title
- `:format` - `:summary` or `:full` (default: `:full`)
### Usage Rules
When `:usage_rules` is enabled, the framework automatically:
1. Installs specified packages via `Mix.install`
2. Locates `usage-rules.md` files in package dependencies
3. Includes these rules in the system prompt
4. Compares model performance with and without rules (when `:compare`)
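Concretely, the three accepted values for `:usage_rules` would be used like this (a short sketch based on the option as documented above):

```elixir
# Always include usage-rules.md content from installed packages:
Evals.report(models, usage_rules: true)

# Never include it (the default):
Evals.report(models, usage_rules: false)

# Run with and without the rules and report both side by side:
Evals.report(models, usage_rules: :compare)
```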
### Example Results
```elixir
results = %{
{"gpt-4", "ash_framework", "resource_definition", true} => 0.85,
{"gpt-4", "ash_framework", "resource_definition", false} => 0.72,
{"claude-3-sonnet", "ash_framework", "resource_definition", true} => 0.78,
{"claude-3-sonnet", "ash_framework", "resource_definition", false} => 0.65
}
```
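Assuming, as in the example above, that keys are `{model, category, eval, usage_rules?}` tuples and values are scores between 0 and 1, the raw results can be aggregated directly, for example an average score per model:

```elixir
# Sketch: average score per model across all evals and usage-rule settings.
results
|> Enum.group_by(fn {{model, _category, _eval, _rules}, _score} -> model end)
|> Map.new(fn {model, entries} ->
  scores = Enum.map(entries, fn {_key, score} -> score end)
  {model, Enum.sum(scores) / length(scores)}
end)
# => %{"claude-3-sonnet" => 0.715, "gpt-4" => 0.785}
```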
## Report Formats
### Summary Format
Shows only model averages, optionally broken down by usage rules:
```
================================================================================
Model Performance Comparison
Iterations: 1
================================================================================
OVERALL SUMMARY:
----------------------------------------
With usage rules:
  gpt-4           | 85.2%
  claude-3-sonnet | 82.1%
Without usage rules:
  gpt-4           | 72.4%
  claude-3-sonnet | 69.8%
================================================================================
```
### Full Format
Includes detailed breakdown by category and individual tests.
## Setup
1. **Clone the repository:**
```bash
git clone https://github.com/ash-project/evals.git
cd evals
```
2. **Install dependencies:**
```bash
mix deps.get
```
3. **Set up your API keys:**
```bash
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
```
4. **Run evaluations:**
```bash
iex -S mix
```
Then in the IEx console:
```elixir
models = [
  {"gpt-4", %LangChain.ChatModels.ChatOpenAI{model: "gpt-4"}},
  {"claude-3-sonnet", %LangChain.ChatModels.ChatAnthropic{model: "claude-3-sonnet-20240229"}}
]
{results, report} = Evals.report(models, usage_rules: :compare)
IO.puts(report)
```