https://github.com/simonpcouch/evalthat
testthat-style LLM evaluation for R
- Host: GitHub
- URL: https://github.com/simonpcouch/evalthat
- Owner: simonpcouch
- License: other
- Created: 2024-11-25T20:38:12.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-12-12T15:19:59.000Z (2 months ago)
- Last Synced: 2024-12-12T15:31:55.236Z (2 months ago)
- Language: R
- Homepage:
- Size: 4.88 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 18
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - simonpcouch/evalthat - testthat-style LLM evaluation for R (R)
README
---
output: github_document
---

```{r, include = FALSE}
should_eval <- getOption(".evalthat_eval_readme", default = FALSE)

knitr::opts_chunk$set(
eval = should_eval,
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# evalthat
[Lifecycle: experimental](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[CRAN status](https://CRAN.R-project.org/package=evalthat)

evalthat provides a testthat-style framework for LLM evaluation in R. If you can write unit tests, you can compare performance across various LLMs, improve your prompts using evidence, and quantify variability in model output.
## Installation
You can install the development version of evalthat like so:
``` r
# install.packages("pak")
pak::pak("simonpcouch/evalthat")
```

## Example
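The examples below drive models through elmer chat objects, which read API credentials from environment variables; for instance, `chat_claude()` looks for `ANTHROPIC_API_KEY` and `chat_openai()` for `OPENAI_API_KEY`. A minimal setup sketch, assuming you keep keys in `~/.Renviron`:

```r
# Add provider keys to ~/.Renviron so they're available in every session, e.g.:
# ANTHROPIC_API_KEY=...
# OPENAI_API_KEY=...
usethis::edit_r_environ()

# Confirm a key is visible before running the evals:
nzchar(Sys.getenv("ANTHROPIC_API_KEY"))
```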
evalthat code looks a lot like testthat code. Here's an example:
```r
chat <- getOption(
"chat",
default = list(elmer::chat_claude("claude-3-5-sonnet-latest", echo = FALSE))
)[[1]]

evaluating(model = str(chat))
test_that("model can make a basic histogram", {
input <- input(
"Write ggplot code to plot a histogram of the mpg variable in mtcars.
Return only the plotting code, no backticks and no exposition."
)
output <- output(chat$chat(input))
# check that output is syntactically valid R code
expect_r_code(output)
# match keywords to affirm intended functionality
expect_match(output, "ggplot(", fixed = TRUE)
expect_match(output, "aes(", fixed = TRUE)
expect_match(output, "geom_histogram(", fixed = TRUE)
# flag output for manual grading
target <- "ggplot(mtcars) + aes(x = mpg) + geom_histogram()"
grade_human(input, output, target)
# grade using an LLM---either instantaneously using the current model or
# flag for later grading with a different model
grade_model(input, output, target)
})
```

testthat users will notice a couple of changes:
* The `evaluating()` function is sort of like `context()`, and logs metadata about the experiment.
* The functions `input()` and `output()` flag "what went into the model?" and "what came out?"
* In addition to the regular `expect_*()` functions from testthat, the package supplies a number of new expectation functions that are helpful in evaluating R code contained in a character string (as it will be when output from elmer or its extensions). Those that begin with `expect_*()` are automated; those that begin with `grade_*()` are less so.

Running the above test file results in a persistent _result file_; think of it like a snapshot. evalthat supplies a number of helpers for working with result files, allowing you to compare performance across various models, iterate on prompts, quantify variability in output, and so on. On the full ggplot2 example file, we could run 5 passes evaluating several different models for revising ggplot2 code:
```{r}
library(elmer)

temp <- list(temperature = 1)
eval <- evaluate_across(
"tests/evalthat/test-ggplot2.R",
tibble(chat = c(
chat_openai(model = "gpt-4o", api_args = temp, echo = FALSE),
chat_openai(model = "gpt-4o-mini", api_args = temp, echo = FALSE),
chat_claude(model = "claude-3-5-sonnet-latest", echo = FALSE))
),
repeats = 5
)
```
Evaluation functions return a data frame with information on the evaluation results for further analysis:
```{r, eval = TRUE, include = FALSE}
if (!should_eval) {
eval <- qs::qread("inst/ex_ggplot2.rds")
eval <- tibble::as_tibble(eval)
} else {
qs::qsave(eval, "inst/ex_ggplot2.rds")
}
```

```{r, eval = TRUE}
eval
```

Visualizing this example output:
```{r, include = FALSE}
library(tidyverse)

plot <- eval %>%
ggplot() +
aes(x = pct, fill = model) +
geom_histogram(position = "identity", alpha = .5) +
xlim(c(0, 100)) +
labs(
fill = "Model",
x = "Score",
y = "Count",
title = "LLM performance in adjusting ggplot2 code",
caption =
"We need a harder eval, huh?"
) +
scale_fill_viridis_d(end = .7) +
  theme(legend.position = c(0.3, 0.75))

plot
ggsave("inst/ex_plot.png", width = 5.5, height = 4, plot)
```

```{r, eval = TRUE, echo = FALSE, fig.alt="A ggplot2 histogram, showing distributions of performance on the task of revising ggplot2 code for three different models: Claude 3.5 Sonnet, GPT-4o, and GPT-4o-mini. They all pass with flying colors, probably indicating the need for a harder eval."}
knitr::include_graphics("inst/ex_plot.png")
```
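Because the evaluation results are a plain data frame, standard dplyr summaries work as well. A quick sketch, assuming the `model` and `pct` columns used in the plotting code above:

```r
library(dplyr)

eval %>%
  group_by(model) %>%
  summarize(
    passes     = n(),
    mean_score = mean(pct),
    sd_score   = sd(pct)
  ) %>%
  arrange(desc(mean_score))
```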