https://github.com/simonpcouch/evalthat
testthat-style LLM evaluation for R
- Host: GitHub
- URL: https://github.com/simonpcouch/evalthat
- Owner: simonpcouch
- License: other
- Created: 2024-11-25T20:38:12.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-12-12T15:19:59.000Z (2 months ago)
- Last Synced: 2024-12-12T15:31:55.236Z (2 months ago)
- Language: R
- Homepage:
- Size: 4.88 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 18
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - simonpcouch/evalthat - testthat-style LLM evaluation for R (R)
README
---
output: github_document
---

```{r, include = FALSE}
should_eval <- getOption(".evalthat_eval_readme", default = FALSE)

knitr::opts_chunk$set(
eval = should_eval,
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# evalthat
[Lifecycle: experimental](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[CRAN status](https://CRAN.R-project.org/package=evalthat)

evalthat provides a testthat-style framework for LLM evaluation in R. If you can write unit tests, you can compare performance across various LLMs, improve your prompts using evidence, and quantify variability in model output.
## Installation
You can install the development version of evalthat like so:
``` r
# install.packages("pak")
pak::pak("simonpcouch/evalthat")
```

## Example
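The examples below drive models through elmer chat objects, which read API credentials from environment variables; for instance, `chat_claude()` looks for `ANTHROPIC_API_KEY` and `chat_openai()` for `OPENAI_API_KEY`. A minimal setup sketch, assuming you keep keys in `~/.Renviron`:

```r
# Add provider keys to ~/.Renviron so they're available in every session, e.g.:
# ANTHROPIC_API_KEY=...
# OPENAI_API_KEY=...
usethis::edit_r_environ()

# Confirm a key is visible before running the evals:
nzchar(Sys.getenv("ANTHROPIC_API_KEY"))
```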
evalthat code looks a lot like testthat code. Here's an example:
```r
chat <- getOption(
"chat",
default = list(elmer::chat_claude("claude-3-5-sonnet-latest", echo = FALSE))
)[[1]]

evaluating(model = str(chat))
test_that("model can make a basic histogram", {
input <- input(
"Write ggplot code to plot a histogram of the mpg variable in mtcars.
Return only the plotting code, no backticks and no exposition."
)
output <- output(chat$chat(input))
# check that output is syntactically valid R code
expect_r_code(output)
# match keywords to affirm intended functionality
expect_match(output, "ggplot(", fixed = TRUE)
expect_match(output, "aes(", fixed = TRUE)
expect_match(output, "geom_histogram(", fixed = TRUE)
# flag output for manual grading
target <- "ggplot(mtcars) + aes(x = mpg) + geom_histogram()"
grade_human(input, output, target)
# grade using an LLM---either instantaneously using the current model or
# flag for later grading with a different model
grade_model(input, output, target)
})
```

testthat users will notice a couple of changes:
* The `evaluating()` function is sort of like `context()`, and logs metadata about the experiment.
* The functions `input()` and `output()` flag "what went into the model?" and "what came out?"
* In addition to the regular `expect_*()` functions from testthat, the package supplies a number of new expectation functions that are helpful in evaluating R code contained in a character string (as it will be when output from elmer or its extensions). Those that begin with `expect_*()` are automated; those that begin with `grade_*()` are less so.

Running the above test file results in a persistent _result file_; think of it like a snapshot. evalthat supplies a number of helpers for working with result files, allowing you to compare performance across various models, iterate on prompts, quantify variability in output, and so on. On the full ggplot2 example file, we could run 5 passes evaluating several different models for revising ggplot2 code:
```{r}
library(elmer)

temp <- list(temperature = 1)
eval <- evaluate_across(
"tests/evalthat/test-ggplot2.R",
tibble(chat = c(
chat_openai(model = "gpt-4o", api_args = temp, echo = FALSE),
chat_openai(model = "gpt-4o-mini", api_args = temp, echo = FALSE),
chat_claude(model = "claude-3-5-sonnet-latest", echo = FALSE))
),
repeats = 5
)
```
Evaluation functions return a data frame with information on the evaluation results for further analysis:
```{r, eval = TRUE, include = FALSE}
if (!should_eval) {
eval <- qs::qread("inst/ex_ggplot2.rds")
eval <- tibble::as_tibble(eval)
} else {
qs::qsave(eval, "inst/ex_ggplot2.rds")
}
```

```{r, eval = TRUE}
eval
```

Visualizing this example output:
```{r, include = FALSE}
library(tidyverse)

plot <- eval %>%
ggplot() +
aes(x = pct, fill = model) +
geom_histogram(position = "identity", alpha = .5) +
xlim(c(0, 100)) +
labs(
fill = "Model",
x = "Score",
y = "Count",
title = "LLM performance in adjusting ggplot2 code",
caption =
"We need a harder eval, huh?"
) +
scale_fill_viridis_d(end = .7) +
  theme(legend.position = c(0.3, 0.75))

plot
ggsave("inst/ex_plot.png", width = 5.5, height = 4, plot)
```

```{r, eval = TRUE, echo = FALSE, fig.alt="A ggplot2 histogram, showing distributions of performance on the task of revising ggplot2 code for three different models: Claude 3.5 Sonnet, GPT-4o, and GPT-4o-mini. They all pass with flying colors, probably indicating the need for a harder eval."}
knitr::include_graphics("inst/ex_plot.png")
```
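Because the evaluation results are a plain data frame, standard dplyr summaries work as well. A quick sketch, assuming the `model` and `pct` columns used in the plotting code above:

```r
library(dplyr)

eval %>%
  group_by(model) %>%
  summarize(
    passes     = n(),
    mean_score = mean(pct),
    sd_score   = sd(pct)
  ) %>%
  arrange(desc(mean_score))
```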