https://github.com/hop-top/ben
https://github.com/hop-top/ben
bench benchmark benchmark-suite benchmarking
Last synced: about 1 month ago
JSON representation
- Host: GitHub
- URL: https://github.com/hop-top/ben
- Owner: hop-top
- Created: 2026-03-28T16:47:04.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-05-20T20:11:03.000Z (about 1 month ago)
- Last Synced: 2026-05-20T20:35:55.995Z (about 1 month ago)
- Topics: bench, benchmark, benchmark-suite, benchmarking
- Language: Go
- Homepage: https://hop.top/ben
- Size: 20.7 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: docs/contributing.md
- Agents: AGENTS.md
Awesome Lists containing this project
README
# ben
> [!WARNING]
> **🚧 Do Not Use — History Will Be Rewritten 🚧**
>
> This repo is undergoing major restructuring as we selectively
> open-source internal tools built at
> [Idea Crafters LLC](https://ideacrafters.com). Git history **will be
> force-pushed and rewritten** multiple times. Do not fork, clone, or
> depend on this repo in any capacity until we tag a stable release.
General-purpose benchmarking tool — answers "which approach is better, and by how much?"
for any measurable task: tools, implementations, deps, LLM calls, agents.
---
## Install
```
go install hop.top/ben/cmd/ben@latest
```
ben depends on `hop.top/kit kit/v0.4.0-alpha.3`, pinned in `go.mod`
with no local override. Local development against unreleased kit
revisions uses a `replace` directive in `go.mod` (commented-out
example near the bottom of the file):
```go
// replace hop.top/kit => ../kit
```
Uncomment, point at your kit checkout, and `go mod tidy`.
---
## Quick start
```sh
# Inline run: compare two CLI tools on a task
ben run --task "Find HTTP handlers" --candidates xray,grep --metric latency_ms,quality_score \
--scorer weighted:latency_ms=0.3,quality_score=0.7 --input repo=.
# Suite file: run a named, repeatable benchmark
ben run --suite .ben/suites/codebase-indexing.yaml
# Compare two historical runs
ben compare 01HX...abc 01HX...def
# List last 10 runs for a suite
ben list --suite codebase-indexing --last 10
# Show one run by id
ben show 01HX...abc
```
---
## Commands
| Command | Description |
|---------------------------------|-------------------------------------------------------|
| `ben run` | Run benchmark suite or inline task against candidates |
| `ben list` | List recent runs from local storage |
| `ben show ` | Show details of one run |
| `ben compare ` | Diff two run results side-by-side |
| `ben suite list` | List known suites (global + project-local) |
| `ben suite show ` | Show suite spec details |
| `ben registry push ` | Push a run to the shared registry |
| `ben registry pull` | Pull community baselines for a suite |
| `ben config path` / `paths` | Inspect ben config file precedence |
| `ben spec` | Emit machine-readable capability manifest |
---
## Adapters
| Adapter | How ben runs the candidate |
|------------|--------------------------------------------------------------------------|
| `cli` | Spawns a shell command; captures stdout/stderr, exit code, latency |
| `llm` | Calls an LLM via API; captures tokens, cost, output |
| `eva` | Wraps `eva run` as a ben candidate for standard eval suites |
| binary | Any `ben-adapter-*` binary on PATH; communicates via stdio JSON protocol |
---
## Metrics
| Metric | Source | Description |
|------------------|----------|-----------------------------------------------|
| `latency_ms` | built-in | Wall-clock execution time in milliseconds |
| `exit_code` | built-in | Process exit code (cli adapter) |
| `output_size` | built-in | Byte length of stdout output |
| `tokens` | llm | Total tokens consumed (prompt + completion) |
| `cost_usd` | llm | Estimated cost in USD |
| `quality_score` | plugin | 0–1 relevance score; requires llm_judge plugin|
---
## Scorers
| Scorer | Description |
|-------------------------------|------------------------------------------------------|
| `single:` | Rank by one metric; lowest wins for cost/latency |
| `weighted:=,...` | Weighted sum across metrics; highest score wins |
| `raw` | No ranking; emit raw metrics only; winner=null |
Examples:
```
--scorer single:latency_ms
--scorer weighted:latency_ms=0.3,cost_usd=0.2,quality_score=0.5
--scorer raw
```
---
## Spec file
```yaml
name: codebase-indexing
description: Compare xray vs grep for initial codebase orientation
version: 1
task:
prompt: "Find all HTTP handler functions in this repo"
input:
repo: ./testdata/sample-repo
candidates:
- name: xray
adapter: cli
cmd: "xray explore --search {{input.prompt}} --path {{input.repo}}"
- name: grep
adapter: cli
cmd: "grep -r 'func.*Handler' {{input.repo}}"
metrics:
- latency_ms
- quality_score
scorer:
strategy: weighted
weights:
latency_ms: 0.3
quality_score: 0.7
```
---
## Plugin protocol
Binary plugins are auto-discovered as `ben-adapter-` or `ben-reporter-` on PATH.
Ben communicates via newline-delimited JSON over stdio: it writes a request JSON object to the
plugin's stdin and reads the response from stdout. Adapter plugins receive
`{"action":"run","candidate":{...},"input":{...}}` and must respond with
`{"metrics":{...},"output":"..."}`. Reporter plugins receive `{"run":{...}}` and write
formatted output to stdout. Naming convention: use the adapter/reporter name as the suffix,
e.g. `ben-adapter-docker`, `ben-reporter-markdown`.
---
## Agent usage
ben is designed for programmatic use mid-task:
```sh
# Machine-readable output; all logs to stderr
ben run --suite my-suite --format json --quiet
# Parse winner directly
ben run ... --format json | jq .winner
```
- `--format json` — emits valid JSON to stdout; diagnostics to stderr only
- `--quiet` — suppresses stderr; clean for pipelines
- Exit `0` — successful run (candidate failures are in the result, not exit code)
- Exit `1` — ben error (bad config, missing adapter, etc.)
- `winner` field — primary decision signal for agents; `null` when scorer is `raw`
---
## Storage
Global (cross-project):
```
~/.local/share/ben/
runs/ # persisted run results
registry/ # local registry index + cache
suites/ # global suite specs
```
Project-local (detected automatically when `.ben/` exists in cwd):
```
.ben/
suites/ # project-scoped suite specs
runs/ # project-scoped run results
```
Ben prefers project-local storage when `.ben/` is present; falls back to global.
---
## Configuration
Ben loads config from three layers, highest precedence first:
| Layer | Path |
|---------|-----------------------------------|
| project | `./.ben/config.yaml` |
| user | `$XDG_CONFIG_HOME/ben/config.yaml`|
| system | `/etc/ben/config.yaml` |
Run `ben config paths --format json` to see the active chain. The
`-c ` flag overrides the discovery chain entirely (kit semantics
— `-c` wins over any previously discovered file).
The project-layer path is caller-context-aware via the `KIT_INVOKED_AS`
env var (exported by callers like tlc or hop before exec'ing ben):
| `KIT_INVOKED_AS` | Project config path |
|-------------------|-------------------------|
| (unset/standalone)| `./.ben/config.yaml` |
| `hop` | `./.hop/ben.yaml` |
| `tlc` | `./.tlc/ben.yaml` |
Only one project-layer entry wins per invocation (kit constraint).
---
## Release process
Release pipeline mirrors the `hop-top/.github` reusable workflows:
- `release-please.yml` watches `main`, opens a standing release PR
that bumps the version + assembles the changelog.
- Merging that PR cuts a `ben/v` tag.
- The tag push fires `publish.yml` (Go module mirror to
`hop-top/ben`) and `goreleaser-on-tag.yml` (cross-platform binaries
+ Homebrew tap + Scoop bucket entries) in parallel.
Prerelease channel is seeded at `0.2.0-alpha.0`. See
[`.github/RELEASE-BOOTSTRAP.md`](.github/RELEASE-BOOTSTRAP.md) for
the manual web-side steps (mirror-repo creation, GitHub App
installation, org secrets) required before the first cut.
---
## Troubleshooting
**`compile: version "go1.26.1" does not match go tool version "go1.26.2"`**
Cause: stale `GOROOT` exported from an earlier `mise` shell. Quick
workaround: `env -u GOROOT go test ./...`. Long-term fix: `mise use
go@` and respawn the shell.
---
## Contributing
See [docs/contributing.md](docs/contributing.md) for interfaces, how to add adapters/metrics/
scorers/reporters, and the PR checklist.