https://github.com/greynewell/evaldriven.org
Ship evals before you ship features.
- Host: GitHub
- URL: https://github.com/greynewell/evaldriven.org
- Owner: greynewell
- License: cc0-1.0
- Created: 2026-02-15T18:01:37.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-02-17T21:32:20.000Z (about 1 month ago)
- Last Synced: 2026-02-20T14:48:06.995Z (about 1 month ago)
- Topics: ai-engineering, ai-evaluation, ai-quality, ai-safety, ai-testing, automation, benchmarking, best-practices, ci-cd, continuous-evaluation, devops, eval-driven-development, evaluation, llm-evaluation, machine-learning, manifesto, methodology, quality-assurance, software-engineering, testing
- Language: Nunjucks
- Homepage: https://evaldriven.org
- Size: 63.5 KB
- Stars: 12
- Watchers: 3
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Eval-Driven Development
What we can build matters less than what we can prove.
AI writes code. The engineer defines "working," measures it, enforces it. **Eval-Driven Development**: every probabilistic system starts with a correctness spec. Nothing ships without automated proof it passes.
## Principles
### 1. Evaluation is the product
Build evals first. Code is generated. Evals are engineered.
### 2. Define correctness before you write a prompt
Can't express "correct" as a deterministic function? Not ready to build. Every task needs an eval. Every eval needs a threshold. Every threshold needs a justification.
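A minimal sketch of such a spec, assuming a hypothetical date-extraction task; the names and the threshold are illustrative, not prescriptive:

```python
import re

# Hypothetical task: extract an ISO-8601 date (YYYY-MM-DD) from free text.
# "Correct" is a deterministic function of (output, expected), not a vibe.
def is_correct(output: str, expected: str) -> bool:
    match = re.search(r"\d{4}-\d{2}-\d{2}", output)
    return match is not None and match.group(0) == expected

# Every eval needs a threshold; every threshold needs a justification.
THRESHOLD = 0.95  # illustrative justification: downstream parsing breaks below roughly 95% accuracy

def run_eval(cases: list[tuple[str, str]], model) -> float:
    passed = sum(is_correct(model(inp), exp) for inp, exp in cases)
    return passed / len(cases)
```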
### 3. Probabilistic systems require statistical proof
One passing test proves nothing about a stochastic system. Sample sizes, confidence intervals, regression baselines. Distributions, not anecdotes.
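One way to put numbers on that, using a Wilson score interval over the observed pass rate; the counts and baseline below are placeholders:

```python
import math

def wilson_interval(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a pass rate: a distribution, not an anecdote."""
    p = passed / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

# Placeholder numbers: 180/200 passes is not "90%", it is roughly (85%, 93%) at 95% confidence.
low, high = wilson_interval(passed=180, n=200)

# Regression gate: the interval's lower bound must not fall below the recorded baseline.
BASELINE = 0.85  # illustrative baseline from the previous release's eval run
assert low >= BASELINE, f"lower bound {low:.3f} regressed below baseline {BASELINE}"
```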
### 4. Evals run in CI
Evals that don't run on every change don't exist. Next to lint, type-check, build.
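A sketch of what that gate can look like as a CI step, assuming a hypothetical `evals/cases.json` dataset; the harness call is a stub to be wired to the real system:

```python
#!/usr/bin/env python3
"""eval_gate.py: fails the build when the eval fails, exactly like lint or type-check."""
import json
import sys
from pathlib import Path

THRESHOLD = 0.95  # illustrative; in practice versioned alongside the eval definition

def run_case(case: dict) -> bool:
    """Placeholder harness: call the system under test and score its output."""
    # e.g. return is_correct(model(case["input"]), case["expected"])
    raise NotImplementedError("wire in the real model call and scorer here")

def main() -> int:
    cases = json.loads(Path("evals/cases.json").read_text())  # hypothetical dataset path
    results = [run_case(case) for case in cases]
    rate = sum(results) / len(results)
    print(f"eval pass rate: {rate:.3f} (threshold {THRESHOLD})")
    return 0 if rate >= THRESHOLD else 1  # a non-zero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(main())
```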
### 5. Evaluation drives architecture
Can't independently evaluate a component? Can't independently trust it. Design for measurability.
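One reading of "design for measurability", sketched as a pipeline whose stages pass plain data so each stage can carry its own dataset, metric, and threshold; the stage names are invented:

```python
from dataclasses import dataclass

@dataclass
class Retrieved:
    query: str
    passages: list[str]

@dataclass
class Answer:
    text: str
    citations: list[int]  # indices into Retrieved.passages

# Each stage is a separately callable boundary, so each stage gets its own eval.
def retrieve(query: str) -> Retrieved: ...   # eval: recall@k against labeled passages
def answer(ctx: Retrieved) -> Answer: ...    # eval: correctness given gold passages
def cite(ans: Answer) -> Answer: ...         # eval: citation precision

# The end-to-end eval still exists; it just no longer has to explain which stage failed.
```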
### 6. Cost is a metric
Token spend, latency, compute. Correct but unaffordable is a failed eval.
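A sketch of cost as a first-class eval result; the budgets are placeholders, not recommendations:

```python
import math
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    tokens: int        # prompt plus completion tokens for this case
    latency_ms: float  # wall-clock time for this case

# Illustrative budgets: correct but unaffordable is a failed eval.
MAX_MEAN_TOKENS = 2_000
MAX_P95_LATENCY_MS = 1_500.0

def cost_gate(results: list[EvalResult]) -> bool:
    mean_tokens = sum(r.tokens for r in results) / len(results)
    p95_index = max(0, math.ceil(0.95 * len(results)) - 1)  # nearest-rank 95th percentile
    p95_latency = sorted(r.latency_ms for r in results)[p95_index]
    return mean_tokens <= MAX_MEAN_TOKENS and p95_latency <= MAX_P95_LATENCY_MS
```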
### 7. Human judgment doesn't scale
Every manual review is a missing eval. Extract judgment into a rubric, automate it, evaluate the evaluator.
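A sketch of extracting that judgment into a rubric and then checking the automated judge against human labels; the rubric items and type names are illustrative, and the judge itself is whatever you trust (an LLM call, a heuristic, a classifier):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class RubricItem:
    name: str
    question: str  # a yes/no question an automated judge can answer about one output

RUBRIC = [
    RubricItem("grounded", "Does every claim cite a provided passage?"),
    RubricItem("complete", "Does the answer address every part of the question?"),
]

Judge = Callable[[RubricItem, str], bool]  # (rubric item, output) -> verdict

def score(output: str, judge: Judge) -> float:
    return sum(judge(item, output) for item in RUBRIC) / len(RUBRIC)

def evaluate_the_evaluator(judge: Judge, labeled: list[tuple[RubricItem, str, bool]]) -> float:
    """Agreement between the automated judge and human verdicts on a labeled sample."""
    agree = sum(judge(item, out) == human for item, out, human in labeled)
    return agree / len(labeled)
```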
### 8. Ship the eval, not the demo
Demos prove something works once. Evals prove it works under distribution shift.
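One concrete reading of that, sketched below: report the pass rate per slice of the input distribution instead of one aggregate number; the slice key is an assumed field on each case:

```python
from collections import defaultdict

def pass_rate_by_slice(cases: list[dict], outcomes: list[bool]) -> dict[str, float]:
    """Pass rate per slice (e.g. language, input length bucket, customer segment)."""
    grouped: dict[str, list[bool]] = defaultdict(list)
    for case, ok in zip(cases, outcomes):
        grouped[case.get("slice", "unknown")].append(ok)
    return {name: sum(oks) / len(oks) for name, oks in grouped.items()}

# A demo is one point from one slice; the eval asks whether every slice clears the bar.
def all_slices_pass(rates: dict[str, float], threshold: float = 0.9) -> bool:
    return all(rate >= threshold for rate in rates.values())
```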
### 9. Version your evals
Definitions, datasets, thresholds, results. Version control. Changelogs. Document why.
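A sketch of what one versioned record can carry, mirroring the list above plus the "why"; every value shown is a placeholder:

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EvalVersion:
    eval_name: str
    eval_version: str          # bump when the definition or scorer changes
    dataset_sha256: str        # hash of the frozen dataset file
    threshold: float
    threshold_rationale: str   # the "document why"
    result: float              # pass rate recorded for this run

def dataset_hash(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Placeholder record; in practice it lands in version control next to the code it gates.
record = EvalVersion(
    eval_name="date-extraction",
    eval_version="2.1.0",
    dataset_sha256="<sha256 of evals/cases.json>",
    threshold=0.95,
    threshold_rationale="downstream parsing breaks below roughly 95% accuracy",
    result=0.97,
)
print(json.dumps(asdict(record), indent=2))
```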
### 10. The eval gap is the opportunity
"Works on my machine" vs. "passes at p < 0.05." That gap is where defensible products get built.