https://github.com/justi/ruby_llm-contract
Know which LLM model to use, what it costs, and when accuracy drops. Companion gem for ruby_llm.
- Host: GitHub
- URL: https://github.com/justi/ruby_llm-contract
- Owner: justi
- License: mit
- Created: 2026-03-20T21:50:17.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-04-06T14:56:11.000Z (19 days ago)
- Last Synced: 2026-04-06T16:25:18.849Z (19 days ago)
- Topics: ai, anthropic, cost-tracking, eval, llm, model-comparison, openai, rails, regression-testing, ruby, ruby-llm
- Language: Ruby
- Size: 619 KB
- Stars: 20
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# ruby_llm-contract
The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for [ruby_llm](https://github.com/crmne/ruby_llm).
```
YOU WRITE                     THE GEM HANDLES                  YOU GET
─────────                     ───────────────                  ───────
validate { |o| ... }          catch bad answers — combined     Zero garbage
                              with retry_policy, auto-retry    in production
retry_policy                  start cheap, escalate only       Pay for the cheapest
  models: %w[nano mini full]  when validation fails            model that works
max_cost 0.01                 estimate tokens, check price,    No surprise bills
                              refuse before calling LLM
output_schema { ... }         send JSON schema to provider,    Zero parsing code
                              validate response client-side
define_eval { ... }           test cases + baselines,          Regressions caught
                              run in CI with real LLM          before deploy
recommend(candidates: [...])  evaluate all configs, pick       Optimal model +
                              cheapest that passes             retry chain
```
## Before and after
```
┌─────────────────────────────────────────────────────────────────┐
│ BEFORE: pick one model, hope for the best │
│ │
│ expensive model → accurate, but you overpay on every call │
│ cheap model → fast, but wrong answers slip to production │
│ prompt change → "looks good to me" → deploy → users suffer │
└─────────────────────────────────────────────────────────────────┘
⬇ add ruby_llm-contract
┌─────────────────────────────────────────────────────────────────┐
│ YOU DEFINE A CONTRACT │
│ │
│ output_schema { string :priority } ← valid structure │
│ validate("valid priority") { |o| ... } ← business rules │
│ retry_policy models: %w[nano mini full] ← escalation chain │
│ max_cost 0.01 ← budget cap │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ THE GEM HANDLES THE REST │
│ │
│ request ──→ ┌──────┐ ┌──────────┐ │
│ │ nano │─→ │ contract │──→ ✓ pass → done │
│ └──────┘ └────┬─────┘ │
│ │ ✗ fail │
│ ▼ │
│ ┌──────┐ ┌──────────┐ │
│ │ mini │─→ │ contract │──→ ✓ pass → done │
│ └──────┘ └────┬─────┘ │
│ │ ✗ fail │
│ ▼ │
│ ┌──────┐ ┌──────────┐ │
│ │ full │─→ │ contract │──→ ✓ pass → done │
│ └──────┘ └──────────┘ │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ YOU GET │
│ │
│ ✓ Valid output guaranteed — schema + business rules enforced │
│ ✓ Cheapest model that works — most requests stay on nano │
│ ✓ Cost, latency, tokens — tracked on every call │
│ ✓ Eval scores per model — data instead of gut feeling │
│ ✓ Regressions caught — before deploy, not after │
│ ✓ Recommendation — "use nano+mini, drop full, save $X/mo" │
└─────────────────────────────────────────────────────────────────┘
```
## 30-second version
```ruby
class ClassifyTicket < RubyLLM::Contract::Step::Base
prompt "Classify this support ticket by priority and category.\n\n{input}"
output_schema do
string :priority, enum: %w[low medium high urgent]
string :category
end
validate("urgent needs justification") { |o, input| o[:priority] != "urgent" || input.length > 20 }
retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
end
result = ClassifyTicket.run("I was charged twice")
result.parsed_output # => {priority: "high", category: "billing"}
result.trace[:model] # => "gpt-4.1-nano" (first model that passed)
result.trace[:cost] # => 0.000032
```
Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
## Install
```ruby
# Gemfile
gem "ruby_llm-contract"
```
```ruby
# e.g. in a Rails initializer
RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
```
Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc.).
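For example, pointing the gem at Anthropic is the same two lines (a sketch; the model name is illustrative):
```ruby
RubyLLM.configure { |c| c.anthropic_api_key = ENV["ANTHROPIC_API_KEY"] }
RubyLLM::Contract.configure { |c| c.default_model = "claude-3-5-haiku-latest" }
```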
## Save money with model escalation
Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:
```ruby
retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
```
```
Attempt 1: gpt-4.1-nano → contract failed ($0.0001)
Attempt 2: gpt-4.1-mini → contract passed ($0.0004)
gpt-4.1 → never called ($0.00)
```
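Conceptually, escalation is just: try the cheapest model, check the contract, move up on failure. A simplified sketch of the idea (not the gem's internals; `call_llm` and `contract_passes?` are hypothetical stand-ins):
```ruby
# A simplified illustration of the escalation loop, not the gem's internals.
MODELS = %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1].freeze

def call_llm(model, input)
  # stubbed: the real request goes through ruby_llm
  { priority: "high", category: "billing" }
end

def contract_passes?(output)
  %w[low medium high urgent].include?(output[:priority])
end

def run_with_escalation(input)
  MODELS.each do |model|
    output = call_llm(model, input)            # cheapest model first
    return output if contract_passes?(output)  # stop at the first pass
  end
  raise "all models failed the contract for: #{input.inspect}"
end
```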
Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
## Know which model to use — with data
Don't guess. Define test cases, compare models, get numbers:
```ruby
ClassifyTicket.define_eval("regression") do
add_case "billing", input: "I was charged twice", expected: { priority: "high" }
add_case "feature", input: "Add dark mode please", expected: { priority: "low" }
add_case "outage", input: "Database is down", expected: { priority: "urgent" }
end
comparison = ClassifyTicket.compare_models("regression",
models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
```
```
Candidate        Score   Cost      Avg Latency
----------------------------------------------
gpt-4.1-nano     0.67    $0.0001   48ms
gpt-4.1-mini     1.00    $0.0004   92ms
gpt-4.1          1.00    $0.0021   210ms

Cheapest at 100%: gpt-4.1-mini
```
Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
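Acting on that table is a one-line change to the step:
```ruby
retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]  # gpt-4.1 adds cost, not accuracy
```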
## Let the gem tell you what to do
Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:
```ruby
rec = ClassifyTicket.recommend("regression",
candidates: [
{ model: "gpt-4.1-nano" },
{ model: "gpt-4.1-mini" },
{ model: "gpt-5-mini", reasoning_effort: "low" },
{ model: "gpt-5-mini", reasoning_effort: "high" },
],
min_score: 0.95
)
rec.best # => { model: "gpt-4.1-mini" }
rec.retry_chain # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
rec.to_dsl # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
rec.savings # => savings vs your current model (if configured)
```
Copy `rec.to_dsl` into your step. Done.
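After pasting, the step from the 30-second version would look like this (assuming the recommendation above):
```ruby
class ClassifyTicket < RubyLLM::Contract::Step::Base
  prompt "Classify this support ticket by priority and category.\n\n{input}"
  output_schema do
    string :priority, enum: %w[low medium high urgent]
    string :category
  end
  # From rec.to_dsl: gpt-4.1 dropped, nano/mini chain kept
  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]
end
```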
## Catch regressions before users do
A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
```ruby
# Save a baseline once:
report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
report.save_baseline!(model: "gpt-4.1-nano")
# In CI — block merge if anything regressed:
expect(ClassifyTicket).to pass_eval("regression")
.with_context(model: "gpt-4.1-nano")
.without_regressions
```
```ruby
diff = report.compare_with_baseline(model: "gpt-4.1-nano")
diff.regressed? # => true
diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
diff.score_delta # => -0.33
```
No more "it worked in the playground". Regressions are caught in CI, not production.
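In practice that's one spec file in your suite (a sketch; the path and setup are illustrative):
```ruby
# spec/contracts/classify_ticket_spec.rb
require "spec_helper"

RSpec.describe ClassifyTicket do
  it "matches the saved gpt-4.1-nano baseline" do
    expect(ClassifyTicket).to pass_eval("regression")
      .with_context(model: "gpt-4.1-nano")
      .without_regressions
  end
end
```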
## A/B test your prompts
Changed a prompt? Compare old vs new on the same dataset with regression safety:
```ruby
diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
eval: "regression", model: "gpt-4.1-mini")
diff.safe_to_switch? # => true (no regressions)
diff.improvements # => [{case: "outage", ...}]
diff.score_delta # => +0.33
```
```ruby
# CI gate:
expect(ClassifyTicketV2).to pass_eval("regression")
.compared_with(ClassifyTicketV1)
.with_minimum_score(0.8)
```
## Chain steps with fail-fast
Pipeline stops at the first contract failure — no wasted tokens on downstream steps:
```ruby
class TicketPipeline < RubyLLM::Contract::Pipeline::Base
step ClassifyTicket, as: :classify
step RouteToTeam, as: :route
step DraftResponse, as: :draft
end
result = TicketPipeline.run("I was charged twice")
result.outputs_by_step[:classify] # => {priority: "high", category: "billing"}
result.trace.total_cost # => 0.000128
```
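The downstream steps are ordinary contracts built with the same DSL. A sketch of what RouteToTeam might look like (the prompt and team list are illustrative):
```ruby
class RouteToTeam < RubyLLM::Contract::Step::Base
  prompt "Which team should handle this classified ticket?\n\n{input}"
  output_schema do
    string :team, enum: %w[billing infrastructure support]
  end
  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]
end
```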
## Gate merges on quality and cost
```ruby
# RSpec — block merge if accuracy drops or cost spikes
expect(ClassifyTicket).to pass_eval("regression")
.with_minimum_score(0.8)
.with_maximum_cost(0.01)
# Rake — run all evals across all steps
RubyLLM::Contract::RakeTask.new do |t|
t.minimum_score = 0.8
t.maximum_cost = 0.05
end
# bundle exec rake ruby_llm_contract:eval
```
## Docs
| Guide | Covers |
|-------|--------|
| [Getting Started](docs/guide/getting_started.md) | Feature walkthrough, model escalation, evals |
| [Eval-First](docs/guide/eval_first.md) | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
| [Best Practices](docs/guide/best_practices.md) | Six patterns for bulletproof `validate` rules |
| [Output Schema](docs/guide/output_schema.md) | Full schema reference and constraints |
| [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeouts, fail-fast |
| [Testing](docs/guide/testing.md) | Test adapter, RSpec matchers |
| [Migration](docs/guide/migration.md) | Adopting the gem in existing Rails apps |
## Roadmap
**v0.6 (current):** "What should I do?" — `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
**v0.5:** Prompt A/B testing with `compare_with`. Soft observations with `observe`.
**v0.4:** Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
**v0.3:** Baseline regression detection, migration guide.
## License
MIT