Adoption-friction release. No runtime behavior changes — every delta is in `docs/`, `examples/`, or `spec/integration/` (plus the `version.rb` / `Gemfile.lock` bumps). Upgrading from 0.7.2 picks up the expanded guide set, the new runnable showcases, and one extra integration spec.
- New guide: `docs/guide/why.md` — four production failure modes the gem exists for (schema-valid but logically wrong output, silent prompt regression, sampling variance on fixed-temperature models, runaway cost). Each opens from a concrete incident; designed for readers who have not yet felt the pain the gem relieves.
- New guide: `docs/guide/rails_integration.md` — seven Rails-specific FAQs with runnable snippets: where step classes live (`app/contracts/`), initializer setup, background jobs, `around_call` observability, RSpec/Minitest stubs, error handling in controllers, CI gate wiring.
- README adoption-friction pass — added a short "Do I need this?" block after Install, a reading-order hint (README → `why.md` → `getting_started.md`), and outcome-based labels in the docs index ("Prevent silent prompt regressions" instead of "Eval-First", etc.).
- TL;DR box at the top of every guide — a single-sentence orientation for readers who land via search; a "Skip if" clause added where real confusion exists (`eval_first.md`, `testing.md`, `migration.md`).
- API coverage gaps closed — `estimate_cost`/`estimate_eval_cost`, `max_cost` with `on_unknown_pricing: :warn`, `run_eval(..., concurrency:)`, and `around_call` testing patterns now documented in `getting_started.md`, `eval_first.md`, `testing.md`.
- Industry-standard terminology — "temperature-locked" → "fixed-temperature", "variance-induced" → "sampling variance", "severity signals" → "severity keywords", "takeaway drift" → "tone/takeaways mismatch".
- `docs/architecture.md` refresh — the diagram now reflects the current class layout: added `Step::RetryPolicy`, `Pipeline::Result`, `Eval::AggregatedReport`, `Eval::BaselineDiff`, `Eval::PromptDiffComparator`, `Eval::EvalHistory`, `Eval::RetryOptimizer`, `OptimizeRakeTask`; replaced the outdated `Eval::TraitEvaluator` entry with `Eval::ExpectationEvaluator`.
- Business framing added to guides — every guide now opens with a concrete production scenario or "why it matters" hook before the API reference.
The previous 12-file set mixed a private Reddit promo planner, customer support, meetings, keyword extraction, and translation. The new set is seven runnable files, each answering one adopter question using the README's `SummarizeArticle` case.
| # | File | Answers |
|---|------|---------|
| 00 | `00_basics.rb` | How do I start? (seven incremental layers + real-LLM pointer) |
| 01 | `01_fallback_showcase.rb` | Show me the gem in 30 seconds (zero API keys) |
| 02 | `02_real_llm_minimal.rb` | How do I plug in a real LLM? (~30 lines) |
| 03 | `03_summarize_with_keywords.rb` | How does the contract evolve? (growing prompt) |
| 04 | `04_summarize_and_translate.rb` | Pipeline composition + pipeline-level `run_eval` |
| 05 | `05_eval_dataset.rb` | How do I stop silent prompt regressions? |
| 06 | `06_retry_variants.rb` | `attempts: 3`, `reasoning_effort` escalation, cross-provider (Ollama → Anthropic → OpenAI) |
Every file carries an "Expected output" block in its header, so readers see the result without running the script. The `docs/ideas/` directory is now fully untracked (already in `.gitignore`; one stray file removed from version control).
- Schema pitfall fixed in 5 files — `array :x do; string :y; ...; end` silently produces `items: string` and drops every declaration after the first, matching the documented pitfall in `spec/ruby_llm/contract/nested_schema_spec.rb:71`. Every affected array block is now wrapped in `object do ... end`.
- `examples/05_eval_dataset.rb` (pre-renumber: `09_eval_dataset.rb`) — `result[:passed]` → `result.passed?`; the previous code called `[]` on an `Eval::CaseResult` and raised `NoMethodError` at runtime.
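The pitfall and its fix can be sketched with the schema DSL described above. This is a hedged illustration: the `output_schema` entry point and the `keywords`/`term`/`weight` field names are assumptions for the example, not taken from the affected files.

```ruby
# BROKEN — inside `array`, the first bare declaration collapses to
# `items: string` and everything after it is silently dropped.
output_schema do
  array :keywords do
    string :term     # becomes items: string
    string :weight   # silently dropped
  end
end

# FIXED — wrap the item shape in an explicit object block.
output_schema do
  array :keywords do
    object do
      string :term
      string :weight
    end
  end
end
```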
- New `spec/integration/pipeline_eval_spec.rb` — three cases guaranteeing pipeline-level `run_eval` stays functional: happy path, final-step mismatch, and fail-fast propagation when an intermediate `validate` rejects. Closes the "09 STEP 5 pipeline evaluation" known issue flagged in the 0.7.2 release. The fail-fast case asserts `step_status == :validation_failed` and the validate's label in `details`, so a regression that short-circuits on schema instead of validate would fail loudly.
- Removed `examples/01_classify_threads.rb`, `02_generate_comment.rb`, `03_target_audience.rb`, `10_reddit_full_showcase.rb`, and `spec/integration/reddit_pipeline_spec.rb` — the Reddit Promo Planner was a separate private project; its examples do not belong in the gem's public repo.
- Removed `examples/02_output_schema.rb` — fully covered by `docs/guide/output_schema.md`; deleting it avoids duplication.
- Terminal output labels renamed for consistency with the README narrative. `print_summary` now prints `Hardest eval` (was `Constraining eval`) and `Suggested fallback list` (was `Suggested chain`), and the production-mode table uses `first-attempt`/`fallback %` as column headers (was `single-shot`/`escalation`). Programmatic metric names are unchanged: `single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`. `RetryOptimizer::Result` exposes `hardest_eval` as an alias for `constraining_eval`.
- `docs/guide/optimizing_retry_policy.md` rewritten — 17.7k → 6.4k characters. Continues the `SummarizeArticle` narrative from the README. Offline mode clearly positioned as a wiring check; real optimization runs via `LIVE=1 RUNS=3`. Output samples match the actual `print_summary` format.
- `docs/guide/getting_started.md` rewritten — 8.7k → 6.1k. Every example uses `SummarizeArticle`. The Evals + CI gates section moved before Budget caps. Structured Prompts / Dynamic Prompts / "Already using ruby_llm?" / Reasoning effort sections removed; content delegated to `prompt_ast.md` and the README.
- `docs/guide/eval_first.md` refined — 6.3k → 5.0k. Switched to the `SummarizeArticle` case. Team workflow section compressed with links back to `getting_started.md` for the matcher chain.
- `docs/guide/testing.md` refined — 10.7k → 7.4k. Switched to the `SummarizeArticle` case. Threshold gating / Rake task / baseline walkthrough / prompt A/B sections delegated back to `getting_started.md` and `eval_first.md`.
- `docs/guide/output_schema.md` DSL bug fix — the Supported constraints table documented JSON Schema camelCase keys (`minLength`, `minItems`, `additionalProperties`) that are not valid DSL arguments; every copy-paste from the previous table would have raised `ArgumentError`. Switched to snake_case (`min_length`, `min_items`, `additional_properties`) as the DSL actually expects; added a short note on the internal camelCase conversion.
- `docs/guide/best_practices.md`, `pipeline.md`, `migration.md` sanity pass — terminology alignment (model escalation → model fallback in narrative; the `escalate` DSL method unchanged) and the `SummarizeArticle` case where the guide is not inherently multi-step.
- `Step::Base#run_once` no longer swallows adapter-phase `ArgumentError` as `:input_error`. The previous blanket `rescue ArgumentError` was there to convert DSL misconfiguration (e.g. a missing `prompt`) into an `:input_error` Result. Side effect: programmer bugs in adapter code that raised `ArgumentError` (wrong arity, bad config argument) were silently coerced into `:input_error` and retried as if the user had given bad input. The rescue is now narrowed to the Runner-construction phase only — DSL configuration errors still produce `:input_error` (the `prompt has not been set` case is regression-tested), but an `ArgumentError` raised from adapter code during `Runner#call` propagates to the caller. Input-type validation failures continue to produce `:input_error` through `InputValidator`'s own scoped rescue, unchanged.
- `:adapter_error` removed from `DEFAULT_RETRY_ON`. New default: `[:validation_failed, :parse_error]`. `ruby_llm` already retries transport errors (`RateLimitError`, `ServerError`, `ServiceUnavailableError`, `OverloadedError`, timeouts) at the Faraday layer, so the previous default re-ran the same model on errors the HTTP middleware had already retried with backoff. To restore pre-0.7 behavior: `retry_on :validation_failed, :parse_error, :adapter_error`. Recommended pattern: pair `:adapter_error` with `escalate "model_a", "model_b"` — a different model/provider can bypass what transport retry could not.
- `AdapterCaller` narrows its `rescue` from `StandardError` to `RubyLLM::Error` + `Faraday::Error`. Provider errors and transport errors that escape ruby_llm's Faraday retry middleware (`Faraday::TimeoutError`, `Faraday::ConnectionFailed`) still produce `:adapter_error` as before. Programmer errors that are neither (`NoMethodError`, adapter code bugs) now propagate instead of being silently converted to `:adapter_error` and retried. Known limitation: adapter code raising `ArgumentError` is still coerced into `:input_error` by `Step::Base#run_once` (which rescues `ArgumentError` for input-type validation); disambiguating adapter `ArgumentError` from input-validation `ArgumentError` requires a `run_once` refactor and is tracked as a follow-up.
If you rely on the old behavior, opt in explicitly:

```ruby
retry_policy do
  attempts 3
  retry_on :validation_failed, :parse_error, :adapter_error
end
```

Or better, with a model fallback chain:

```ruby
retry_policy do
  escalate "gpt-4.1-nano", "gpt-4.1-mini"
  retry_on :validation_failed, :parse_error, :adapter_error
end
```

- `production_mode:` on `compare_models` and `optimize_retry_policy` — measures retry-aware, end-to-end cost per successful output. Pass `production_mode: { fallback: "gpt-5-mini" }` and each candidate runs with a runtime-injected `[candidate, fallback]` retry chain. The report exposes `escalation_rate`, `single_shot_cost`, and `effective_cost`, so "the cheaper candidate" decision matches production cost rather than first-attempt cost.
- New Report metrics — `escalation_rate`, `single_shot_cost`, `effective_cost`, `single_shot_latency_ms`, `effective_latency_ms`, `latency_percentiles` (p50/p95/max). `AggregatedReport` averages all of them across `runs:`.
- Extended `ModelComparison#table` — when `production_mode:` is set, renders a `Chain` column (candidate → fallback) with single-shot, escalation, effective cost, latency, and score. The edge case `candidate == fallback` renders as a single model with `—` in the escalation column, and retry injection is skipped entirely, so `effective == single-shot` by construction, not by coincidence.
- `context[:retry_policy_override]` — new context key that nullifies or replaces a class-level `retry_policy` for a single call. Used internally by production-mode injection; safe to use directly when you need a transient override that doesn't mutate the step class.
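The relationship between the three cost metrics can be illustrated in plain Ruby. The blended formula and the numbers below are illustrative assumptions, not the gem's internal computation:

```ruby
# Illustrative: blend first-attempt cost with fallback cost, weighted by
# how often the candidate had to escalate. Formula and numbers are assumptions.
single_shot_cost = 0.000032  # candidate's first-attempt cost per call (USD)
fallback_cost    = 0.000250  # one fallback attempt on the bigger model (USD)
escalation_rate  = 0.15      # fraction of calls that needed the fallback

effective_cost = single_shot_cost + escalation_rate * fallback_cost
# With escalation_rate = 0.0 the two metrics coincide — which is why the
# candidate == fallback edge case reports effective == single-shot.
```

A candidate that looks cheapest on first-attempt cost can lose on effective cost once its escalation rate is factored in.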
- Single-fallback (2-tier) chains only. Multi-tier chains can be inspected post hoc via `trace.attempts` but aren't summarized in the optimize table.
- Costs with `runs: 3` + `production_mode: { fallback: "gpt-5-mini" }` are ≈3× a single-shot eval plus the actual retry attempts — not 6×. Production-mode metrics come from a single pass.
- Step-only. Calling `compare_models` with `production_mode:` on a `Pipeline::Base` subclass raises `ArgumentError` — retry injection is Step-level and pipeline-wide fallback semantics aren't defined yet. Benchmark individual steps.
- Guide: Production-mode cost measurement — API, metric interpretation, 2-tier scope note.
- `runs:` parameter on `compare_models` and `optimize_retry_policy` — runs each candidate N times per eval and aggregates the mean score, mean cost per run, and mean latency. Reduces sampling variance in live mode where LLM outputs are non-deterministic (the gpt-5 family enforces `temperature=1.0` server-side, so a single unlucky sample can misclassify a viable candidate as "failing"). Default `runs: 1` — backward compatible.
- `RUNS=N` on `rake ruby_llm_contract:optimize` — CLI flag for variance-aware optimization.
- `Eval::AggregatedReport` — duck-typed `Report` exposing `score` (mean), `score_min`/`score_max` (spread), `total_cost` (mean per run), `pass_rate` (clean-pass count x/N), and `clean_passes`.
- Guide: Reducing variance with `runs:` — when to use it and why.
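The shape of the metrics `AggregatedReport` exposes can be sketched in plain Ruby. This is an illustration of the aggregation semantics, not the gem's implementation:

```ruby
# Sketch of runs:-style aggregation — not the gem's internals.
run_scores = [1.0, 0.8, 1.0]  # per-run eval scores from e.g. runs: 3

mean_score   = run_scores.sum / run_scores.size.to_f
score_min    = run_scores.min          # spread: worst run
score_max    = run_scores.max          # spread: best run
clean_passes = run_scores.count { |s| s == 1.0 }
pass_rate    = "#{clean_passes}/#{run_scores.size}"  # clean-pass count "x/N"
```

The spread (`score_min`/`score_max`) is what tells you whether a mean of 0.93 is a stable 0.93 or one unlucky sample dragging a perfect candidate down.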
- `Step.optimize_retry_policy` — runs `compare_models` on ALL evals for the step, builds a score matrix, identifies the constraining eval, and suggests a retry chain. The chain's last model always passes all evals (safe fallback).
- `rake ruby_llm_contract:optimize` — one-command retry chain optimization. Prints the score table, the constraining eval, the suggested chain, and copy-paste DSL.
- Offline by default — `optimize` uses `sample_response` (zero API calls) unless `LIVE=1` or `PROVIDER=` is set.
- `EVAL_DIRS=` support — non-Rails setups can specify eval file directories.
- Guide: Optimizing retry_policy — full procedure with prerequisites, troubleshooting, and a real-world example.
- Chain semantics aligned with `retry_executor` — retry fires on `validation_failed`/`parse_error`, not on low eval score. Disjoint eval coverage (A passes e1, B passes e2, neither passes both) correctly returns an empty chain.
- Removed the ActiveSupport dependency from the rake task (`.presence` → `.empty?`).
- Added `require "set"` for non-Rails environments.
- Multi-provider operator tooling — rake tasks support `PROVIDER=openai|anthropic|ollama`, `CANDIDATES=model@effort,...`, and `REASONING_EFFORT=low|medium|high`.
- `rake ruby_llm_contract:recommend` — wraps `Step.recommend` with a CLI interface; prints best config, retry chain, DSL, rationale, and savings.
- Ollama support — `PROVIDER=ollama` with configurable `OLLAMA_API_BASE`.
"What should I do?" — model + configuration recommendation.
Step.recommend—ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)runs eval on all candidates and returns aRecommendationwith optimal model, retry chain, rationale, savings vs current config, andto_dslcode output.- Candidates as configurations —
candidates:accepts{ model:, reasoning_effort: }hashes, not just model name strings.gpt-5-miniwithreasoning_effort: "low"is a different candidate than with"high". compare_modelsextended — newcandidates:parameter alongside existingmodels:(backward compatible). Candidate labels include reasoning effort in output table.- Per-attempt
reasoning_effortin retry policies —escalateaccepts config hashes:escalate({ model: "gpt-4.1-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" }). Each attempt gets its own reasoning_effort forwarded to the provider. pass_rate_ratio— numeric float (0.0–1.0) onReportandReportStats, complementing the stringpass_rate("3/5").- History entries enriched —
save_history!acceptsreasoning_effort:and storesmodel,reasoning_effort,pass_rate_ratioin JSONL entries.
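Putting candidates-as-configurations and `Step.recommend` together, a hypothetical invocation might look like this. `ClassifyTicket` is the example step used elsewhere in these notes; the model names, threshold, and printed attributes are illustrative, not a definitive API transcript:

```ruby
# Hypothetical usage sketch of Step.recommend with configuration candidates.
recommendation = ClassifyTicket.recommend(
  "regression",
  candidates: [
    { model: "gpt-4.1-nano" },
    { model: "gpt-5-mini", reasoning_effort: "low" },
    { model: "gpt-5-mini", reasoning_effort: "high" }
  ],
  min_score: 0.95
)

puts recommendation.rationale  # why this model/chain was chosen
puts recommendation.to_dsl     # copy-paste model/retry_policy DSL
```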
v0.2: "Which model?" → compare_models (snapshot)
v0.3: "Did it change?" → baseline regression (binary)
v0.4: "Show me the trend" → eval history (time series)
v0.5: "Which prompt is better?" → compare_with (A/B testing)
v0.6: "What should I do?" → recommend (actionable advice)
- `reasoning_effort` forwarded to provider — `context: { reasoning_effort: "low" }` is now passed through `with_params` to the LLM. Previously accepted as a known context key but silently ignored by the RubyLLM adapter.
Data-Driven Prompt Engineering.

- `observe` DSL — soft observations that log but never fail: `observe("scores differ") { |o| o[:a] != o[:b] }`. Results in `result.observations`; logged via `Contract.logger` when they fail. Runs only when validation passes.
- `compare_with` — prompt A/B testing. `StepV2.compare_with(StepV1, eval: "regression", model: "nano")` returns a `PromptDiff` with `improvements`, `regressions`, `score_delta`, `safe_to_switch?`. Reuses `BaselineDiff` internally.
- RSpec `compared_with` chain — `expect(StepV2).to pass_eval("x").compared_with(StepV1).without_regressions` blocks the merge if the new prompt regresses any case.
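A sketch of the prompt A/B flow outside RSpec, assuming `StepV1`/`StepV2` are two versions of the same step as in the bullet above. The branch bodies are illustrative:

```ruby
# Hypothetical prompt A/B check — compare a rewritten step against the
# current one on the "regression" eval before switching prompts.
diff = StepV2.compare_with(StepV1, eval: "regression", model: "nano")

if diff.safe_to_switch?
  puts "OK to switch (score delta: #{diff.score_delta})"
else
  puts "Hold off — regressed cases: #{diff.regressions}"
end
```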
v0.2: "Which model?" → compare_models (snapshot)
v0.3: "Did it change?" → baseline regression (binary)
v0.4: "Show me the trend" → eval history (time series)
v0.5: "Which prompt is better?" → compare_with (A/B testing)
Audit hardening — 18 bugs fixed across 4 audit rounds.
- RakeTask history before abort — `track_history` now saves all reports (pass and fail) before gating, so failed runs appear in eval history.
- RSpec/Minitest stub scoping — block-form `stub_step` uses thread-local overrides with real cleanup. Non-block `stub_all_steps` is auto-restored by an RSpec `around(:each)` hook and Minitest `setup`/`teardown`.
- StepAdapterOverride — handles `context: nil` and respects the string key `"adapter"`. Moved to `contract.rb` so both test frameworks share one mechanism.
- max_cost fail-closed output estimate — preflight uses 1× input tokens as the output estimate when `max_output` is not set, preventing cost bypass for output-expensive models.
- `reset_configuration!` clears overrides — `step_adapter_overrides` is now cleared on reset.
- `CostCalculator.register_model` — validates `Numeric`, `finite?`, non-negative. Rejects NaN, Infinity, strings, nil.
- Pipeline token_budget — rejects negative and zero values (parity with `timeout_ms`).
- track_history model fallback — uses the step DSL `model`, then `default_model` when the context has no model. Handles the string key `"model"`.
- `estimate_cost`/`estimate_eval_cost` — fall back to the step DSL model when no explicit model arg is given.
- stub_steps string keys — both RSpec and Minitest normalize string-keyed options with `transform_keys(&:to_sym)`.
- DSL `:default` reset — `model(:default)`, `temperature(:default)`, `max_cost(:default)` reset inherited parent values.
- `stub_steps` (plural) — stub multiple steps with different responses in one block; no nesting needed. Works in RSpec and Minitest:

```ruby
stub_steps(
  ClassifyTicket => { response: { priority: "high" } },
  RouteToTeam   => { response: { team: "billing" } }
) { TicketPipeline.run("test") }
```
Production feedback release.

- `stub_step` block form — `stub_step(Step, response: x) { test }` auto-resets the adapter after the block. Works in RSpec and Minitest. Eliminates leaked test state.
- Minitest per-step routing — `stub_step(StepA, ...)` now actually routes to StepA only (it was setting the global adapter, ignoring the step class).
- `track_history` in RakeTask — `t.track_history = true` auto-appends every eval run (pass and fail) to `.eval_history/`. Drift detection without manual `save_history!` calls.
- `max_cost` fails closed — unknown model pricing now refuses the call instead of silently skipping. Set `on_unknown_pricing: :warn` for the old behavior.
- `CostCalculator.register_model` — register pricing for custom/fine-tuned models: `register_model("ft:gpt-4o", input_per_1m: 3.0, output_per_1m: 6.0)`.
- RakeTask lazy context — `t.context` now accepts a Proc, resolved at task runtime (after `:environment`). Fixes the adapter not being available at Rake load time in Rails apps.
- RakeTask `:environment` fix — uses `defined?(::Rails)` instead of `Rake::Task.task_defined?(:environment)`. Works in Rails 8 without a manual `Rake::Task.enhance`.
- Concurrent eval deterministic — `clone_for_concurrency` protocol, `ContextHelpers` extracted.
- README — added eval history, concurrency, and quality tracking examples.
Observability & Scale — see what changed, run it fast, debug it easily.

- Structured logging — `Contract.configure { |c| c.logger = Rails.logger }`. Auto-logs model, status, latency, tokens, and cost on every `step.run`.
- Batch eval concurrency — `run_eval("regression", concurrency: 4)`. Parallel case execution via `Concurrent::Future`. 4× faster CI for large eval suites.
- Eval history & trending — `report.save_history!` appends to JSONL. `report.eval_history` returns an `EvalHistory` with `score_trend`, `drift?`, and run-by-run scores.
- Pipeline per-step eval — `add_case(..., step_expectations: { classify: { priority: "high" } })`. See which step in a pipeline regressed.
- Minitest support — `assert_satisfies_contract`, `assert_eval_passes`, `stub_step` for Minitest users. `require "ruby_llm/contract/minitest"`.
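Wiring the logging, concurrency, and history features together might look like this. A hedged sketch: the initializer path and the `ClassifyTicket` step are illustrative, and the calls shown are only the ones named in the bullets above:

```ruby
# config/initializers/ruby_llm_contract.rb (path illustrative)
# Structured logging: model, status, latency, tokens, cost on every step.run.
Contract.configure do |c|
  c.logger = Rails.logger
end

# In CI or a console: concurrent eval + drift tracking (sketch).
report = ClassifyTicket.run_eval("regression", concurrency: 4)
report.save_history!                          # append a JSONL history entry
puts "quality drifting" if report.eval_history.drift?
```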
v0.2: "Which model?" → compare_models (snapshot)
v0.3: "Did it change?" → baseline regression (binary)
v0.4: "Show me the trend" → eval history (time series)
"Which step changed?" → pipeline per-step eval
"Run it fast" → batch concurrency
- Trait missing key = error — `expected_traits: { title: 0..5 }` on output `{}` now fails instead of silently passing.
- nil input in dynamic prompts — `run(nil)` with `prompt { |input| ... }` correctly passes nil to the block.
- Defensive sample pre-validation — `sample_response` uses the same parser as runtime (handles code fences, BOM, prose around JSON).
- Baseline diff excludes skipped — self-compare with skipped cases no longer shows an artificial score delta.
- Zeitwerk eval/ ignore — `eager_load_contract_dirs!` ignores `eval/` subdirs before eager load.
- Recursive array/object validation — nested arrays (array of array of string) are validated recursively. Object items are validated even without `:properties` (e.g. `additionalProperties: false`).
- Deep symbolize in sample pre-validation — array samples with string keys (`[{"name" => "Alice"}]`) are correctly symbolized before schema validation.
- String constraints in SchemaValidator — `minLength`/`maxLength` enforced for root and nested strings.
- Array item validation — scalar items (string, integer) validated against the items schema type and constraints.
- Non-JSON sample_response fails fast — `sample_response("hello")` with an object schema raises `ArgumentError` at definition time instead of silently passing.
- `max_tokens` in KNOWN_CONTEXT_KEYS — no more spurious "Unknown context keys" warning.
- Duplicate models deduplicated — `compare_models(models: ["m", "m"])` runs the model once.
- SchemaValidator validates non-object roots — boolean, integer, number, and array root schemas now enforce type, min/max, enum, `minItems`/`maxItems`. Previously only object schemas were validated.
- Removed passing cases = regression — `regressed?` returns true when the baseline had passing cases that are now missing. Prevents gate bypass by deleting eval cases.
- JSON string sample_response fixed — `sample_response('{"name":"Alice"}')` is correctly parsed for pre-validation instead of double-encoded.
- `context[:max_tokens]` forwarded — overrides the step's `max_output` for the adapter call AND the budget precheck.
- Skipped cases visible in regression diff — baseline PASS → current SKIP is now detected as a regression by `without_regressions` and `fail_on_regression`.
- Skip only on missing adapter — the eval runner no longer masks evaluator errors as SKIP. Only "No adapter configured" triggers a skip.
- Array/Hash sample pre-validation — `sample_response([{...}])` is correctly validated against the schema instead of silently skipped.
- `assume_model_exists: false` forwarded — a boolean `false` is no longer dropped by a truthiness check in adapter options.
- Duplicate case names caught at definition — `add_case`/`verify` with the same name raises immediately, not at run time.
- Array response preserved — `Adapters::RubyLLM` no longer stringifies Array content. Steps with `output_type Array` work correctly.
- Falsy prompt input — `run(false)` and `build_messages(false)` pass `false` to dynamic prompt blocks instead of falling back to `instance_eval`.
- `retry_on` flatten — `retry_on([:a, :b])` no longer wraps in a nested array.
- Builder reset — `Prompt::Builder` resets nodes on each build (no accumulation on reuse).
- Pipeline false output — `output: false` no longer shows "(no output)" in `pretty_print`.
Fixes from persona_tool production deployment (4 services migrated).
- Proc/Lambda in `expected_traits` — `expected_traits: { score: ->(v) { v > 3 } }` now works.
- Zeitwerk eager-load — `load_evals!` eager-loads `app/contracts/` and `app/steps/` before loading eval files. Fixes uninitialized-constant errors in Rake tasks.
- Falsy values — `expected: false`, `input: false`, `sample_response(nil)` all handled correctly.
- Context key forwarding — `provider:` and `assume_model_exists:` forwarded to the adapter. `schema:` and `max_tokens:` are step-level only (no split-brain).
- Deep-freeze immutability — constructors never mutate the caller's data.
Baseline regression detection — know when quality drops before users do.

- `report.save_baseline!` — serializes eval results to `.eval_baselines/` (JSON, git-tracked).
- `report.compare_with_baseline` — returns a `BaselineDiff` with regressions, improvements, score_delta, and new/removed cases.
- `diff.regressed?` — true when any previously passing case now fails.
- `without_regressions` RSpec chain — `expect(Step).to pass_eval("x").without_regressions`.
- RakeTask `fail_on_regression` — blocks CI when regressions are detected.
- RakeTask `save_baseline` — auto-saves after a successful run.
- Migration guide — `docs/guide/migration.md` with 7 patterns for adopting the gem in existing Rails apps.
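The intended baseline loop, sketched end to end. The step name and the raise message are illustrative; the method calls are the ones introduced above:

```ruby
# 1. On a known-good branch: establish the baseline (sketch).
report = ClassifyTicket.run_eval("regression")
report.save_baseline!  # writes JSON under .eval_baselines/ (git-tracked)

# 2. After a prompt or model change: diff against it.
diff = ClassifyTicket.run_eval("regression").compare_with_baseline
raise "quality regressed: #{diff.regressions}" if diff.regressed?
```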
- 1086 tests, 0 failures
Production hardening from a senior Rails review panel.

- `around_call` propagates exceptions — it no longer silently swallows DB errors, timeouts, etc. A user who wants swallowing can rescue in their own block.
- Nil section content skipped — `section "X", nil` no longer renders `"null"` to the LLM; the section is omitted entirely.
- Range support in `expected:` — `expected: { score: 1..5 }` works in `add_case`. Previously only Regexp was supported.
- `Trace#dig` — `trace.dig(:usage, :input_tokens)` works on both Step and Pipeline traces.
Fixes from the first real-world integration (persona_tool).

- `around_call` fires per-run — not per-attempt. With a retry_policy, the callback fires once with the final result. Signature: `around_call { |step, input, result| ... }`.
- `Result#trace` is always a `Trace` object — never a bare Hash. `result.trace.model` works on success AND failure.
- `around_call` exception safe — warns and returns the result instead of crashing.
- `model` DSL — `model "gpt-4o-mini"` per step. Priority: context > step DSL > global config.
- Test adapter `raw_output` always String — Hash/Array normalized via `.to_json`.
- `Trace#dig` — `trace.dig(:usage, :input_tokens)` works.
Production DX improvements from the first real-world integration (persona_tool).

- `temperature` DSL — `temperature 0.3` in the step definition, overridable via `context: { temperature: 0.7 }`. RubyLLM handles per-model normalization natively.
- `around_call` hook — callback for logging, metrics, observability. Replaces the need for custom middleware.
- `build_messages` public — inspect the rendered prompt without running the step.
- `stub_step` RSpec helper — `stub_step(MyStep, response: { ... })` reduces test boilerplate. Auto-included via `require "ruby_llm/contract/rspec"`.
- `estimate_cost`/`estimate_eval_cost` — predict spend before API calls.
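The new DX pieces combined in one hypothetical step. The class name, prompt-less body, and logger line are illustrative; whether `around_call` is declared in the class body is an assumption based on the DSL style elsewhere in these notes:

```ruby
# Hypothetical step using the temperature DSL and the around_call hook.
class SummarizeTicket < RubyLLM::Contract::Step::Base
  temperature 0.3

  around_call do |step, input, result|
    Rails.logger.info("#{step.class}: #{result.trace.model}")
  end
end

# Inspect the rendered prompt without calling the LLM.
SummarizeTicket.build_messages("ticket text")
```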
- Reload lifecycle — `load_evals!` clears definitions before re-loading. The Railtie hooks `config.to_prepare` for development reload. `define_eval` warns on a duplicate name (suppressed during reload).
- Pipeline eval cost — uses `Pipeline::Trace#total_cost` (all steps), not just the last step.
- Adapter isolation — `compare_models` and `run_all_own_evals` deep-dup the context per run.
- Offline mode — cases without an adapter return `:skipped` instead of crashing. Skipped cases are excluded from the score.
- `expected_traits` reachable from the `define_eval` DSL via `add_case`.
- `verify` raises when both a positional and the `expect:` keyword are provided.
- `best_for` excludes zero-score models from the recommendation.
- `print_summary` replaces `pretty_print` (avoids the `Kernel#pretty_print` shadow).
- `CaseResult#to_h` round-trips correctly (`name:` key).
- All 5 guides updated for v0.2 API
- Symbol keys documented
- Retry model priority documented
- Test adapter format documented
- 1077 tests, 0 failures
- 3 architecture review rounds, 32 findings fixed
Contracts for LLM quality. Know which model to use, what it costs, and when accuracy drops.
- `report.results` returns `CaseResult` objects instead of hashes. Use `result.name`, `result.passed?`, `result.score` instead of `result[:case_name]`, `result[:passed]`. `CaseResult#to_h` for backward compat.
- `report.print_summary` replaces `report.pretty_print` (avoids shadowing `Kernel#pretty_print`).
- `add_case` in `define_eval` — `add_case "billing", input: "...", expected: { priority: "high" }` with partial matching. Supports `expected_traits:` for regex/range matching.
- `CaseResult` value objects — `result.name`, `result.passed?`, `result.output`, `result.expected`, `result.mismatches` (structured diff), `result.cost`, `result.duration_ms`. `report.failures` returns only failed cases. `report.skipped` counts skipped (offline) cases.
- Model comparison — `Step.compare_models("eval", models: %w[nano mini full])` runs the same eval across models. Returns a table with score/cost/latency per model. `comparison.best_for(min_score: 0.95)` returns the cheapest model meeting the threshold.
- Cost tracking — `report.total_cost`, `report.avg_latency_ms`, per-case `result.cost`. Pipeline eval uses total pipeline cost, not just the last step.
- Cost prediction — `Step.estimate_cost(input:, model:)` and `Step.estimate_eval_cost("eval", models: [...])` predict spend before API calls.
- CI gating — `pass_eval("regression").with_minimum_score(0.8).with_maximum_cost(0.01)`. RakeTask with suite-level `minimum_score` and `maximum_cost`.
- `RubyLLM::Contract.run_all_evals` — discovers all Steps/Pipelines with evals and runs them all. Includes inherited evals.
- `RubyLLM::Contract::RakeTask` — `rake ruby_llm_contract:eval` with `minimum_score`, `maximum_cost`, `fail_on_empty`, `eval_dirs`.
- Rails Railtie — auto-loads eval files via `config.after_initialize` + `config.to_prepare` (supports development reload).
- Offline mode — cases without an adapter return `:skipped` instead of crashing. Skipped cases are excluded from score/passed.
- Safe `define_eval` — warns on duplicate name; suppressed during reload.
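The eval DSL from this release, sketched end to end. Step name, inputs, and case names are illustrative; the calls are the ones listed above:

```ruby
# Hypothetical eval definition and run using the v0.2 API surface.
class ClassifyTicket < RubyLLM::Contract::Step::Base
  define_eval "regression" do
    add_case "billing",  input: "I was double-charged",
                         expected: { priority: "high" }       # partial match
    add_case "greeting", input: "hi there",
                         expected_traits: { priority: /low/ } # regex trait
  end
end

report = ClassifyTicket.run_eval("regression")
report.print_summary
report.failures.each { |r| puts "#{r.name}: #{r.mismatches}" }
```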
- P1: Eval files not autoloaded by Rails — the Railtie uses `load` (not Zeitwerk) and hooks into the reloader for dev.
- P2: `report.results` returns raw Hashes — now returns `CaseResult` objects.
- P3: No way to run all evals at once — `Contract.run_all_evals` + Rake task.
- P4: String vs. symbol key mismatch — warns when a `validate` or `verify` proc returns nil.
- Pipeline eval cost — uses `Pipeline::Trace#total_cost` (all steps), not just the last step.
- Reload lifecycle — `load_evals!` clears definitions before re-loading. The Registry filters stale hosts.
- Adapter isolation — `compare_models` and `run_all_own_evals` deep-dup the context per run.
```
Model          Score   Cost        Avg Latency
----------------------------------------------
gpt-4.1-nano   0.67    $0.000032   687ms
gpt-4.1-mini   1.00    $0.000102   1070ms
```
- 1077 tests, 0 failures
- 3 architecture review rounds, 32 findings fixed
- Verified with real OpenAI API (gpt-4.1-nano, gpt-4.1-mini)
Initial release.
- Step abstraction — `RubyLLM::Contract::Step::Base` with prompt DSL, typed input/output.
- Output schema — declarative structure via ruby_llm-schema, sent to the provider for enforcement.
- Validate — business-logic checks (1-arity, and 2-arity with input cross-validation).
- Retry with model escalation — start cheap, auto-escalate on contract failure or network error.
- Preflight limits — `max_input`, `max_cost`, `max_output` refuse before calling the LLM.
- Pipeline — multi-step composition with fail-fast, timeout, token budget.
- Eval — offline contract verification with `define_eval`, `run_eval`, zero-verify auto-case.
- Adapters — RubyLLM (production), Test (deterministic specs).
- RSpec matchers — `satisfy_contract`, `pass_eval`.
- Structured trace — model, latency, tokens, cost, attempt log per step.
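A minimal step under the initial-release API, as a sketch. The prompt text, schema fields, validate label, and cost cap are illustrative; the DSL methods are the ones listed above, and the string-literal `prompt` form is an assumption:

```ruby
# Hypothetical minimal step combining the initial-release features.
class ClassifyTicket < RubyLLM::Contract::Step::Base
  prompt "Classify the support ticket priority as high, medium, or low."

  output_schema do
    string :priority
  end

  validate("priority is known") { |out| %w[high medium low].include?(out[:priority]) }

  max_cost 0.01 # preflight: refuse before calling the LLM
end

result = ClassifyTicket.run("My invoice is wrong")
puts result.trace.model # trace is populated on success and failure
```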
- 1005 tests, 0 failures
- 42 bugs found and fixed via 10 rounds of adversarial testing
- 0 RuboCop offenses
- Parser handles: markdown code fences, UTF-8 BOM, JSON extraction from prose
- SchemaValidator: full nested validation, `additionalProperties`, `minItems`/`maxItems`, `minLength`/`maxLength`
- Deep-frozen `parsed_output` prevents mutation via shared references