Changelog

0.7.3 (2026-04-24)

Adoption-friction release. No runtime behavior changes — every delta is in docs/, examples/, or spec/integration/ (plus the version.rb / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the new runnable showcases, and one extra integration spec.

Documentation

  • New guide: docs/guide/why.md — four production failure modes the gem exists for (schema-valid logically wrong, silent prompt regression, sampling variance on fixed-temperature models, runaway cost). Opens from a concrete incident each time; designed for readers who have not yet felt the pain the gem relieves.
  • New guide: docs/guide/rails_integration.md — seven Rails-specific FAQs with runnable snippets: where step classes live (app/contracts/), initializer setup, background jobs, around_call observability, RSpec/Minitest stubs, error handling in controllers, CI gate wiring.
  • README adoption-friction pass — added a short "Do I need this?" block after Install, a reading-order hint (README → why.md → getting_started.md), and outcome-based labels in the docs index ("Prevent silent prompt regressions" instead of "Eval-First", etc.).
  • TL;DR box at the top of every guide — single-sentence orientation for readers who land via search; "Skip if" clause added where real confusion exists (eval_first.md, testing.md, migration.md).
  • API coverage gaps closed: estimate_cost / estimate_eval_cost, max_cost on_unknown_pricing: :warn, run_eval(..., concurrency:), and around_call testing patterns are now documented in getting_started.md, eval_first.md, testing.md.
  • Industry-standard terminology: temperature-locked → fixed-temperature, variance-induced → sampling variance, severity signals → severity keywords, takeaway drift → tone/takeaways mismatch.
  • docs/architecture.md refresh — diagram now reflects the current class layout: added Step::RetryPolicy, Pipeline::Result, Eval::AggregatedReport, Eval::BaselineDiff, Eval::PromptDiffComparator, Eval::EvalHistory, Eval::RetryOptimizer, OptimizeRakeTask. Replaced the outdated Eval::TraitEvaluator entry with Eval::ExpectationEvaluator.
  • Business framing added to guides — every guide opens with a concrete production scenario or "why it matters" hook before the API reference.

Examples — consolidated on SummarizeArticle, renumbered 00-06

The previous 12-file set mixed a private Reddit promo planner, customer support, meetings, keyword extraction, and translation. The new set is seven runnable files, each answering one adopter question on the README's SummarizeArticle case.

#   File                              Answers
------------------------------------------------------------------------------------------------------------------
00  00_basics.rb                      How do I start? (seven incremental layers + real-LLM pointer)
01  01_fallback_showcase.rb           Show me the gem in 30 seconds (zero API keys)
02  02_real_llm_minimal.rb            How do I plug in a real LLM? (~30 lines)
03  03_summarize_with_keywords.rb     How does the contract evolve? (growing prompt)
04  04_summarize_and_translate.rb     Pipeline composition + pipeline-level run_eval
05  05_eval_dataset.rb                How do I stop silent prompt regressions?
06  06_retry_variants.rb              attempts: 3, reasoning_effort escalation, cross-provider (Ollama → Anthropic → OpenAI)

Every file carries an "Expected output" block in its header so readers see the result without running the script. The docs/ideas/ directory is now fully untracked (already in .gitignore; one stray file removed from version control).

Examples — bug fixes carried along

  • Schema pitfall fixed in 5 files: array :x do; string :y; ...; end silently produces items: string and drops every declaration after the first, matching the documented pitfall in spec/ruby_llm/contract/nested_schema_spec.rb:71. Every affected array block is now wrapped in object do...end.
  • examples/05_eval_dataset.rb (pre-renumber: 09_eval_dataset.rb): result[:passed] → result.passed?. The previous code called [] on an Eval::CaseResult and raised NoMethodError at runtime.

Testing

  • New spec/integration/pipeline_eval_spec.rb — three cases guaranteeing pipeline-level run_eval stays functional: happy path, final-step mismatch, and fail-fast propagation when an intermediate validate rejects. Closes the "09 STEP 5 pipeline evaluation" known issue flagged in the 0.7.2 release. The fail-fast case asserts step_status == :validation_failed and the validate's label in details, so a regression that short-circuits on schema instead of validate would fail loudly.

Deleted (private-project cleanup)

  • examples/01_classify_threads.rb, 02_generate_comment.rb, 03_target_audience.rb, 10_reddit_full_showcase.rb, spec/integration/reddit_pipeline_spec.rb — Reddit Promo Planner was a separate private project; its examples do not belong in the gem's public repo.
  • examples/02_output_schema.rb — fully covered by docs/guide/output_schema.md; deleting avoids duplication.

0.7.2 (2026-04-22)

Changed

  • Terminal output labels renamed for consistency with README narrative. print_summary now prints Hardest eval (was Constraining eval), Suggested fallback list (was Suggested chain), and the production-mode table uses first-attempt / fallback % as column headers (was single-shot / escalation). Programmatic metric names unchanged: single_shot_cost, single_shot_latency_ms, escalation_rate. RetryOptimizer::Result exposes hardest_eval as an alias for constraining_eval.

Documentation

  • docs/guide/optimizing_retry_policy.md rewritten — 17.7k → 6.4k characters. Continues the SummarizeArticle narrative from README. Offline mode clearly positioned as wiring-check; real optimization runs via LIVE=1 RUNS=3. Output samples match actual print_summary format.
  • docs/guide/getting_started.md rewritten — 8.7k → 6.1k. Every example uses SummarizeArticle. Evals + CI gates section moved before Budget caps. Structured Prompts / Dynamic Prompts / "Already using ruby_llm?" / Reasoning effort sections removed; content delegated to prompt_ast.md and README.
  • docs/guide/eval_first.md refined — 6.3k → 5.0k. Switched to SummarizeArticle case. Team workflow section compressed with links back to getting_started.md for the matcher chain.
  • docs/guide/testing.md refined — 10.7k → 7.4k. Switched to SummarizeArticle case. Threshold gating / Rake task / baseline walkthrough / prompt A/B sections delegated back to getting_started.md and eval_first.md.
  • docs/guide/output_schema.md DSL bug fix — the Supported constraints table documented JSON Schema camelCase keys (minLength, minItems, additionalProperties) that are not valid DSL arguments. Every copy-paste from the previous table would have raised ArgumentError. Switched to snake_case (min_length, min_items, additional_properties) as the DSL actually expects; added a short note on the internal camelCase conversion.
  • docs/guide/best_practices.md, pipeline.md, migration.md sanity pass — terminology alignment (model escalation → model fallback where narrative; escalate DSL method unchanged) and SummarizeArticle case where the guide is not inherently multi-step.

0.7.1 (2026-04-22)

Changed (behavioral, follow-up to v0.7.0)

  • Step::Base#run_once no longer swallows adapter-phase ArgumentError as :input_error. The previous blanket rescue ArgumentError was there to convert DSL misconfiguration (e.g. missing prompt) into an :input_error Result. Side effect: programmer bugs in adapter code that raised ArgumentError (wrong arity, bad config argument) were silently coerced into :input_error and retried as if the user had given bad input. Now the rescue is narrowed to the Runner-construction phase only — DSL configuration errors still produce :input_error (the prompt has not been set case is regression-tested), but ArgumentError raised from adapter code during Runner#call propagates to the caller. Input-type validation failures continue to produce :input_error through InputValidator's own scoped rescue, unchanged.
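
The narrowing described above can be pictured with a small plain-Ruby sketch. It does not use the gem: SketchResult, SketchRunner, and sketch_run_once are hypothetical stand-ins, and the only point is that the rescue wraps construction alone, so an ArgumentError raised while the adapter runs reaches the caller.

```ruby
# Hypothetical stand-ins for illustration only; these are not the gem's classes.
SketchResult = Struct.new(:status, :detail)

class SketchRunner
  def initialize(prompt:)
    # Misconfiguration is detected while the runner is being built.
    raise ArgumentError, "prompt has not been set" if prompt.nil?

    @prompt = prompt
  end

  def call(adapter)
    adapter.call(@prompt) # a bug in the adapter surfaces here
  end
end

def sketch_run_once(prompt:, adapter:)
  runner = begin
    SketchRunner.new(prompt: prompt)
  rescue ArgumentError => e
    # Only the construction phase is converted into :input_error.
    return SketchResult.new(:input_error, e.message)
  end

  # No rescue here: an ArgumentError from the adapter propagates.
  SketchResult.new(:ok, runner.call(adapter))
end

misconfigured = sketch_run_once(prompt: nil, adapter: ->(prompt) { prompt })
puts misconfigured.status # :input_error

begin
  # The lambda requires two arguments, so calling it with one raises ArgumentError.
  sketch_run_once(prompt: "hi", adapter: ->(prompt, extra) { prompt + extra })
rescue ArgumentError => e
  puts "propagated: #{e.class}"
end
```
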

0.7.0 (2026-04-21)

Breaking changes

  • :adapter_error removed from DEFAULT_RETRY_ON. New default: [:validation_failed, :parse_error]. ruby_llm already retries transport errors (RateLimitError, ServerError, ServiceUnavailableError, OverloadedError, timeouts) at the Faraday layer, so the previous default re-ran the same model on errors the HTTP middleware already retried with backoff. To restore pre-0.7 behavior: retry_on :validation_failed, :parse_error, :adapter_error. Recommended pattern: pair :adapter_error with escalate "model_a", "model_b" — a different model/provider can bypass what transport retry could not.
  • AdapterCaller narrows rescue from StandardError to RubyLLM::Error + Faraday::Error. Provider errors and transport errors that escape ruby_llm's Faraday retry middleware (Faraday::TimeoutError, Faraday::ConnectionFailed) still produce :adapter_error as before. Programmer errors that are neither (NoMethodError, adapter code bugs) now propagate instead of being silently converted to :adapter_error and retried. Known limitation: adapter code raising ArgumentError is still coerced into :input_error by Step::Base#run_once (which rescues ArgumentError for input-type validation). Disambiguating adapter-ArgumentError vs input-validation-ArgumentError requires a run_once refactor and is tracked as a follow-up.

Migration

If you rely on the old behavior, opt in explicitly:

retry_policy do
  attempts 3
  retry_on :validation_failed, :parse_error, :adapter_error
end

Or better, with a model fallback chain:

retry_policy do
  escalate "gpt-4.1-nano", "gpt-4.1-mini"
  retry_on :validation_failed, :parse_error, :adapter_error
end
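
To see why the default matters, here is a plain-Ruby simulation of a two-model chain. It is a sketch under assumptions, not the gem's retry executor: run_with_policy and fake_call are hypothetical, and the loop simply moves to the next model only when the returned status is listed in retry_on.

```ruby
# Mirrors the new default named above; the constant itself is a local stand-in.
SKETCH_DEFAULT_RETRY_ON = %i[validation_failed parse_error].freeze

# Walks the chain in order and stops at the first status that is not retryable.
def run_with_policy(models, retry_on: SKETCH_DEFAULT_RETRY_ON)
  attempts = []
  models.each do |model|
    status = yield(model)
    attempts << [model, status]
    break unless retry_on.include?(status)
  end
  attempts
end

# Hypothetical provider behavior: the first model has an outage, the second works.
fake_call = ->(model) { model == "gpt-4.1-nano" ? :adapter_error : :ok }
chain = %w[gpt-4.1-nano gpt-4.1-mini]

default_run = run_with_policy(chain, &fake_call)
opt_in_run = run_with_policy(
  chain,
  retry_on: %i[validation_failed parse_error adapter_error],
  &fake_call
)

p default_run # [["gpt-4.1-nano", :adapter_error]]
p opt_in_run  # [["gpt-4.1-nano", :adapter_error], ["gpt-4.1-mini", :ok]]
```

With the default list the chain stops after the first :adapter_error; opting in lets the second model absorb the failure.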

0.6.4 (2026-04-20)

Features

  • production_mode: on compare_models and optimize_retry_policy — measures retry-aware, end-to-end cost per successful output. Pass production_mode: { fallback: "gpt-5-mini" } and each candidate runs with a runtime-injected [candidate, fallback] retry chain. The report exposes escalation_rate, single_shot_cost, and effective_cost so "the cheaper candidate" decision matches production cost rather than first-attempt cost.
  • New Report metrics: escalation_rate, single_shot_cost, effective_cost, single_shot_latency_ms, effective_latency_ms, latency_percentiles (p50/p95/max). AggregatedReport averages all of them across runs:.
  • Extended ModelComparison#table — when production_mode: is set, renders a Chain column (candidate → fallback) with single-shot, escalation, effective cost, latency, score. Edge case candidate == fallback renders as a single model and in the escalation column, with retry injection skipped entirely so effective == single-shot by construction, not by coincidence.
  • context[:retry_policy_override] — new context key that nullifies or replaces class-level retry_policy for a single call. Used internally by production-mode injection; safe to use directly when you need a transient override that doesn't mutate the step class.

Scope

  • Single-fallback (2-tier) chains only. Multi-tier chains can be inspected post-hoc via trace.attempts but aren't summarized in the optimize table.
  • Costs with runs: 3 + production_mode: { fallback: "gpt-5-mini" } are ≈3× a single-shot eval plus the actual retry attempts — not 6×. Production-mode metrics come from a single pass.
  • Step-only. Calling compare_models with production_mode: on a Pipeline::Base subclass raises ArgumentError — retry injection is Step-level and pipeline-wide fallback semantics aren't defined yet. Benchmark individual steps.
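
The relationship between the three cost metrics can be sketched in plain Ruby. The formulas below are an assumption made for illustration, since this changelog does not define them: escalation_rate is taken as the share of cases that needed the fallback, single_shot_cost as the sum of first-attempt costs, and effective_cost as the sum over every attempt. The production_metrics helper and the integer cost units are hypothetical.

```ruby
# Each case is the list of attempt costs in hypothetical integer units:
# [candidate] when the first attempt passed, [candidate, fallback] otherwise.
def production_metrics(cases)
  escalated = cases.count { |attempt_costs| attempt_costs.size > 1 }
  {
    escalation_rate: escalated.fdiv(cases.size),
    single_shot_cost: cases.sum(&:first),
    effective_cost: cases.sum(&:sum)
  }
end

# Four cases; the third one fell back to a model that costs four times as much.
metrics = production_metrics([[1], [1], [1, 4], [1]])
p metrics # {escalation_rate: 0.25, single_shot_cost: 4, effective_cost: 8}
```

In this toy data the candidate looks twice as cheap on first-attempt cost as it is once fallbacks are counted, which is the gap production_mode: is meant to expose.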

Documentation

0.6.3 (2026-04-20)

Features

  • runs: parameter on compare_models and optimize_retry_policy — runs each candidate N times per eval and aggregates the mean score, mean cost per run, and mean latency. Reduces sampling variance in live mode where LLM outputs are non-deterministic (gpt-5 family enforces temperature=1.0 server-side, so a single unlucky sample can misclassify a viable candidate as "failing"). Default runs: 1 — backward compatible.
  • RUNS=N on rake ruby_llm_contract:optimize — CLI flag for variance-aware optimization.
  • Eval::AggregatedReport — duck-type Report exposing score (mean), score_min/score_max (spread), total_cost (mean per run), pass_rate (clean-pass count x/N), and clean_passes.
  • Guide: Reducing variance with runs: — when to use it and why.
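
As a rough illustration of the aggregation, the sketch below averages several runs into the fields listed for Eval::AggregatedReport. It is plain Ruby with hypothetical names (SketchRun, aggregate_runs) and does not reproduce the gem's implementation.

```ruby
# One entry per run: the run's score, its total cost, and whether every case passed.
SketchRun = Struct.new(:score, :total_cost, :passed)

def aggregate_runs(runs)
  scores = runs.map(&:score)
  clean = runs.count(&:passed)
  {
    score: scores.sum / scores.size,                 # mean score
    score_min: scores.min,                           # spread, low end
    score_max: scores.max,                           # spread, high end
    total_cost: runs.sum(&:total_cost) / runs.size,  # mean cost per run
    clean_passes: clean,
    pass_rate: "#{clean}/#{runs.size}"
  }
end

runs = [
  SketchRun.new(1.0, 0.25, true),
  SketchRun.new(0.5, 0.75, false),
  SketchRun.new(1.0, 0.25, true),
  SketchRun.new(0.5, 0.75, false)
]
summary = aggregate_runs(runs)
p summary
```

A single run from this set would report either 1.0 or 0.5; the aggregate shows a mean of 0.75 with a 0.5 to 1.0 spread, which is the variance runs: is intended to surface.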

0.6.2 (2026-04-18)

Features

  • Step.optimize_retry_policy — runs compare_models on ALL evals for the step, builds a score matrix, identifies the constraining eval, and suggests a retry chain. Chain's last model always passes all evals (safe fallback).
  • rake ruby_llm_contract:optimize — one-command retry chain optimization. Prints score table, constraining eval, suggested chain, and copy-paste DSL.
  • Offline by default: optimize uses sample_response (zero API calls) unless LIVE=1 or PROVIDER= is set.
  • EVAL_DIRS= support — non-Rails setups can specify eval file directories.
  • Guide: Optimizing retry_policy — full procedure with prerequisites, troubleshooting, and real-world example.

Fixes

  • Chain semantics aligned with retry_executor — retry fires on validation_failed/parse_error, not on low eval score. Disjoint eval coverage (A passes e1, B passes e2, neither passes both) correctly returns empty chain.
  • Removed ActiveSupport dependency from rake task (.presence → .empty?).
  • Added require "set" for non-Rails environments.

0.6.1 (2026-04-17)

Features

  • Multi-provider operator tooling — rake tasks support PROVIDER=openai|anthropic|ollama, CANDIDATES=model@effort,..., and REASONING_EFFORT=low|medium|high.
  • rake ruby_llm_contract:recommend — wraps Step.recommend with CLI interface, prints best config, retry chain, DSL, rationale, and savings.
  • Ollama support: PROVIDER=ollama with configurable OLLAMA_API_BASE.

0.6.0 (2026-04-12)

"What should I do?" — model + configuration recommendation.

Features

  • Step.recommend: ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95) runs eval on all candidates and returns a Recommendation with optimal model, retry chain, rationale, savings vs current config, and to_dsl code output.
  • Candidates as configurations: candidates: accepts { model:, reasoning_effort: } hashes, not just model name strings. gpt-5-mini with reasoning_effort: "low" is a different candidate than with "high".
  • compare_models extended — new candidates: parameter alongside existing models: (backward compatible). Candidate labels include reasoning effort in output table.
  • Per-attempt reasoning_effort in retry policies: escalate accepts config hashes, e.g. escalate({ model: "gpt-4.1-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" }). Each attempt gets its own reasoning_effort forwarded to the provider.
  • pass_rate_ratio — numeric float (0.0–1.0) on Report and ReportStats, complementing the string pass_rate ("3/5").
  • History entries enriched: save_history! accepts reasoning_effort: and stores model, reasoning_effort, pass_rate_ratio in JSONL entries.

Game changer continuity

v0.2: "Which model?"          → compare_models (snapshot)
v0.3: "Did it change?"        → baseline regression (binary)
v0.4: "Show me the trend"     → eval history (time series)
v0.5: "Which prompt is better?" → compare_with (A/B testing)
v0.6: "What should I do?"     → recommend (actionable advice)

0.5.2 (2026-04-06)

Features

  • reasoning_effort forwarded to provider: context: { reasoning_effort: "low" } is now passed through with_params to the LLM. Previously it was accepted as a known context key but silently ignored by the RubyLLM adapter.

0.5.0 (2026-03-25)

Data-Driven Prompt Engineering.

Features

  • observe DSL — soft observations that log but never fail. observe("scores differ") { |o| o[:a] != o[:b] }. Results in result.observations. Logged via Contract.logger when they fail. Runs only when validation passes.
  • compare_with — prompt A/B testing. StepV2.compare_with(StepV1, eval: "regression", model: "nano") returns PromptDiff with improvements, regressions, score_delta, safe_to_switch?. Reuses BaselineDiff internally.
  • RSpec compared_with chain: expect(StepV2).to pass_eval("x").compared_with(StepV1).without_regressions blocks the merge if the new prompt regresses any case.

Game changer continuity

v0.2: "Which model?"          → compare_models (snapshot)
v0.3: "Did it change?"        → baseline regression (binary)
v0.4: "Show me the trend"     → eval history (time series)
v0.5: "Which prompt is better?" → compare_with (A/B testing)

0.4.5 (2026-03-24)

Audit hardening — 18 bugs fixed across 4 audit rounds.

Fixes

  • RakeTask history before abort: track_history now saves all reports (pass and fail) before gating, so failed runs appear in eval history.
  • RSpec/Minitest stub scoping — block form stub_step uses thread-local overrides with real cleanup. Non-block stub_all_steps auto-restored by RSpec around(:each) hook and Minitest setup/teardown.
  • StepAdapterOverride — handles context: nil and respects string key "adapter". Moved to contract.rb so both test frameworks share one mechanism.
  • max_cost fail closed output estimate — preflight uses 1x input tokens as output estimate when max_output not set, preventing cost bypass for output-expensive models.
  • reset_configuration! clears overrides: step_adapter_overrides now cleared on reset.
  • CostCalculator.register_model — validates Numeric, finite?, non-negative. Rejects NaN, Infinity, strings, nil.
  • Pipeline token_budget — rejects negative and zero values (parity with timeout_ms).
  • track_history model fallback — uses step DSL model, then default_model when context has no model. Handles string key "model".
  • estimate_cost / estimate_eval_cost — falls back to step DSL model when no explicit model arg given.
  • stub_steps string keys — both RSpec and Minitest normalize string-keyed options with transform_keys(:to_sym).
  • DSL :default reset: model(:default), temperature(:default), max_cost(:default) reset inherited parent values.
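
The pricing validation listed above (Numeric, finite?, non-negative) can be approximated with the following plain-Ruby sketch. The helper names valid_price? and sketch_register_price are hypothetical and this is not the gem's CostCalculator code; it only shows which inputs such a check rejects.

```ruby
# True only for real, finite, non-negative numbers.
def valid_price?(value)
  value.is_a?(Numeric) && value.real? && value.finite? && value >= 0
end

SKETCH_PRICES = {}

def sketch_register_price(model, input_per_1m:, output_per_1m:)
  prices = { input_per_1m: input_per_1m, output_per_1m: output_per_1m }
  prices.each do |key, value|
    raise ArgumentError, "#{key} must be a finite, non-negative number" unless valid_price?(value)
  end
  SKETCH_PRICES[model] = prices
end

sketch_register_price("ft:gpt-4o", input_per_1m: 3.0, output_per_1m: 6.0)

[Float::NAN, Float::INFINITY, "3.0", nil, -1].each do |bad|
  puts "#{bad.inspect} accepted? #{valid_price?(bad)}"
end
```
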

0.4.4 (2026-03-24)

  • stub_steps (plural) — stub multiple steps with different responses in one block. No nesting needed. Works in RSpec and Minitest:
    stub_steps(
      ClassifyTicket => { response: { priority: "high" } },
      RouteToTeam => { response: { team: "billing" } }
    ) { TicketPipeline.run("test") }

0.4.3 (2026-03-24)

Production feedback release.

Features

  • stub_step block form: stub_step(Step, response: x) { test } auto-resets adapter after block. Works in RSpec and Minitest. Eliminates leaked test state.
  • Minitest per-step routing: stub_step(StepA, ...) now actually routes to StepA only (was setting global adapter, ignoring step class).
  • track_history in RakeTask: t.track_history = true auto-appends every eval run (pass and fail) to .eval_history/. Drift detection without manual save_history! calls.
  • max_cost fail closed — unknown model pricing now refuses the call instead of silently skipping. Set on_unknown_pricing: :warn for old behavior.
  • CostCalculator.register_model — register pricing for custom/fine-tuned models: register_model("ft:gpt-4o", input_per_1m: 3.0, output_per_1m: 6.0).
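
A fail-closed preflight might look roughly like the plain-Ruby sketch below. It is an assumption-laden illustration, not the gem's code: sketch_preflight, the returned symbols, and the :refuse default are hypothetical, while the 1x-input output estimate follows the 0.4.5 entry in this changelog.

```ruby
# Hypothetical pricing table, in dollars per one million tokens.
SKETCH_PRICING = {
  "ft:gpt-4o" => { input_per_1m: 3.0, output_per_1m: 6.0 }
}.freeze

def sketch_preflight(model:, input_tokens:, max_cost:, max_output: nil, on_unknown_pricing: :refuse)
  price = SKETCH_PRICING[model]
  if price.nil?
    # Fail closed: without pricing the cap cannot be enforced, so refuse by default.
    return on_unknown_pricing == :warn ? :allowed_with_warning : :refused_unknown_pricing
  end

  # When max_output is unset, assume the output is as long as the input.
  output_tokens = max_output || input_tokens
  estimate = (input_tokens * price[:input_per_1m] + output_tokens * price[:output_per_1m]) / 1_000_000.0
  estimate <= max_cost ? :allowed : :refused_over_budget
end

p sketch_preflight(model: "mystery-model", input_tokens: 1000, max_cost: 0.01)
p sketch_preflight(model: "mystery-model", input_tokens: 1000, max_cost: 0.01, on_unknown_pricing: :warn)
p sketch_preflight(model: "ft:gpt-4o", input_tokens: 1000, max_cost: 0.01)
p sketch_preflight(model: "ft:gpt-4o", input_tokens: 1000, max_cost: 0.005)
```
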

0.4.2 (2026-03-24)

  • RakeTask lazy context: t.context now accepts a Proc, resolved at task runtime (after :environment). Fixes adapter not being available at Rake load time in Rails apps.

0.4.1 (2026-03-24)

  • RakeTask :environment fix — uses defined?(::Rails) instead of Rake::Task.task_defined?(:environment). Works in Rails 8 without manual Rake::Task.enhance.
  • Concurrent eval deterministic: clone_for_concurrency protocol, ContextHelpers extracted.
  • README — added eval history, concurrency, quality tracking examples.

0.4.0 (2026-03-24)

Observability & Scale — see what changed, run it fast, debug it easily.

Features

  • Structured logging: Contract.configure { |c| c.logger = Rails.logger }. Auto-logs model, status, latency, tokens, cost on every step.run.
  • Batch eval concurrency: run_eval("regression", concurrency: 4). Parallel case execution via Concurrent::Future. 4x faster CI for large eval suites.
  • Eval history & trending: report.save_history! appends to JSONL. report.eval_history returns EvalHistory with score_trend, drift?, run-by-run scores.
  • Pipeline per-step eval: add_case(..., step_expectations: { classify: { priority: "high" } }). See which step in a pipeline regressed.
  • Minitest support: assert_satisfies_contract, assert_eval_passes, stub_step for Minitest users. require "ruby_llm/contract/minitest".
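
The JSONL history idea can be sketched with the standard library alone. The helpers below (append_history, score_trend, sketch_drift?) and the 0.05 tolerance are hypothetical; the gem's EvalHistory may define drift differently.

```ruby
require "json"
require "tempfile"

# Appends one JSON object per line, which is all JSONL requires.
def append_history(path, entry)
  File.open(path, "a") { |file| file.puts(JSON.generate(entry)) }
end

# Reads the scores back in the order the runs were recorded.
def score_trend(path)
  File.readlines(path, chomp: true)
      .reject(&:empty?)
      .map { |line| JSON.parse(line)["score"] }
end

# Assumed definition: the latest score fell below the first by more than the tolerance.
def sketch_drift?(scores, tolerance: 0.05)
  return false if scores.size < 2

  scores.first - scores.last > tolerance
end

trend = nil
Tempfile.create("eval_history") do |file|
  [1.0, 0.9, 0.8].each { |score| append_history(file.path, { "score" => score }) }
  trend = score_trend(file.path)
end

p trend
p sketch_drift?(trend)
```
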

Game changer continuity

v0.2: "Which model?"          → compare_models (snapshot)
v0.3: "Did it change?"        → baseline regression (binary)
v0.4: "Show me the trend"     → eval history (time series)
      "Which step changed?"   → pipeline per-step eval
      "Run it fast"           → batch concurrency

0.3.7 (2026-03-24)

  • Trait missing key = error: expected_traits: { title: 0..5 } on output {} now fails instead of silently passing.
  • nil input in dynamic prompts: run(nil) with prompt { |input| ... } correctly passes nil to block.
  • Defensive sample pre-validation: sample_response uses the same parser as runtime (handles code fences, BOM, prose around JSON).
  • Baseline diff excludes skipped — self-compare with skipped cases no longer shows artificial score delta.
  • Zeitwerk eval/ ignore: eager_load_contract_dirs! ignores eval/ subdirs before eager load.

0.3.6 (2026-03-24)

  • Recursive array/object validation — nested arrays (array of array of string) validated recursively. Object items validated even without :properties (e.g. additionalProperties: false).
  • Deep symbolize in sample pre-validation — array samples with string keys ([{"name" => "Alice"}]) correctly symbolized before schema validation.

0.3.5 (2026-03-24)

  • String constraints in SchemaValidator: minLength/maxLength enforced for root and nested strings.
  • Array item validation — scalar items (string, integer) validated against items schema type and constraints.
  • Non-JSON sample_response fails fast: sample_response("hello") with object schema raises ArgumentError at definition time instead of silently passing.
  • max_tokens in KNOWN_CONTEXT_KEYS — no more spurious "Unknown context keys" warning.
  • Duplicate models deduplicated: compare_models(models: ["m", "m"]) runs the model once.

0.3.4 (2026-03-24)

  • SchemaValidator validates non-object roots — boolean, integer, number, array root schemas now enforce type, min/max, enum, minItems/maxItems. Previously only object schemas were validated.
  • Removed passing cases = regression: regressed? returns true when baseline had passing cases that are now missing. Prevents gate bypass by deleting eval cases.
  • JSON string sample_response fixed: sample_response('{"name":"Alice"}') correctly parsed for pre-validation instead of double-encoding.
  • context[:max_tokens] forwarded — overrides step's max_output for adapter call AND budget precheck.

0.3.3 (2026-03-23)

  • Skipped cases visible in regression diff — baseline PASS → current SKIP now detected as regression by without_regressions and fail_on_regression.
  • Skip only on missing adapter — eval runner no longer masks evaluator errors as SKIP. Only "No adapter configured" triggers skip.
  • Array/Hash sample pre-validation: sample_response([{...}]) correctly validated against schema instead of silently skipping.
  • assume_model_exists: false forwarded — boolean false no longer dropped by truthiness check in adapter options.
  • Duplicate case names caught at definition: add_case/verify with same name raises immediately, not at run time.

0.3.2 (2026-03-23)

  • Array response preserved: Adapters::RubyLLM no longer stringifies Array content. Steps with output_type Array work correctly.
  • Falsy prompt input: run(false) and build_messages(false) pass false to dynamic prompt blocks instead of falling back to instance_eval.
  • retry_on flatten: retry_on([:a, :b]) no longer wraps in nested array.
  • Builder reset: Prompt::Builder resets nodes on each build (no accumulation on reuse).
  • Pipeline false output: output: false no longer shows "(no output)" in pretty_print.

0.3.1 (2026-03-23)

Fixes from persona_tool production deployment (4 services migrated).

  • Proc/Lambda in expected_traits: expected_traits: { score: ->(v) { v > 3 } } now works.
  • Zeitwerk eager-load: load_evals! eager-loads app/contracts/ and app/steps/ before loading eval files. Fixes uninitialized constant errors in Rake tasks.
  • Falsy values: expected: false, input: false, sample_response(nil) all handled correctly.
  • Context key forwarding: provider: and assume_model_exists: forwarded to adapter. schema: and max_tokens: are step-level only (no split-brain).
  • Deep-freeze immutability — constructors never mutate caller's data.
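
One way to get "frozen result, untouched caller data" is to copy while freezing. The sketch below is plain Ruby with a hypothetical helper name (deep_dup_freeze) and handles only Hash, Array, and String; it is an illustration of the idea, not the gem's implementation.

```ruby
# Returns a frozen deep copy; the object passed in is never frozen or modified.
def deep_dup_freeze(value)
  case value
  when Hash
    value.each_with_object({}) { |(key, val), copy| copy[key] = deep_dup_freeze(val) }.freeze
  when Array
    value.map { |item| deep_dup_freeze(item) }.freeze
  when String
    value.dup.freeze
  else
    value # numbers, symbols, nil, true and false are already immutable
  end
end

caller_data = { title: "Hello", tags: ["news", "tech"] }
frozen_copy = deep_dup_freeze(caller_data)

caller_data[:tags] << "added later" # the caller's hash is still mutable
p frozen_copy[:tags]                # the copy does not see the change
p frozen_copy.frozen?
```
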

0.3.0 (2026-03-23)

Baseline regression detection — know when quality drops before users do.

Features

  • report.save_baseline! — serialize eval results to .eval_baselines/ (JSON, git-tracked)
  • report.compare_with_baseline — returns BaselineDiff with regressions, improvements, score_delta, new/removed cases
  • diff.regressed? — true when any previously-passing case now fails
  • without_regressions RSpec chain: expect(Step).to pass_eval("x").without_regressions
  • RakeTask fail_on_regression — blocks CI when regressions detected
  • RakeTask save_baseline — auto-save after successful run
  • Migration guide: docs/guide/migration.md with 7 patterns for adopting the gem in existing Rails apps
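
The regression check can be illustrated with plain hashes of case name to status. This sketch is not BaselineDiff itself: sketch_baseline_diff is a hypothetical helper, and it folds in the later 0.3.3 and 0.3.4 behavior from this changelog, where a previously passing case that is now skipped or missing also counts as a regression.

```ruby
# baseline and current map case names to :passed, :failed, or :skipped.
def sketch_baseline_diff(baseline, current)
  # A case regresses when it passed before and is anything else now (or is gone).
  regressions = baseline.select { |name, status| status == :passed && current[name] != :passed }.keys
  # A case improves when it existed before without passing and passes now.
  improvements = current.select do |name, status|
    status == :passed && baseline.key?(name) && baseline[name] != :passed
  end.keys

  { regressions: regressions, improvements: improvements, regressed: regressions.any? }
end

baseline = { "billing" => :passed, "refund" => :failed, "spam" => :passed, "legal" => :passed }
current  = { "billing" => :passed, "refund" => :passed, "spam" => :skipped }

diff = sketch_baseline_diff(baseline, current)
p diff
```

Here "spam" regresses because it is now skipped and "legal" because it was deleted, so removing a failing case does not let the gate pass.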

Stats

  • 1086 tests, 0 failures

0.2.3 (2026-03-23)

Production hardening from senior Rails review panel.

  • around_call propagates exceptions — no longer silently swallows DB errors, timeouts, etc. User who wants swallowing can rescue in their block.
  • Nil section content skipped: section "X", nil no longer renders "null" to the LLM. Section is omitted entirely.
  • Range support in expected: (for example, expected: { score: 1..5 }) now works in add_case. Previously only Regexp was supported.
  • Trace#dig: trace.dig(:usage, :input_tokens) works on both Step and Pipeline traces.

0.2.2 (2026-03-23)

Fixes from first real-world integration (persona_tool).

  • around_call fires per-run — not per-attempt. With retry_policy, callback fires once with final result. Signature: around_call { |step, input, result| ... }
  • Result#trace always Trace object — never bare Hash. result.trace.model works on success AND failure.
  • around_call exception safe — warns and returns result instead of crashing.
  • model DSL: model "gpt-4o-mini" per-step. Priority: context > step DSL > global config.
  • Test adapter raw_output always String — Hash/Array normalized to .to_json.
  • Trace#dig: trace.dig(:usage, :input_tokens) works.

0.2.1 (2026-03-23)

Production DX improvements from first real-world integration (persona_tool).

Features

  • temperature DSL: temperature 0.3 in step definition, overridable via context: { temperature: 0.7 }. RubyLLM handles per-model normalization natively.
  • around_call hook — callback for logging, metrics, observability. Replaces need for custom middleware.
  • build_messages public — inspect rendered prompt without running the step.
  • stub_step RSpec helper: stub_step(MyStep, response: { ... }) reduces test boilerplate. Auto-included via require "ruby_llm/contract/rspec".
  • estimate_cost / estimate_eval_cost — predict spend before API calls.

Fixes

  • Reload lifecycle: load_evals! clears definitions before re-loading. Railtie hooks config.to_prepare for development reload. define_eval warns on duplicate name (suppressed during reload).
  • Pipeline eval cost — uses Pipeline::Trace#total_cost (all steps), not just last step.
  • Adapter isolation: compare_models and run_all_own_evals deep-dup context per run.
  • Offline mode — cases without adapter return :skipped instead of crashing. Skipped cases excluded from score.
  • expected_traits reachable from define_eval DSL via add_case.
  • verify raises when both positional and expect: keyword provided.
  • best_for excludes zero-score models from recommendation.
  • print_summary replaces pretty_print (avoids Kernel#pretty_print shadow).
  • CaseResult#to_h round-trips correctly (name: key).

Docs

  • All 5 guides updated for v0.2 API
  • Symbol keys documented
  • Retry model priority documented
  • Test adapter format documented

Stats

  • 1077 tests, 0 failures
  • 3 architecture review rounds, 32 findings fixed

0.2.0 (2026-03-23)

Contracts for LLM quality. Know which model to use, what it costs, and when accuracy drops.

Breaking changes

  • report.results returns CaseResult objects instead of hashes. Use result.name, result.passed?, result.score instead of result[:case_name], result[:passed]. CaseResult#to_h for backward compat.
  • report.print_summary replaces report.pretty_print (avoids shadowing Kernel#pretty_print).

Features

  • add_case in define_eval: add_case "billing", input: "...", expected: { priority: "high" } with partial matching. Supports expected_traits: for regex/range matching.
  • CaseResult value objects: result.name, result.passed?, result.output, result.expected, result.mismatches (structured diff), result.cost, result.duration_ms.
  • report.failures — returns only failed cases. report.skipped counts skipped (offline) cases.
  • Model comparison: Step.compare_models("eval", models: %w[nano mini full]) runs same eval across models. Returns table with score/cost/latency per model. comparison.best_for(min_score: 0.95) returns cheapest model meeting threshold.
  • Cost tracking: report.total_cost, report.avg_latency_ms, per-case result.cost. Pipeline eval uses total pipeline cost, not just last step.
  • Cost prediction: Step.estimate_cost(input:, model:) and Step.estimate_eval_cost("eval", models: [...]) predict spend before API calls.
  • CI gating: pass_eval("regression").with_minimum_score(0.8).with_maximum_cost(0.01). RakeTask with suite-level minimum_score and maximum_cost.
  • RubyLLM::Contract.run_all_evals — discovers all Steps/Pipelines with evals, runs them all. Includes inherited evals.
  • RubyLLM::Contract::RakeTask: rake ruby_llm_contract:eval with minimum_score, maximum_cost, fail_on_empty, eval_dirs.
  • Rails Railtie — auto-loads eval files via config.after_initialize + config.to_prepare (supports development reload).
  • Offline mode — cases without adapter return :skipped instead of crashing. Skipped cases excluded from score/passed.
  • Safe define_eval — warns on duplicate name; suppressed during reload.

Fixes

  • P1: Eval files not autoloaded by Rails — Railtie uses load (not Zeitwerk). Hooks into reloader for dev.
  • P2: report.results returns raw Hashes — now returns CaseResult objects.
  • P3: No way to run all evals at once: Contract.run_all_evals + Rake task.
  • P4: String vs symbol key mismatch — warns when validate or verify proc returns nil.
  • Pipeline eval cost — uses Pipeline::Trace#total_cost (all steps), not just last step.
  • Reload lifecycle: load_evals! clears definitions before re-loading. Registry filters stale hosts.
  • Adapter isolation: compare_models and run_all_own_evals deep-dup context per run.

Verified with real API

Model                      Score       Cost  Avg Latency
---------------------------------------------------------
gpt-4.1-nano                0.67    $0.000032      687ms
gpt-4.1-mini                1.00    $0.000102     1070ms

Stats

  • 1077 tests, 0 failures
  • 3 architecture review rounds, 32 findings fixed
  • Verified with real OpenAI API (gpt-4.1-nano, gpt-4.1-mini)

0.1.0 (2026-03-20)

Initial release.

Features

  • Step abstraction: RubyLLM::Contract::Step::Base with prompt DSL, typed input/output
  • Output schema — declarative structure via ruby_llm-schema, sent to provider for enforcement
  • Validate — business logic checks (1-arity and 2-arity with input cross-validation)
  • Retry with model escalation — start cheap, auto-escalate on contract failure or network error
  • Preflight limits: max_input, max_cost, max_output refuse before calling the LLM
  • Pipeline — multi-step composition with fail-fast, timeout, token budget
  • Eval — offline contract verification with define_eval, run_eval, zero-verify auto-case
  • Adapters — RubyLLM (production), Test (deterministic specs)
  • RSpec matchers: satisfy_contract, pass_eval
  • Structured trace — model, latency, tokens, cost, attempt log per step

Robustness

  • 1005 tests, 0 failures
  • 42 bugs found and fixed via 10 rounds of adversarial testing
  • 0 RuboCop offenses
  • Parser handles: markdown code fences, UTF-8 BOM, JSON extraction from prose
  • SchemaValidator: full nested validation, additionalProperties, minItems/maxItems, minLength/maxLength
  • Deep-frozen parsed_output prevents mutation via shared references
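
The three parser cases named above (code fences, UTF-8 BOM, JSON embedded in prose) can be handled with a few lines of standard-library Ruby. The sketch below is an illustration under assumptions, not the gem's parser: extract_json is a hypothetical helper, and it uses a simple first-bracket to last-bracket slice as the fallback.

```ruby
require "json"

FENCE = "`" * 3
# Captures the body of the first fenced block, with or without a json label.
FENCED_BLOCK = /#{FENCE}(?:json)?\s*(.*?)#{FENCE}/m

def extract_json(raw)
  # Drop a leading byte order mark and surrounding whitespace.
  text = raw.dup.force_encoding("UTF-8").delete_prefix("\uFEFF").strip
  fenced = text.match(FENCED_BLOCK)
  text = fenced[1].strip if fenced
  JSON.parse(text, symbolize_names: true)
rescue JSON::ParserError
  # Fallback: parse the span from the first opening bracket to the last closing one.
  start = text.index(/[{\[]/)
  finish = text.rindex(/[}\]]/)
  raise unless start && finish && finish > start

  JSON.parse(text[start..finish], symbolize_names: true)
end

p extract_json("\uFEFF{\"priority\": \"high\"}")
p extract_json("#{FENCE}json\n{\"priority\": \"high\"}\n#{FENCE}")
p extract_json("Sure! Here is the result: {\"priority\": \"high\"} Hope that helps.")
```

Input with no recoverable JSON still raises JSON::ParserError, so a caller can map it to a parse failure.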