feat(loops): surface tokenUsage on LoopResult + reportLoopUsage bridge for runProfileMatrix #74
Merged
…Usage bridge
runLoop tracked per-call tokensIn/tokensOut (extractLlmCallEvent) but only
aggregated costUsd — token counts were dropped before reaching Iteration or
LoopResult. A runProfileMatrix/runCampaign dispatch wrapping runLoop could
report cost but had no tokens to report, so agent-eval's backend-integrity
guard (assertRealBackend, which keys on tokenUsage) would misread a real run
as a stub and throw.
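The failure mode above can be illustrated with a minimal sketch. This is not agent-eval's actual assertRealBackend; the function name `assertRealBackendSketch`, the `CellReport` shape, and the error message are hypothetical, and the only behaviour taken from the text is that the guard keys on tokenUsage rather than on cost.

```typescript
// Hypothetical shapes for illustration only.
interface TokenUsage { input: number; output: number }
interface CellReport { costUsd: number; tokenUsage?: TokenUsage }

// Assumed guard behaviour: a cell with no tokens is treated as a stub,
// regardless of how much cost it reported.
function assertRealBackendSketch(cell: CellReport): void {
  const t = cell.tokenUsage
  if (!t || t.input + t.output === 0) {
    throw new Error('backend looks like a stub: no token usage reported')
  }
}

// Returns true when the guard rejects the cell.
function guardRejects(cell: CellReport): boolean {
  try {
    assertRealBackendSketch(cell)
    return false
  } catch {
    return true
  }
}
```

Under these assumptions, a dispatch that forwards only cost (`{ costUsd: 0.5 }`) is rejected even though the run was real, which is the misread this change addresses.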
- Iteration + LoopResult gain tokenUsage: { input, output }, summed across
every llm_call event (per iteration) and across iterations (LoopResult).
- reportLoopUsage(cost, result) forwards a finished loop's cost + tokens into
a campaign cost meter in one call — the trivial consumption path for the new
runProfileMatrix primitive. Typed structurally so loops stay free of an
agent-eval import.
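The two changes above can be sketched roughly as follows. All type shapes here are assumptions for illustration: `LlmCallEvent`, `UsageTotals`, the `addCost`/`addTokens` methods on `UsageSink`, and the `...Sketch` function names are hypothetical and may not match the real agent-runtime definitions. Only the field names tokensIn, tokensOut, costUsd, and tokenUsage: { input, output } come from the text.

```typescript
interface TokenUsage { input: number; output: number }

// Assumed per-call event payload.
interface LlmCallEvent { tokensIn?: number; tokensOut?: number; costUsd?: number }

// Assumed common shape of the usage fields on Iteration and LoopResult.
interface UsageTotals { costUsd: number; tokenUsage: TokenUsage }

// Per iteration: sum cost and tokens across every llm_call event.
function sumIterationSketch(events: LlmCallEvent[]): UsageTotals {
  const total: UsageTotals = { costUsd: 0, tokenUsage: { input: 0, output: 0 } }
  for (const e of events) {
    total.costUsd += e.costUsd ?? 0
    total.tokenUsage.input += e.tokensIn ?? 0
    total.tokenUsage.output += e.tokensOut ?? 0
  }
  return total
}

// Per loop: sum the per-iteration totals.
function sumLoopSketch(iterations: UsageTotals[]): UsageTotals {
  const total: UsageTotals = { costUsd: 0, tokenUsage: { input: 0, output: 0 } }
  for (const it of iterations) {
    total.costUsd += it.costUsd
    total.tokenUsage.input += it.tokenUsage.input
    total.tokenUsage.output += it.tokenUsage.output
  }
  return total
}

// Structural sink: any object with these two methods qualifies,
// so this module needs no import from agent-eval.
interface UsageSink {
  addCost(usd: number): void
  addTokens(usage: TokenUsage): void
}

// Forward a finished loop's cost and tokens in one call.
function reportLoopUsageSketch(cost: UsageSink, result: UsageTotals): void {
  cost.addCost(result.costUsd)
  cost.addTokens(result.tokenUsage)
}
```

Typing the sink structurally means a campaign cost meter from agent-eval can be passed in directly as long as it exposes compatible methods; whether the real meter uses these method names is not stated in the source.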
Extends the existing cost-aggregation test to assert token aggregation +
reportLoopUsage forwarding. Full suite 381 green.
Consumes the published runProfileMatrix + token-capture release. 7-minor jump verified: typecheck + build + full suite (381) green.
…pter
The seam critique found reportLoopUsage had one consumer (a test) and zero
products: wiring runLoop into runProfileMatrix/runCampaign required hand-building
ExecCtx, hand-adapting the campaign trace, and remembering to forward usage
(forgetting the last yields a {0,0} stub cell). loopDispatch collapses all three
into one typed call:
const dispatch = loopDispatch({ sandboxClient, toLoopOptions })
await runProfileMatrix({ profiles, scenarios, dispatch, judges, commitSha })
It builds the ExecCtx, forwards loop.* trace events into the campaign's scoped
trace (campaignTraceToLoopEmitter), runs runLoop, reports cost+tokens via
reportLoopUsage internally, and returns winner.output. loopCampaignDispatch is
the runCampaign (no-profile) variant. AgentProfile imported from agent-eval
(the eval-harness type ProfileDispatchFn keys on), NOT sandbox's — closes the
name-collision footgun at this call site.
Tests: returns winner artifact + reports exact usage + forwards trace spans;
usage still flows on a validator-failing run (must not read as a stub).
Full suite 383 green.
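A rough sketch of the steps this adapter collapses, under stated assumptions: the real loopDispatch takes { sandboxClient, toLoopOptions } and builds an ExecCtx, neither of which is modelled here. `loopDispatchSketch`, `CellCtx`, `RunLoopFn`, the string scenario, and the sink method names are all hypothetical; only the sequence (forward loop.* trace events, run the loop, report cost + tokens, return winner.output) follows the description above.

```typescript
interface TokenUsage { input: number; output: number }
interface TraceEvent { name: string; data?: unknown }
interface UsageSink { addCost(usd: number): void; addTokens(u: TokenUsage): void }

// Assumed result shape: a winner artifact plus aggregated usage.
interface LoopResult<T> { winner: { output: T }; costUsd: number; tokenUsage: TokenUsage }

// Assumed per-cell context handed to a dispatch by the campaign runner.
interface CellCtx { cost: UsageSink; trace: { emit(e: TraceEvent): void } }

type LoopOptions = { task: string }
type RunLoopFn<T> = (
  opts: LoopOptions & { emit: (e: TraceEvent) => void },
) => Promise<LoopResult<T>>

function loopDispatchSketch<T>(
  runLoop: RunLoopFn<T>,
  toLoopOptions: (scenario: string) => LoopOptions,
) {
  return async (scenario: string, ctx: CellCtx): Promise<T> => {
    // Forward only loop.* events into the campaign's scoped trace.
    const emit = (e: TraceEvent): void => {
      if (e.name.startsWith('loop.')) ctx.trace.emit(e)
    }
    const result = await runLoop({ ...toLoopOptions(scenario), emit })
    // Report usage before returning, so the caller cannot forget this step.
    ctx.cost.addCost(result.costUsd)
    ctx.cost.addTokens(result.tokenUsage)
    return result.winner.output
  }
}
```

Putting the usage report inside the adapter is what removes the forgotten-forwarding case: a caller only ever sees the returned artifact, and the cell's meter is already populated.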
…ard dependency
Version-discipline fix (boundary critique, VERSIONING 3/10). agent-eval was the
lone hard dependency while sandbox + agent-knowledge are already peers. A hard
dep lets pnpm install a SECOND, divergent agent-eval tree with an incompatible
RunRecord/DefaultVerdict; today only pnpm.overrides prevents it. As a peer
(>=0.61.0 <1.0.0, required, not optional), a consumer running a stale or
divergent substrate gets a loud unmet-peer warning instead of a silent split
tree. agent-eval moves to devDependencies for agent-runtime's own build/test.
Typecheck + full suite (383) green with the peer layout.
What
Makes runLoop surface the data agent-eval's new runProfileMatrix needs to run its backend-integrity guard, answering "how does agent-runtime consume the new primitive."
Why
runLoop already extracted per-call tokensIn/tokensOut from llm_call events, but only aggregated costUsd: the token counts were dropped before reaching Iteration or LoopResult. A product's runProfileMatrix/runCampaign dispatch that wraps runLoop could report cost but had no tokens to report, so agent-eval's assertRealBackend (which keys on tokenUsage) would misread a real run as a stub and throw.
Changes
- Iteration + LoopResult gain tokenUsage: { input, output }, summed per-iteration (across llm_call events) and across iterations (on the result). Structurally matches agent-eval's RunTokenUsage/CampaignTokenUsage.
- reportLoopUsage(cost, result) forwards a finished loop's cost and tokens into a campaign cost meter in one call. This is the trivial consumption path; it is typed structurally (UsageSink) so loops/ stays free of an agent-eval import.
Verification
Extends the existing cost-aggregation test to assert token aggregation + reportLoopUsage forwarding, the regression being "a real run misread as a stub."
Note: agent-runtime does not import runProfileMatrix (products do); its role is to surface the data. The @tangle-network/agent-eval dep bump (^0.54.0 → latest) is a separate follow-up once runProfileMatrix is published.