feat(loops): surface tokenUsage on LoopResult + reportLoopUsage bridge for runProfileMatrix #74

Merged
tangletools merged 5 commits into main from feat/loop-token-usage-for-profile-matrix
May 30, 2026

@tangletools
Contributor

What

Makes runLoop surface the data that agent-eval's new runProfileMatrix needs to run its backend-integrity guard. This answers the question "how does agent-runtime consume the new primitive?"

Why

runLoop already extracted per-call tokensIn/tokensOut from llm_call events, but only aggregated costUsd — the token counts were dropped before reaching Iteration or LoopResult. A product's runProfileMatrix/runCampaign dispatch that wraps runLoop could report cost but had no tokens to report, so agent-eval's assertRealBackend (which keys on tokenUsage) would misread a real run as a stub and throw.

Changes

  • Iteration + LoopResult gain tokenUsage: { input, output }, summed per-iteration (across llm_call events) and across iterations (on the result). Structurally matches agent-eval's RunTokenUsage/CampaignTokenUsage.
  • New reportLoopUsage(cost, result) — forwards a finished loop's cost and tokens into a campaign cost meter in one call. This is the trivial consumption path:
    const dispatch: ProfileDispatchFn<S, A> = async (profile, scenario, ctx) => {
      const result = await runLoop({ ...optsFor(profile, scenario), ctx: loopCtx })
      reportLoopUsage(ctx, result)   // cost + tokens → integrity guard sees real activity
      return result.winner?.output as A
    }
    Typed structurally (UsageSink) so loops/ stays free of an agent-eval import.
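
To make the two aggregation levels and the structural sink concrete, here is a minimal sketch. It is not the code from this PR: every name in it (`LlmCall`, `Usage`, `UsageSinkSketch`, `sumCalls`, `sumIterations`, `reportUsageSketch`, and the `addCost`/`addTokens` methods) is a hypothetical stand-in, and the real `UsageSink` and `reportLoopUsage` signatures may differ.

```typescript
// Hypothetical shapes for illustration; the real agent-runtime types may differ.
interface TokenUsage { input: number; output: number }
interface LlmCall { costUsd: number; tokensIn: number; tokensOut: number }
interface Usage { costUsd: number; tokenUsage: TokenUsage }

// Assumed structural sink: any object with these two methods qualifies,
// so this module needs no import from the eval package.
interface UsageSinkSketch {
  addCost(usd: number): void
  addTokens(usage: TokenUsage): void
}

const emptyUsage = (): Usage => ({ costUsd: 0, tokenUsage: { input: 0, output: 0 } })

// Per-iteration totals: sum cost and tokens over every llm_call event.
function sumCalls(calls: LlmCall[]): Usage {
  const total = emptyUsage()
  for (const call of calls) {
    total.costUsd += call.costUsd
    total.tokenUsage.input += call.tokensIn
    total.tokenUsage.output += call.tokensOut
  }
  return total
}

// Loop-level totals: sum the per-iteration totals.
function sumIterations(iterations: Usage[]): Usage {
  const total = emptyUsage()
  for (const it of iterations) {
    total.costUsd += it.costUsd
    total.tokenUsage.input += it.tokenUsage.input
    total.tokenUsage.output += it.tokenUsage.output
  }
  return total
}

// Forward cost and tokens together so a caller cannot report one without the other.
function reportUsageSketch(sink: UsageSinkSketch, result: Usage): void {
  sink.addCost(result.costUsd)
  sink.addTokens(result.tokenUsage)
}
```

Reporting both values in one call is the point of the bridge: a dispatch that forwards only cost leaves the token counters at zero, which is the condition the integrity guard treats as a stub.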

Verification

  • typecheck + build green; full suite 381/381.
  • Extended the existing cost-aggregation test to assert token aggregation (per-iteration + total) and reportLoopUsage forwarding — the regression being "a real run misread as a stub."

Note: agent-runtime does not import runProfileMatrix (products do); its role is to surface the data. The @tangle-network/agent-eval dep bump (^0.54.0 → latest) is a separate follow-up once runProfileMatrix is published.

drewstone added 5 commits May 30, 2026 08:47
…Usage bridge

runLoop tracked per-call tokensIn/tokensOut (extractLlmCallEvent) but only
aggregated costUsd — token counts were dropped before reaching Iteration or
LoopResult. A runProfileMatrix/runCampaign dispatch wrapping runLoop could
report cost but had no tokens to report, so agent-eval's backend-integrity
guard (assertRealBackend, which keys on tokenUsage) would misread a real run
as a stub and throw.

- Iteration + LoopResult gain tokenUsage: { input, output }, summed across
  every llm_call event (per iteration) and across iterations (LoopResult).
- reportLoopUsage(cost, result) forwards a finished loop's cost + tokens into
  a campaign cost meter in one call — the trivial consumption path for the new
  runProfileMatrix primitive. Typed structurally so loops stay free of an
  agent-eval import.

Extends the existing cost-aggregation test to assert token aggregation +
reportLoopUsage forwarding. Full suite 381 green.

Consumes the published runProfileMatrix + token-capture release. 7-minor
jump verified: typecheck + build + full suite (381) green.
…pter

The seam critique found reportLoopUsage had one consumer (a test) and zero
products: wiring runLoop into runProfileMatrix/runCampaign required hand-building
ExecCtx, hand-adapting the campaign trace, and remembering to forward usage
(forgetting the last yields a {0,0} stub cell). loopDispatch collapses all three
into one typed call:

  const dispatch = loopDispatch({ sandboxClient, toLoopOptions })
  await runProfileMatrix({ profiles, scenarios, dispatch, judges, commitSha })

It builds the ExecCtx, forwards loop.* trace events into the campaign's scoped
trace (campaignTraceToLoopEmitter), runs runLoop, reports cost+tokens via
reportLoopUsage internally, and returns winner.output. loopCampaignDispatch is
the runCampaign (no-profile) variant. AgentProfile is imported from agent-eval
(the eval-harness type that ProfileDispatchFn keys on), NOT from sandbox, which
closes the name-collision footgun at this call site.

Tests: returns winner artifact + reports exact usage + forwards trace spans;
usage still flows on a validator-failing run (must not read as a stub).
Full suite 383 green.
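
The adapter described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the shipped loopDispatch: `makeLoopDispatchSketch`, `CampaignCtxSketch`, `UsageSinkSketch`, and the injected `runLoop` parameter are hypothetical, ExecCtx construction is left out, and the real factory takes `{ sandboxClient, toLoopOptions }` as shown above.

```typescript
// Hypothetical types; the real ExecCtx, trace, and cost-meter shapes are not shown in this PR.
interface TokenUsage { input: number; output: number }
interface LoopResultSketch<A> {
  winner?: { output: A }
  costUsd: number
  tokenUsage: TokenUsage
}
interface TraceEvent { name: string }
interface UsageSinkSketch {
  addCost(usd: number): void
  addTokens(usage: TokenUsage): void
}
interface CampaignCtxSketch {
  cost: UsageSinkSketch
  trace: { emit(event: TraceEvent): void }
}

// runLoop is injected here so the sketch stays self-contained.
type RunLoopSketch<O, A> = (
  opts: O,
  emit: (event: TraceEvent) => void,
) => Promise<LoopResultSketch<A>>

function makeLoopDispatchSketch<P, S, O, A>(deps: {
  runLoop: RunLoopSketch<O, A>
  toLoopOptions: (profile: P, scenario: S) => O
}) {
  return async (profile: P, scenario: S, ctx: CampaignCtxSketch): Promise<A | undefined> => {
    // 1. Adapt the trace: forward only loop.* events into the campaign trace.
    const emit = (event: TraceEvent): void => {
      if (event.name.startsWith("loop.")) ctx.trace.emit(event)
    }
    // 2. Run the loop with options derived from the profile and scenario.
    const result = await deps.runLoop(deps.toLoopOptions(profile, scenario), emit)
    // 3. Report usage unconditionally, so a run with no winner still shows
    //    real activity instead of a {0,0} cell.
    ctx.cost.addCost(result.costUsd)
    ctx.cost.addTokens(result.tokenUsage)
    return result.winner?.output
  }
}
```

Reporting before inspecting `winner` mirrors the second test case in this commit: usage must reach the meter even when no candidate passes validation.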
…ard dependency

Version-discipline fix (boundary critique, VERSIONING 3/10). agent-eval was the
lone hard dependency while sandbox + agent-knowledge are already peers. A hard
dep lets pnpm install a SECOND, divergent agent-eval tree with an incompatible
RunRecord/DefaultVerdict; today only pnpm.overrides prevents it. As a peer
(>=0.61.0 <1.0.0, required — not optional), a consumer running a stale or
divergent substrate gets a loud unmet-peer warning instead of a silent split
tree. agent-eval moves to devDependencies for agent-runtime's own build/test.
Typecheck + full suite (383) green with the peer layout.
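
As a rough illustration of which installed versions the peer range above accepts, here is a simplified check. It is an assumption-laden sketch: `satisfiesPeerSketch` and its helpers are hypothetical, it handles only plain x.y.z versions (no prerelease or build tags), and it does not reproduce how pnpm actually resolves peers.

```typescript
interface Version { major: number; minor: number; patch: number }

// Parse a plain "x.y.z" string; anything else is rejected.
function parseVersion(text: string): Version {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(text)
  if (!match) throw new Error(`not a plain x.y.z version: ${text}`)
  return { major: Number(match[1]), minor: Number(match[2]), patch: Number(match[3]) }
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: Version, b: Version): number {
  return a.major - b.major || a.minor - b.minor || a.patch - b.patch
}

// The range from the commit message: >=0.61.0 <1.0.0.
function satisfiesPeerSketch(installed: string): boolean {
  const v = parseVersion(installed)
  const lower: Version = { major: 0, minor: 61, patch: 0 }
  const upper: Version = { major: 1, minor: 0, patch: 0 }
  return compareVersions(v, lower) >= 0 && compareVersions(v, upper) < 0
}
```

Under this check the earlier 0.54.0 baseline falls outside the range, which is the case the unmet-peer warning is meant to surface.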
tangletools merged commit fea51a4 into main on May 30, 2026
1 check failed