feat(loops): surface tokenUsage on LoopResult + reportLoopUsage bridge for runProfileMatrix by tangletools · Pull Request #74 · tangle-network/agent-runtime

tangletools · 2026-05-30T14:47:59Z

What

Makes runLoop surface the data agent-eval's new runProfileMatrix needs to run its backend-integrity guard — answering "how does agent-runtime consume the new primitive."

Why

runLoop already extracted per-call tokensIn/tokensOut from llm_call events, but only aggregated costUsd — the token counts were dropped before reaching Iteration or LoopResult. A product's runProfileMatrix/runCampaign dispatch that wraps runLoop could report cost but had no tokens to report, so agent-eval's assertRealBackend (which keys on tokenUsage) would misread a real run as a stub and throw.
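To make the failure mode concrete, here is a minimal sketch of a guard that keys on token counts. It is not agent-eval's assertRealBackend; `CellUsage`, `isStubLike`, and `assertLooksReal` are hypothetical names, and the only behavior taken from the description above is that a run reporting zero tokens is treated as a stub.

```typescript
// Hypothetical shape: what a dispatch reports for one matrix cell.
interface CellUsage {
  costUsd: number;
  tokenUsage: { input: number; output: number };
}

// Assumption: a cell with zero input and zero output tokens is treated as a stub,
// regardless of the cost it reports.
function isStubLike(cell: CellUsage): boolean {
  return cell.tokenUsage.input === 0 && cell.tokenUsage.output === 0;
}

function assertLooksReal(cell: CellUsage): void {
  if (isStubLike(cell)) {
    throw new Error("no token usage reported; run looks like a stub backend");
  }
}

// Before this PR: cost was aggregated, tokens were dropped, so a real run reads as a stub.
const beforeFix: CellUsage = { costUsd: 0.42, tokenUsage: { input: 0, output: 0 } };

// After this PR: the same run carries its token counts and passes the guard.
const afterFix: CellUsage = { costUsd: 0.42, tokenUsage: { input: 1200, output: 350 } };
```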

Changes

Iteration + LoopResult gain tokenUsage: { input, output }, summed per-iteration (across llm_call events) and across iterations (on the result). Structurally matches agent-eval's RunTokenUsage/CampaignTokenUsage.
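A rough sketch of the two-level summation described above. The event and result shapes are simplified stand-ins (the real types live in agent-runtime), and `addUsage`, `iterationUsage`, and `loopUsage` are illustrative names, not functions from this PR.

```typescript
// Simplified stand-ins for illustration only.
interface TokenUsage {
  input: number;
  output: number;
}

interface LlmCallEvent {
  kind: "llm_call";
  tokensIn?: number;
  tokensOut?: number;
}

const ZERO_USAGE: TokenUsage = { input: 0, output: 0 };

function addUsage(a: TokenUsage, b: TokenUsage): TokenUsage {
  return { input: a.input + b.input, output: a.output + b.output };
}

// Per iteration: sum tokens over every llm_call event, treating missing counts as 0.
function iterationUsage(events: LlmCallEvent[]): TokenUsage {
  return events.reduce(
    (acc, e) => addUsage(acc, { input: e.tokensIn ?? 0, output: e.tokensOut ?? 0 }),
    ZERO_USAGE,
  );
}

// Per loop: sum the per-iteration totals.
function loopUsage(iterations: { tokenUsage: TokenUsage }[]): TokenUsage {
  return iterations.reduce((acc, it) => addUsage(acc, it.tokenUsage), ZERO_USAGE);
}

const iter1 = {
  tokenUsage: iterationUsage([
    { kind: "llm_call", tokensIn: 100, tokensOut: 20 },
    { kind: "llm_call", tokensIn: 50, tokensOut: 10 },
  ]),
};
const iter2 = { tokenUsage: iterationUsage([{ kind: "llm_call", tokensIn: 30 }]) };
const total = loopUsage([iter1, iter2]);
```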

New reportLoopUsage(cost, result) — forwards a finished loop's cost and tokens into a campaign cost meter in one call. This is the trivial consumption path:

const dispatch: ProfileDispatchFn<S, A> = async (profile, scenario, ctx) => {
  const result = await runLoop({ ...optsFor(profile, scenario), ctx: loopCtx })
  reportLoopUsage(ctx, result)   // cost + tokens → integrity guard sees real activity
  return result.winner?.output as A
}

Typed structurally (UsageSink) so loops/ stays free of an agent-eval import.
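The structural typing can be sketched as follows. The method names on `UsageSink` (`addCost`, `addTokens`) and the `LoopUsageResult` shape are assumptions made for illustration; the PR text only says the sink is typed structurally so that loops/ needs no agent-eval import.

```typescript
// Assumed sink shape: any object with these two methods qualifies, no import required.
interface UsageSink {
  addCost(usd: number): void;
  addTokens(usage: { input: number; output: number }): void;
}

// Assumed subset of LoopResult that the bridge reads.
interface LoopUsageResult {
  costUsd: number;
  tokenUsage: { input: number; output: number };
}

// Forward a finished loop's cost and tokens into the sink in one call.
function reportLoopUsageSketch(sink: UsageSink, result: LoopUsageResult): void {
  sink.addCost(result.costUsd);
  sink.addTokens(result.tokenUsage);
}

// A plain object satisfies UsageSink without declaring that it implements it.
const recorded = { costUsd: 0, input: 0, output: 0 };
const meter = {
  addCost(usd: number) {
    recorded.costUsd += usd;
  },
  addTokens(usage: { input: number; output: number }) {
    recorded.input += usage.input;
    recorded.output += usage.output;
  },
};

reportLoopUsageSketch(meter, { costUsd: 0.25, tokenUsage: { input: 900, output: 120 } });
```

Because the parameter is an interface rather than a class from another package, a campaign cost meter from the eval harness and a test double like `meter` are interchangeable at the call site.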

Verification

typecheck + build green; full suite 381/381.
Extended the existing cost-aggregation test to assert token aggregation (per-iteration + total) and reportLoopUsage forwarding — the regression being "a real run misread as a stub."

Note: agent-runtime does not import runProfileMatrix (products do); its role is to surface the data. The @tangle-network/agent-eval dep bump (^0.54.0 → latest) is a separate follow-up once runProfileMatrix is published.

…Usage bridge

runLoop tracked per-call tokensIn/tokensOut (extractLlmCallEvent) but only aggregated costUsd; token counts were dropped before reaching Iteration or LoopResult. A runProfileMatrix/runCampaign dispatch wrapping runLoop could report cost but had no tokens to report, so agent-eval's backend-integrity guard (assertRealBackend, which keys on tokenUsage) would misread a real run as a stub and throw.

- Iteration + LoopResult gain tokenUsage: { input, output }, summed across every llm_call event (per iteration) and across iterations (LoopResult).
- reportLoopUsage(cost, result) forwards a finished loop's cost + tokens into a campaign cost meter in one call: the trivial consumption path for the new runProfileMatrix primitive. Typed structurally so loops stay free of an agent-eval import.

Extends the existing cost-aggregation test to assert token aggregation + reportLoopUsage forwarding. Full suite 381 green.

…pter

The seam critique found reportLoopUsage had one consumer (a test) and zero products: wiring runLoop into runProfileMatrix/runCampaign required hand-building ExecCtx, hand-adapting the campaign trace, and remembering to forward usage (forgetting the last yields a {0,0} stub cell). loopDispatch collapses all three into one typed call:

const dispatch = loopDispatch({ sandboxClient, toLoopOptions })
await runProfileMatrix({ profiles, scenarios, dispatch, judges, commitSha })

It builds the ExecCtx, forwards loop.* trace events into the campaign's scoped trace (campaignTraceToLoopEmitter), runs runLoop, reports cost+tokens via reportLoopUsage internally, and returns winner.output. loopCampaignDispatch is the runCampaign (no-profile) variant.

AgentProfile is imported from agent-eval (the eval-harness type ProfileDispatchFn keys on), NOT sandbox's; this closes the name-collision footgun at this call site.

Tests: returns winner artifact + reports exact usage + forwards trace spans; usage still flows on a validator-failing run (must not read as a stub). Full suite 383 green.
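The adapter's three responsibilities (trace forwarding, running the loop, reporting usage) can be sketched as below. This is an illustration under stated assumptions, not the shipped loopDispatch: the real adapter takes `{ sandboxClient, toLoopOptions }` and calls runLoop directly, whereas this sketch injects a hypothetical `runLoop` stand-in so it stays self-contained, and every type name ending in `Like` is invented here.

```typescript
interface TokenUsage {
  input: number;
  output: number;
}

// Assumed subset of LoopResult.
interface LoopResultLike<A> {
  costUsd: number;
  tokenUsage: TokenUsage;
  winner?: { output: A };
}

// Assumed campaign context: a usage sink plus a scoped trace.
interface CampaignCtxLike {
  addCost(usd: number): void;
  addTokens(usage: TokenUsage): void;
  trace(name: string, data?: unknown): void;
}

type LoopEmitter = (name: string, data?: unknown) => void;
type RunLoopLike<O, A> = (opts: O, emit: LoopEmitter) => Promise<LoopResultLike<A>>;

function loopDispatchSketch<P, S, O, A>(deps: {
  runLoop: RunLoopLike<O, A>; // hypothetical injection point, for self-containment
  toLoopOptions: (profile: P, scenario: S) => O;
}) {
  return async (profile: P, scenario: S, ctx: CampaignCtxLike): Promise<A | undefined> => {
    // 1. Forward loop events into the campaign's trace under a loop.* prefix.
    const emit: LoopEmitter = (name, data) => ctx.trace(`loop.${name}`, data);
    // 2. Run the loop with options derived from the profile and scenario.
    const result = await deps.runLoop(deps.toLoopOptions(profile, scenario), emit);
    // 3. Always report usage, even when no winner was produced.
    ctx.addCost(result.costUsd);
    ctx.addTokens(result.tokenUsage);
    return result.winner?.output;
  };
}

// A fake loop with no winner, standing in for a validator-failing run.
const fakeRunLoop: RunLoopLike<{ prompt: string }, string> = async (_opts, emit) => {
  emit("iteration", { index: 0 });
  return { costUsd: 0.1, tokenUsage: { input: 40, output: 8 } };
};

const sketchDispatch = loopDispatchSketch({
  runLoop: fakeRunLoop,
  toLoopOptions: (profile: string, scenario: string) => ({ prompt: `${profile}:${scenario}` }),
});
```

Reporting usage before returning, rather than only on success, mirrors the test described above: a run that fails validation still spent tokens and should not read as a stub.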

…ard dependency

Version-discipline fix (boundary critique, VERSIONING 3/10). agent-eval was the lone hard dependency while sandbox + agent-knowledge are already peers. A hard dep lets pnpm install a SECOND, divergent agent-eval tree with an incompatible RunRecord/DefaultVerdict; today only pnpm.overrides prevents it. As a peer (>=0.61.0 <1.0.0, required, not optional), a consumer running a stale or divergent substrate gets a loud unmet-peer warning instead of a silent split tree. agent-eval moves to devDependencies for agent-runtime's own build/test.

Typecheck + full suite (383) green with the peer layout.

drewstone added 5 commits May 30, 2026 08:47

chore(deps): bump @tangle-network/agent-eval ^0.54.0 → ^0.61.0

9cbd686

Consumes the published runProfileMatrix + token-capture release. 7-minor jump verified: typecheck + build + full suite (381) green.

chore(release): 0.32.0 — loopDispatch adapter + tokenUsage seam + age…

ffc89ce

…nt-eval peer-dep

tangletools merged commit fea51a4 into main May 30, 2026
1 check failed

tangletools merged 5 commits into main from feat/loop-token-usage-for-profile-matrix

2 participants
