feat: cc.billing — exact, cache-tier, per-call token attribution (OPIK-6873) by jverre · Pull Request #21 · comet-ml/opik-claude-code-plugin

jverre · 2026-06-11T14:36:10Z

Summary

Rebuilds the plugin's spend attribution around one product schema, cc.billing: per-LLM-call, cache-tier-aware token attribution that is exact against API usage at every level (span → trace → any SUM across traces). This is the data contract for Opik's AI Spend dashboard (OPIK-6870): lane values, breakdowns, stacked definition/usage bars, counts and FE-side pricing all come from this block.

cc.billing:
  llm_calls: 3
  model: claude-fable-5
  totals: {total, input, cache_read, cache_creation, output}   # == Σ original_usage.* exactly
  lanes:
    <laneKey>:                                                 # fixed JSON path per lane
      total, input, cache_read, cache_creation, output
      items: [{name, kind: definition|usage, count, total, input, cache_read, cache_creation, output}]
    unattributed: {...}                                        # explicit remainder — what makes sums exact

How it works

Anthropic's prompt caching is prefix-based, so each call's usage splits the request positionally: [0,R) is cache_read, [R,R+W) is cache_creation, and the tail is fresh input, where R and W are the call's cache_read and cache_creation token counts. Per call, the request is laid out as ordered pieces (static config prefix, then the conversation in transcript order) and reconciled to the call's measured usage: usage-derived pieces are never rescaled, and any undershoot lands in unattributed, placed at the tail where unobserved content actually sits. The reconciled layout is then cut by position. Output is booked from per-block attributed tokens. Tier tokens are therefore per-call billing events: plain SUM aggregation reproduces the API bill exactly, with no cumulative weighting and no scale factors.
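
The positional cut described above can be sketched in a few lines of Go. This is a minimal illustration, not the plugin's code: the `Piece` and `Tiers` types, the `cutByPosition` name, and the token numbers are all assumptions made for the example.

```go
package main

import "fmt"

// Piece is one ordered slice of a request. The type and the lane names used
// below are illustrative, not the plugin's own definitions.
type Piece struct {
	Lane   string
	Tokens int
}

// Tiers is one piece's token count split by billing tier.
type Tiers struct {
	CacheRead, CacheCreation, Input int
}

// overlap returns the length of the intersection of [lo,hi) and [a,b).
func overlap(lo, hi, a, b int) int {
	if a > lo {
		lo = a
	}
	if b < hi {
		hi = b
	}
	if hi <= lo {
		return 0
	}
	return hi - lo
}

// cutByPosition walks the pieces in request order and assigns each piece's
// tokens to a tier by offset: [0,r) cache_read, [r,r+w) cache_creation,
// everything after that fresh input.
func cutByPosition(pieces []Piece, r, w int) []Tiers {
	out := make([]Tiers, len(pieces))
	pos := 0
	for i, p := range pieces {
		lo, hi := pos, pos+p.Tokens
		out[i].CacheRead = overlap(lo, hi, 0, r)
		out[i].CacheCreation = overlap(lo, hi, r, r+w)
		out[i].Input = p.Tokens - out[i].CacheRead - out[i].CacheCreation
		pos = hi
	}
	return out
}

func main() {
	pieces := []Piece{
		{"static_overhead", 100},
		{"user_prompts", 50},
		{"tool_results", 30},
		{"unattributed", 20},
	}
	// Hypothetical measured usage: cache_read=120, cache_creation=50, input=30.
	for i, t := range cutByPosition(pieces, 120, 50) {
		fmt.Printf("%-16s %+v\n", pieces[i].Lane, t)
	}
}
```

Because every piece is cut against the same boundaries, the per-piece columns sum back to R, W and the fresh tail by construction, which is why a plain SUM over pieces, calls or traces stays exact.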

Also in this PR

- Bug fix (2x usage double-count): the transcript repeats message.usage on every entry of a multi-block message; two summary paths summed per entry. Found via a three-way audit (transcript vs span aggregation vs /context); reproduced on fresh production data (logged 202 for a true 101, exactly 2x). Fixed by counting once per message.id.
- count_tokens anchoring: exact piece sizes via the free endpoint, authenticated with Claude Code's own OAuth credential (keychain / .credentials.json). Persistent (model, sha256) cache → static config is measured once ever; budgeted, detached at turn end. Live measurement showed ratio estimates running 25–55% under on table/code-heavy content, exactly the mass previously stuck in unattributed.
- Static overhead itemization: the dynamic environment block (cwd, platform, git status) is reconstructed locally and carved out as its own item, so repos with long git status visibly pay for it per request; deferred-catalog names split built-in vs MCP; optional per-release Components table + capture procedure in docs/builtin-calibration.md.
- Payload consolidation: the per-lane composition emitters (skills/tools/memory/agents summaries, user_prompts, tool_results, file_attachments, prior_assistant, thinking, assistant_text, output_tokens) are removed; every UI surface they fed now reads cc.billing. Net −900 lines. Trace metadata is identity + git + billing + cc_builtin (+ context_runtime).
- Skill load identity: one loaded[] event per load with stable unique ids (slash:<idx> for slash commands) so repeat loads count correctly.
- Removed after validation: a call-1 residual self-calibration for the bundled system prompt. It absorbed all unobserved request content and inflated static overhead 3.6x vs /context ground truth (commit history documents the dead end).
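
The once-per-message.id fix from the first item above can be illustrated with a small Go sketch. The `Entry` shape and the `sumOncePerMessage` name are hypothetical simplifications of a transcript entry; only the 202 vs 101 figures come from the description.

```go
package main

import "fmt"

// Entry is a simplified transcript entry (hypothetical shape): one entry per
// content block, with the parent message's usage repeated on each entry.
type Entry struct {
	MessageID    string
	OutputTokens int
}

// sumPerEntry is the buggy aggregation: it books usage on every entry.
func sumPerEntry(entries []Entry) int {
	total := 0
	for _, e := range entries {
		total += e.OutputTokens
	}
	return total
}

// sumOncePerMessage books each message's usage once, keyed by message id.
func sumOncePerMessage(entries []Entry) int {
	seen := make(map[string]bool)
	total := 0
	for _, e := range entries {
		if seen[e.MessageID] {
			continue
		}
		seen[e.MessageID] = true
		total += e.OutputTokens
	}
	return total
}

func main() {
	entries := []Entry{
		{"msg_1", 101}, // first content block of the message
		{"msg_1", 101}, // second content block, same usage repeated
	}
	fmt.Println(sumPerEntry(entries), sumOncePerMessage(entries)) // 202 101
}
```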

Validation

| Invariant | Where |
| --- | --- |
| `Σ lanes (incl. unattributed) == input/cache_read/cache_creation/output` per call & per trace | `billing_test.go` + two real sessions, exact to the token |
| Cache-prefix chain: `read(k+1) == read(k) + write(k)` | real-session audit (850k-context session, token-perfect) |
| Usage booked once per `message.id` (the 2x regression) | `cumulative_test.go` |
| Skill bodies never leak into user_prompts; repeat loads keep distinct ids | `cumulative_test.go` |
| `by_size`/`by_content`-style partitions sum to lane totals | `billing_test.go` |
| Anchoring budget/baseline mechanics, cache merge, anchor preference | `count_tokens_test.go` |
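
As an illustration of the cache-prefix chain invariant in the table, the following Go sketch scans a sequence of per-call usage records and reports the first call that breaks `read(k+1) == read(k) + write(k)`. The `Usage` type, the function name and the sample numbers are assumptions for the example; a real audit would presumably also have to decide how to treat legitimate breaks such as cache expiry.

```go
package main

import "fmt"

// Usage holds the two cache columns of one LLM call (hypothetical shape).
type Usage struct {
	CacheRead     int
	CacheCreation int
}

// firstChainBreak returns the index of the first call whose cache_read does
// not equal the previous call's cache_read + cache_creation, or -1 if the
// whole sequence satisfies the chain.
func firstChainBreak(calls []Usage) int {
	for k := 0; k+1 < len(calls); k++ {
		if calls[k+1].CacheRead != calls[k].CacheRead+calls[k].CacheCreation {
			return k + 1
		}
	}
	return -1
}

func main() {
	calls := []Usage{
		{CacheRead: 0, CacheCreation: 5000},    // cache-cold first call
		{CacheRead: 5000, CacheCreation: 1200}, // reads what call 0 wrote
		{CacheRead: 6200, CacheCreation: 300},  // reads the extended prefix
	}
	fmt.Println(firstChainBreak(calls)) // -1
}
```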

Full suite green. Consumer-side queries (Opik BE) are documented on OPIK-6870.

🤖 Generated with Claude Code

…counts (OPIK-6873)

Token sums for user_prompts, tool_results and file_attachments switch from turn-only to cumulative-to-date: every prior prompt/result/attachment is replayed in each request, so SUM of the per-trace value across a session's traces now yields billing-weighted attribution (size × turns it rides in). Counts stay new-this-turn so the same SUM yields true item counts instead of a quadratic blow-up.

- user_prompts: add by_size[] {bucket, tokens (cumulative), count (new-this-turn)}, bucketing each prompt individually.
- prior_assistant: per-block attribution (text + thinking only), computed from AttributedOutputTokens; the tool_use share is excluded (it belongs to the tool lanes; counting it here double-attributed). New by_content {assistant_text, thinking}.
- skills.loaded: one event per load (latest-wins collapsed repeat loads, under-counting tokens and loads); slash-command loads get a stable synthetic "slash:<entry idx>" tool_use_id so consumers can dedupe load events across the cumulative per-trace arrays.

Validation (cumulative_test.go): replay formula Σ size × (N − i + 1), count conservation (Σ per-turn == whole-session), by_size/by_content partition their lane totals, tool_use share + lane total == measured usage.output_tokens.

Also verified against two real transcripts (805 entries / 1 turn and 1,111 entries / 29 turns): monotone cumulative tokens, exact count conservation (29 prompts, 200 tool calls).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
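
The replay formula in the commit message above, Σ size × (N − i + 1), can be checked against a direct sum of cumulative-to-date values. The Go sketch below assumes one new item per turn and uses made-up sizes; the function names are illustrative.

```go
package main

import "fmt"

// sumOfCumulative adds up the cumulative-to-date token value reported on each
// of the N traces, which is what SUM across a session's traces would compute.
func sumOfCumulative(sizes []int) int {
	total, running := 0, 0
	for _, s := range sizes {
		running += s // everything sent so far is replayed in this turn
		total += running
	}
	return total
}

// replayFormula evaluates Σ size_i × (N − i + 1) with 1-based turn index i:
// an item introduced at turn i rides in turns i..N.
func replayFormula(sizes []int) int {
	n := len(sizes)
	total := 0
	for idx, s := range sizes {
		i := idx + 1
		total += s * (n - i + 1)
	}
	return total
}

func main() {
	sizes := []int{100, 40, 10} // tokens first sent at turns 1, 2, 3
	fmt.Println(sumOfCumulative(sizes), replayFormula(sizes)) // 390 390
}
```

Here the per-turn cumulative values are 100, 140 and 150, and the formula gives 100×3 + 40×2 + 10×1, so both sides come to 390.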

…in row

Audit of a real 3-turn session (OPIK-6873) against per-call API usage found two collection errors:

- The transcript repeats the same message.usage on every entry of a multi-block message (one entry per content block, shared message.id). assistantOutputTotals and cumulativeMessagesTokens summed per ENTRY, inflating prior-assistant totals and the context snapshot's messages category ~2x (observed: 2,302 reported vs 1,151 real). Both now count once per message id. The per-block attribution path (DeduplicateUsage) was already correct.
- cc_builtin matched 2.1.173 sessions against the 2.1.150 row, where built-in tool schemas were still always-on (17.6k). 2.1.173 defers most of the catalog (1.1k always-on), so static overhead was overstated 4.3x. Added the 2.1.173 row captured from /context.

After both fixes the audited session's lane composition reconciles with its /context ground truth per category, not just in aggregate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Lays each LLM call's request out as ordered pieces (static config prefix, then the conversation in transcript order) and cuts by position against the call's usage: [0,R) cache_read, [R,R+W) cache_creation, tail fresh. Prompt caching is prefix-based, so position determines the billing tier. Output is booked from per-block attributed tokens.

Exactness contract, validated per call and therefore per trace (trace usage is the sum of its spans): by_lane columns, including an explicit `unattributed` row that absorbs unparsed content (system reminders, request envelope) and estimation drift, sum exactly to input_tokens, cache_read, cache_creation and output_tokens. Usage-derived pieces are never rescaled; estimates shrink proportionally only on overshoot.

Tier tokens are per-call billing events: additive across calls, traces and periods, so plain SUM aggregation reproduces the API bill exactly, with no cumulative weighting needed for the billed view.

Verified against two real sessions (5 calls each): totals and lane sums match measured usage to the token on all four columns.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
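
One way to read the reconciliation rule above (exact pieces untouched, estimates shrunk proportionally on overshoot, undershoot booked to `unattributed`) is sketched below in Go. The `Piece` type, the `reconcile` name, and the choice to floor the scaled estimates and send the rounding remainder to `unattributed` are assumptions of this sketch, not details confirmed by the commit.

```go
package main

import "fmt"

// Piece is one request piece (hypothetical shape). Exact marks usage-derived
// sizes, which are never rescaled.
type Piece struct {
	Name   string
	Tokens int
	Exact  bool
}

// reconcile returns a copy of pieces adjusted to the measured target, plus
// the unattributed remainder. If the exact pieces alone exceed the target,
// the remainder is clamped to 0 and the sum cannot be made to match.
func reconcile(pieces []Piece, target int) ([]Piece, int) {
	exactSum, estSum := 0, 0
	for _, p := range pieces {
		if p.Exact {
			exactSum += p.Tokens
		} else {
			estSum += p.Tokens
		}
	}
	out := append([]Piece(nil), pieces...)
	if exactSum+estSum > target && estSum > 0 {
		budget := target - exactSum // tokens left for the estimates
		if budget < 0 {
			budget = 0
		}
		for i := range out {
			if !out[i].Exact {
				out[i].Tokens = out[i].Tokens * budget / estSum // floor
			}
		}
	}
	sum := 0
	for _, p := range out {
		sum += p.Tokens
	}
	unattributed := target - sum
	if unattributed < 0 {
		unattributed = 0
	}
	return out, unattributed
}

func main() {
	pieces := []Piece{
		{"tool_results", 600, true},
		{"skills", 300, false},
		{"memory", 200, false},
	}
	fmt.Println(reconcile(pieces, 1000)) // overshoot: estimates shrink
	fmt.Println(reconcile(pieces, 1200)) // undershoot: remainder is unattributed
}
```

With a target of 1000 the two estimates shrink from 300 and 200 to 240 and 160 while the exact 600 stays put; with a target of 1200 nothing is rescaled and 100 tokens land in the remainder.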

…paths

Replace the by_lane/by_entity arrays with cc.billing.lanes.<laneKey>, so the backend's composition query stays one fixed JSON path per cell:

  SUM(JSONExtractInt(metadata,'cc','billing','lanes','skills','cache_read'))

Breakdowns reuse the existing generic ARRAY JOIN pattern over ...,'lanes','<lane>','items' with label field `name`.

`total` is precomputed per lane and per item (sum of the four columns) so the lane-card/Sankey value is also a single path. Exactness contract and tests unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
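
To make the "one fixed JSON path per cell" idea concrete, here is a Go sketch that serializes a lane under `cc.billing.lanes.<laneKey>` and then reads a single cell back by path, loosely mimicking what the `JSONExtractInt` call does on the backend. The struct definitions, helper names and numbers are assumptions for illustration; only the field names and the path come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Tiers carries the four billing columns plus the precomputed total.
type Tiers struct {
	Total         int `json:"total"`
	Input         int `json:"input"`
	CacheRead     int `json:"cache_read"`
	CacheCreation int `json:"cache_creation"`
	Output        int `json:"output"`
}

// Item is one breakdown row inside a lane, labelled by name.
type Item struct {
	Name string `json:"name"`
	Tiers
}

// Lane is a lane's own columns plus its items.
type Lane struct {
	Tiers
	Items []Item `json:"items"`
}

// withTotal fills Total as the sum of the four columns.
func withTotal(t Tiers) Tiers {
	t.Total = t.Input + t.CacheRead + t.CacheCreation + t.Output
	return t
}

// encodeMetadata nests the lanes under cc.billing.lanes.
func encodeMetadata(lanes map[string]Lane) (string, error) {
	meta := map[string]interface{}{
		"cc": map[string]interface{}{
			"billing": map[string]interface{}{"lanes": lanes},
		},
	}
	b, err := json.Marshal(meta)
	return string(b), err
}

// extractInt follows a fixed key path through the JSON document.
func extractInt(doc string, path ...string) (int, bool) {
	var cur interface{}
	if err := json.Unmarshal([]byte(doc), &cur); err != nil {
		return 0, false
	}
	for _, key := range path {
		obj, ok := cur.(map[string]interface{})
		if !ok {
			return 0, false
		}
		next, ok := obj[key]
		if !ok {
			return 0, false
		}
		cur = next
	}
	n, ok := cur.(float64)
	return int(n), ok
}

func main() {
	cols := withTotal(Tiers{Input: 10, CacheRead: 400, CacheCreation: 50})
	lane := Lane{Tiers: cols, Items: []Item{{Name: "pdf", Tiers: cols}}}
	doc, err := encodeMetadata(map[string]Lane{"skills": lane})
	if err != nil {
		panic(err)
	}
	fmt.Println(extractInt(doc, "cc", "billing", "lanes", "skills", "cache_read"))
}
```

Keying lanes by name rather than storing them in an array is what lets every cell be addressed without a search step, and precomputing `total` keeps the lane-card value on the same single-path footing.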

Per-lane composition emitters (skills/tools/memory/agents summaries, user_prompts, tool_results, file_attachments, prior_assistant, thinking, assistant_text, output_tokens) are removed from the trace payload: every UI surface they fed now comes from cc.billing, which additionally gives the cache-tier split and is exact against API usage. The trace metadata is now identity + git + billing + cc_builtin (+ context_runtime patched later).

cc.billing items gain what the UI panels need:

- kind: "definition" (always-on config) vs "usage" (conversation-driven). Drives the stacked bars and the unused badge (definition > 0, usage = 0).
- count: NEW events this turn per entity (prompts, calls, loads, files). Additive, so plain SUM yields true counts.
- the skill menu attachment is split into per-skill definition pieces (parseSkillListingMenu), and MCP instruction deltas split per server, so per-entity definition cost is available without the composition view.

Extractors that only fed the removed sections are deleted; the ones billing itself uses (memory, agents, tools, skill identity machinery, the per-span context snapshot) stay.

Regressions ported to billing level: usage booked once per message.id (the 2x double-count audit), skill bodies excluded from user_prompts, distinct slash-load ids.

Verified on a real session: totals exact to usage on all four columns; skills lane = 129 per-skill definition items + load usage items.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
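
The "unused badge" rule mentioned above (definition > 0, usage = 0) can be derived from a lane's items alone. The Go sketch below is an assumption about how a consumer might compute it; the `Item` shape is trimmed to the fields needed and the skill names are invented.

```go
package main

import "fmt"

// Item is a trimmed cc.billing item: name, kind and precomputed total.
type Item struct {
	Name  string
	Kind  string // "definition" or "usage"
	Total int
}

// unusedEntities returns, in first-seen order, the names that carry
// definition cost but no usage cost.
func unusedEntities(items []Item) []string {
	def := map[string]int{}
	use := map[string]int{}
	seen := map[string]bool{}
	var order []string
	for _, it := range items {
		if !seen[it.Name] {
			seen[it.Name] = true
			order = append(order, it.Name)
		}
		switch it.Kind {
		case "definition":
			def[it.Name] += it.Total
		case "usage":
			use[it.Name] += it.Total
		}
	}
	var unused []string
	for _, name := range order {
		if def[name] > 0 && use[name] == 0 {
			unused = append(unused, name)
		}
	}
	return unused
}

func main() {
	items := []Item{
		{"pdf", "definition", 120},
		{"pdf", "usage", 900},
		{"changelog", "definition", 80}, // listed in the menu, never loaded
	}
	fmt.Println(unusedEntities(items)) // [changelog]
}
```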

…t_tokens) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…point

Measures the exact token cost of content we possess (skill menu blocks, skill bodies, memory files, agent frontmatter, MCP instructions, prompts, string tool results) against /v1/messages/count_tokens, authenticated with Claude Code's own OAuth credential (keychain / .credentials.json; falls back to ANTHROPIC_API_KEY). Free endpoint, no user setup.

- Persistent cache keyed (model, sha256) at ~/.opik-token-counts.json: hash-stable config content is measured once ever; steady-state API traffic ~0. Budgeted (10 calls/turn, biggest-first), runs in a detached child at turn end (same pattern as the /context fetcher); the next flush picks anchors up via measuredOrEstimate.
- cc_builtin self-calibration: on a session's cache-cold first call the bundled system prompt + builtin schemas are derived as the residual of call-1 usage minus everything attributable, stored per CC version, and preferred over the hand-maintained table.
- Anchors only improve the SPLIT (per-call usage stays the exact total), shrinking `unattributed` toward genuinely unobserved content.
- Opt-out: OPIK_CC_DISABLE_TOKEN_COUNT=true.

Live validation on a real session: ratio estimates ran 25-55% UNDER measured on table/code-heavy content (a 17.9k-char /context dump: estimated 4,173 vs measured 9,373), exactly the mass previously stuck in unattributed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
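
The caching and budgeting scheme above might look roughly like the following Go sketch. It covers only the local bookkeeping (cache key and biggest-first selection), not the HTTP call. The key format, the use of byte length as the size proxy, and all names are assumptions; the commit only states that the cache is keyed by (model, sha256) and that measurement is budgeted biggest-first.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// cacheKey derives a stable key from the model and the content hash, so
// unchanged content never needs to be measured twice for the same model.
func cacheKey(model, content string) string {
	sum := sha256.Sum256([]byte(content))
	return model + ":" + hex.EncodeToString(sum[:])
}

// pickToMeasure returns up to budget contents that are not in the cache yet,
// largest first, on the assumption that bigger pieces carry more estimation
// error and are worth measuring before small ones.
func pickToMeasure(model string, contents []string, cache map[string]int, budget int) []string {
	var pending []string
	for _, c := range contents {
		if _, ok := cache[cacheKey(model, c)]; !ok {
			pending = append(pending, c)
		}
	}
	sort.SliceStable(pending, func(i, j int) bool {
		return len(pending[i]) > len(pending[j])
	})
	if len(pending) > budget {
		pending = pending[:budget]
	}
	return pending
}

func main() {
	model := "claude-fable-5"
	cache := map[string]int{cacheKey(model, "already measured skill body"): 7}
	contents := []string{
		"already measured skill body",
		"short prompt",
		"a much longer memory file that has not been measured yet",
	}
	for _, c := range pickToMeasure(model, contents, cache, 1) {
		fmt.Println("measure:", c)
	}
}
```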

The residual (call-1 usage minus attributable pieces) absorbs ALL unobserved request content — system reminders, deferred-tool name listings — plus the estimation drift of everything else at call 1. Validated against /context ground truth: it stored 24.4k for a bundled block that actually costs 6.8k, inflating static_overhead ~3.6x and cannibalizing the `unattributed` lane entirely. The version table + count_tokens anchors remain the source for the bundled block; unseen mass stays in `unattributed` where it's visible. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…split, components table

Tier 1 (runtime): the bundled prompt's dynamic environment block (cwd, platform, git status snapshot) is locally reconstructible: rebuild it, size it via measuredOrEstimate (anchored when stable), and emit it as its own static_overhead item, with the per-version remainder as `core_prompt`. Repos with long git status visibly pay for it on every request. The deferred-tools catalog deltas now split per name: built-in names land in static_overhead/deferred_tool_names, mcp__ names stay in mcp_servers/catalog_deltas.

Tier 2 (per-release): ccBuiltinConstants gains an optional Components map, a named itemization of the bundled block (identity/harness rules, memory instructions, session guidance, per-tool schemas) produced by the capture procedure in docs/builtin-calibration.md. When present it replaces the two-entity split. Invariant: Σ components == prompt + schemas constants.

Verified end-to-end: BE breakdown endpoint serves the new items through the generic cc.billing items query with no backend changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
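
Two of the rules in the commit message above lend themselves to a short Go sketch: routing deferred tool names by their `mcp__` prefix, and checking the invariant Σ components == prompt + schemas. The function names, tool names and token figures are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// splitDeferredNames routes deferred-catalog names: names with the mcp__
// prefix belong to MCP servers, everything else is treated as built-in.
func splitDeferredNames(names []string) (builtin, mcp []string) {
	for _, n := range names {
		if strings.HasPrefix(n, "mcp__") {
			mcp = append(mcp, n)
		} else {
			builtin = append(builtin, n)
		}
	}
	return builtin, mcp
}

// componentsConsistent checks that a named itemization of the bundled block
// adds up to the per-version prompt and schemas constants.
func componentsConsistent(components map[string]int, prompt, schemas int) bool {
	sum := 0
	for _, tokens := range components {
		sum += tokens
	}
	return sum == prompt+schemas
}

func main() {
	builtin, mcp := splitDeferredNames([]string{
		"WebFetch", "mcp__github__create_issue", "NotebookEdit",
	})
	fmt.Println(builtin, mcp)

	// Hypothetical itemization: 5,700 prompt tokens + 1,100 schema tokens.
	components := map[string]int{
		"identity_rules":      3000,
		"memory_instructions": 1500,
		"session_guidance":    1200,
		"tool_schemas":        1100,
	}
	fmt.Println(componentsConsistent(components, 5700, 1100)) // true
}
```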

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jverre and others added 10 commits June 11, 2026 15:35

refactor: rename billing field fresh -> input (matches the API's inpu…

9aea1bd

…t_tokens) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat: stamp model on cc.billing for tier-column pricing

82e267f

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jverre changed the title ~~feat: billing-weighted lane accounting — cumulative tokens, per-turn counts (OPIK-6873)~~ feat: cc.billing — exact, cache-tier, per-call token attribution (OPIK-6873) Jun 11, 2026

jverre merged commit e25635a into main Jun 11, 2026
1 check passed

jverre merged 10 commits into main from jacques/OPIK-6873-cumulative-lane-accounting
