feat: cc.billing — exact, cache-tier, per-call token attribution (OPIK-6873) #21
Merged
…counts (OPIK-6873)
Token sums for user_prompts, tool_results and file_attachments switch from
turn-only to cumulative-to-date: every prior prompt/result/attachment is
replayed in each request, so SUM of the per-trace value across a session's
traces now yields billing-weighted attribution (size × turns it rides in).
Counts stay new-this-turn so the same SUM yields true item counts instead
of a quadratic blow-up.
- user_prompts: add by_size[] {bucket, tokens (cumulative), count
(new-this-turn)}, bucketing each prompt individually.
- prior_assistant: per-block attribution — text + thinking only, computed
from AttributedOutputTokens; tool_use share excluded (it belongs to the
tool lanes; counting it here double-attributed). New by_content
{assistant_text, thinking}.
- skills.loaded: one event per load (latest-wins collapsed repeat loads,
under-counting tokens and loads); slash-command loads get a stable
synthetic "slash:<entry idx>" tool_use_id so consumers can dedupe load
events across the cumulative per-trace arrays.
Validation (cumulative_test.go): replay formula Σ size × (N − i + 1),
count conservation (Σ per-turn == whole-session), by_size/by_content
partition their lane totals, tool_use share + lane total == measured
usage.output_tokens. Also verified against two real transcripts (805
entries / 1 turn and 1,111 entries / 29 turns): monotone cumulative
tokens, exact count conservation (29 prompts, 200 tool calls).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
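The replay formula from this commit can be checked with a small sketch. It assumes exactly one item (prompt, tool result, or attachment) is introduced per turn; `cumulativeByTurn`, `replayWeighted` and the sizes are illustrative names and numbers, not the plugin's code.

```go
package main

import "fmt"

// cumulativeByTurn returns the per-trace token value for each turn: the sum of
// every item introduced up to and including that turn. sizes[i] is the token
// size of the single item introduced in turn i+1 (an assumption of this sketch).
func cumulativeByTurn(sizes []int) []int {
	out := make([]int, len(sizes))
	running := 0
	for i, s := range sizes {
		running += s
		out[i] = running
	}
	return out
}

// replayWeighted evaluates Σ size_i × (N − i + 1) with 1-based i.
func replayWeighted(sizes []int) int {
	n := len(sizes)
	total := 0
	for i, s := range sizes {
		total += s * (n - i) // 0-based i, so N − (i+1) + 1 == N − i
	}
	return total
}

// sum adds up a slice of ints.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	sizes := []int{100, 40, 250} // hypothetical prompt sizes for turns 1..3
	perTrace := cumulativeByTurn(sizes)
	fmt.Println(perTrace)              // [100 140 390]
	fmt.Println(sum(perTrace))         // 630
	fmt.Println(replayWeighted(sizes)) // 630
}
```

Summing the cumulative per-trace values gives the same number as the closed-form replay weight, which is why plain SUM over a session's traces yields billing-weighted attribution while the new-this-turn counts (one per turn here) still sum to the true item count.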
…in row
Audit of a real 3-turn session (OPIK-6873) against per-call API usage found two collection errors:
- The transcript repeats the same message.usage on every entry of a multi-block message (one entry per content block, shared message.id). assistantOutputTotals and cumulativeMessagesTokens summed per ENTRY, inflating prior-assistant totals and the context snapshot's messages category ~2x (observed: 2,302 reported vs 1,151 real). Both now count once per message id. The per-block attribution path (DeduplicateUsage) was already correct.
- cc_builtin matched 2.1.173 sessions against the 2.1.150 row, where built-in tool schemas were still always-on (17.6k). 2.1.173 defers most of the catalog (1.1k always-on), so static overhead was overstated 4.3x. Added the 2.1.173 row captured from /context.
After both fixes the audited session's lane composition reconciles with its /context ground truth per category, not just in aggregate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
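A minimal sketch of the double count and its fix, assuming a simplified transcript entry that carries only a message id and its repeated usage. The `entry` type, function names and the per-message split are hypothetical; the sizes are chosen only so the totals mirror the 2,302 vs 1,151 observation.

```go
package main

import "fmt"

// entry is a hypothetical, simplified transcript entry: one per content block,
// each repeating the usage of the message it belongs to.
type entry struct {
	MessageID    string
	OutputTokens int
}

// sumPerEntry reproduces the bug: usage is added once per content block.
func sumPerEntry(entries []entry) int {
	total := 0
	for _, e := range entries {
		total += e.OutputTokens
	}
	return total
}

// sumPerMessage counts each message's usage once, keyed by message id.
func sumPerMessage(entries []entry) int {
	seen := make(map[string]bool)
	total := 0
	for _, e := range entries {
		if seen[e.MessageID] {
			continue
		}
		seen[e.MessageID] = true
		total += e.OutputTokens
	}
	return total
}

func main() {
	entries := []entry{
		{MessageID: "msg_a", OutputTokens: 651}, // first block of msg_a
		{MessageID: "msg_a", OutputTokens: 651}, // second block, same usage repeated
		{MessageID: "msg_b", OutputTokens: 500},
		{MessageID: "msg_b", OutputTokens: 500},
	}
	fmt.Println(sumPerEntry(entries))   // 2302
	fmt.Println(sumPerMessage(entries)) // 1151
}
```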
Lays each LLM call's request out as ordered pieces (static config prefix, then the conversation in transcript order) and cuts by position against the call's usage: [0,R) cache_read, [R,R+W) cache_creation, tail fresh. Prompt caching is prefix-based, so position determines the billing tier. Output is booked from per-block attributed tokens.
Exactness contract, validated per call and therefore per trace (trace usage is the sum of its spans): by_lane columns, including an explicit `unattributed` row that absorbs unparsed content (system reminders, request envelope) and estimation drift, sum exactly to input_tokens, cache_read, cache_creation and output_tokens. Usage-derived pieces are never rescaled; estimates shrink proportionally only on overshoot.
Tier tokens are per-call billing events: additive across calls, traces and periods, so plain SUM aggregation reproduces the API bill exactly, with no cumulative weighting needed for the billed view.
Verified against two real sessions (5 calls each): totals and lane sums match measured usage to the token on all four columns.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
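The positional cut can be sketched as an interval overlap, assuming the pieces have already been reconciled so their sizes sum to the call's total input. The `piece` and `tiers` types, `cutByPosition`, and the numbers are illustrative, not the plugin's implementation.

```go
package main

import "fmt"

// piece is one ordered chunk of the request, attributed to a lane.
type piece struct {
	Lane   string
	Tokens int
}

// tiers holds the input-side billing tiers for one lane.
type tiers struct {
	CacheRead, CacheCreation, Fresh int
}

// overlap returns the length of the intersection of [aStart,aEnd) and [bStart,bEnd).
func overlap(aStart, aEnd, bStart, bEnd int) int {
	lo, hi := aStart, aEnd
	if bStart > lo {
		lo = bStart
	}
	if bEnd < hi {
		hi = bEnd
	}
	if hi > lo {
		return hi - lo
	}
	return 0
}

// cutByPosition walks the pieces in request order and splits each one across
// [0,R) cache_read, [R,R+W) cache_creation and the fresh tail.
func cutByPosition(pieces []piece, read, write int) map[string]tiers {
	out := make(map[string]tiers)
	pos := 0
	for _, p := range pieces {
		start, end := pos, pos+p.Tokens
		pos = end
		r := overlap(start, end, 0, read)
		w := overlap(start, end, read, read+write)
		t := out[p.Lane]
		t.CacheRead += r
		t.CacheCreation += w
		t.Fresh += p.Tokens - r - w
		out[p.Lane] = t
	}
	return out
}

func main() {
	pieces := []piece{
		{Lane: "static_overhead", Tokens: 1000},
		{Lane: "user_prompts", Tokens: 300},
		{Lane: "tool_results", Tokens: 500},
	}
	lanes := cutByPosition(pieces, 1100, 400) // R = 1100, W = 400
	fmt.Println(lanes["static_overhead"])     // {1000 0 0}
	fmt.Println(lanes["user_prompts"])        // {100 200 0}
	fmt.Println(lanes["tool_results"])        // {0 200 300}
}
```

Each tier column sums back to the call's usage (1,100 read, 400 written, 300 fresh), which is the per-call exactness the commit describes.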
…paths
Replace the by_lane/by_entity arrays with cc.billing.lanes.<laneKey>, so the backend's composition query stays one fixed JSON path per cell:
SUM(JSONExtractInt(metadata,'cc','billing','lanes','skills','cache_read'))
and breakdowns reuse the existing generic ARRAY JOIN pattern over ...,'lanes','<lane>','items' with label field `name`.
`total` is precomputed per lane and per item (sum of the four columns) so the lane-card/Sankey value is also a single path.
Exactness contract and tests unchanged.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
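A rough sketch of what the keyed layout might look like when emitted from Go. Only the `cc`, `billing`, `lanes`, `items`, `name`, `total` and `cache_read` keys come from the description above; the other column keys, the type names and the skill names are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Columns holds the four billing columns plus the precomputed total.
// The "input", "cache_creation" and "output" key names are assumptions.
type Columns struct {
	Input         int `json:"input"`
	CacheRead     int `json:"cache_read"`
	CacheCreation int `json:"cache_creation"`
	Output        int `json:"output"`
	Total         int `json:"total"`
}

// Item is one entity inside a lane, labelled by `name`.
type Item struct {
	Name string `json:"name"`
	Columns
}

// Lane carries lane-level columns and its per-entity items.
type Lane struct {
	Columns
	Items []Item `json:"items"`
}

// withTotal fills Total with the sum of the four columns.
func withTotal(c Columns) Columns {
	c.Total = c.Input + c.CacheRead + c.CacheCreation + c.Output
	return c
}

// buildLane sums the items into lane-level columns and precomputes all totals.
func buildLane(items []Item) Lane {
	var l Lane
	for _, it := range items {
		it.Columns = withTotal(it.Columns)
		l.Input += it.Input
		l.CacheRead += it.CacheRead
		l.CacheCreation += it.CacheCreation
		l.Output += it.Output
		l.Items = append(l.Items, it)
	}
	l.Columns = withTotal(l.Columns)
	return l
}

func main() {
	skills := buildLane([]Item{
		{Name: "skill_a", Columns: Columns{CacheRead: 120, CacheCreation: 30}},
		{Name: "skill_b", Columns: Columns{Input: 10, CacheRead: 80}},
	})
	var payload struct {
		CC struct {
			Billing struct {
				Lanes map[string]Lane `json:"lanes"`
			} `json:"billing"`
		} `json:"cc"`
	}
	payload.CC.Billing.Lanes = map[string]Lane{"skills": skills}
	b, err := json.MarshalIndent(payload, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

With this shape every cell a dashboard needs is reachable by a fixed key path (for example `cc → billing → lanes → skills → cache_read`), and per-entity breakdowns iterate over `items`.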
Per-lane composition emitters (skills/tools/memory/agents summaries, user_prompts, tool_results, file_attachments, prior_assistant, thinking, assistant_text, output_tokens) are removed from the trace payload: every UI surface they fed now comes from cc.billing, which additionally gives the cache-tier split and is exact against API usage. The trace metadata is now identity + git + billing + cc_builtin (+ context_runtime patched later).
cc.billing items gain what the UI panels need:
- kind: "definition" (always-on config) vs "usage" (conversation-driven); drives the stacked bars and the unused badge (definition > 0, usage = 0)
- count: NEW events this turn per entity (prompts, calls, loads, files); additive, so plain SUM yields true counts
- the skill menu attachment is split into per-skill definition pieces (parseSkillListingMenu), and MCP instruction deltas split per server, so per-entity definition cost is available without the composition view
Extractors that only fed the removed sections are deleted; the ones billing itself uses (memory, agents, tools, skill identity machinery, the per-span context snapshot) stay.
Regressions ported to billing level: usage booked once per message.id (the 2x double-count audit), skill bodies excluded from user_prompts, distinct slash-load ids.
Verified on a real session: totals exact to usage on all four columns; skills lane = 129 per-skill definition items + load usage items.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t_tokens)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…point
Measures the exact token cost of content we possess (skill menu blocks, skill bodies, memory files, agent frontmatter, MCP instructions, prompts, string tool results) against /v1/messages/count_tokens, authenticated with Claude Code's own OAuth credential (keychain / .credentials.json; falls back to ANTHROPIC_API_KEY). Free endpoint, no user setup.
- Persistent cache keyed (model, sha256) at ~/.opik-token-counts.json: hash-stable config content is measured once ever; steady-state API traffic ~0. Budgeted (10 calls/turn, biggest-first), runs in a detached child at turn end (same pattern as the /context fetcher); the next flush picks anchors up via measuredOrEstimate.
- cc_builtin self-calibration: on a session's cache-cold first call the bundled system prompt + builtin schemas are derived as the residual of call-1 usage minus everything attributable, stored per CC version, and preferred over the hand-maintained table.
- Anchors only improve the SPLIT (per-call usage stays the exact total), shrinking `unattributed` toward genuinely unobserved content.
- Opt-out: OPIK_CC_DISABLE_TOKEN_COUNT=true.
Live validation on a real session: ratio estimates ran 25-55% UNDER measured on table/code-heavy content (a 17.9k-char /context dump: estimated 4,173 vs measured 9,373), exactly the mass previously stuck in unattributed.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
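The cache lookup and the biggest-first budget can be sketched in memory as follows. This is not the plugin's code: the `measuredOrEstimate` signature shown here, `keyFor`, `pickForMeasurement`, the 4-characters-per-token placeholder estimate and the model name are all assumptions, and neither the on-disk file nor the API call is modelled.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// cacheKey identifies a measurement by model and content hash.
type cacheKey struct{ Model, SHA256 string }

// keyFor hashes the content so unchanged config maps to the same key.
func keyFor(model, content string) cacheKey {
	sum := sha256.Sum256([]byte(content))
	return cacheKey{Model: model, SHA256: hex.EncodeToString(sum[:])}
}

// estimate is a placeholder ratio estimate (4 chars per token is an
// assumption of this sketch, not the plugin's ratio).
func estimate(content string) int { return (len(content) + 3) / 4 }

// measuredOrEstimate prefers a cached measured count and falls back to the
// estimate; the bool reports whether the value was measured.
func measuredOrEstimate(cache map[cacheKey]int, model, content string) (int, bool) {
	if n, ok := cache[keyFor(model, content)]; ok {
		return n, true
	}
	return estimate(content), false
}

// pickForMeasurement returns up to budget not-yet-measured contents,
// largest first, so the biggest estimation errors are anchored first.
func pickForMeasurement(cache map[cacheKey]int, model string, contents []string, budget int) []string {
	var pending []string
	for _, c := range contents {
		if _, ok := cache[keyFor(model, c)]; !ok {
			pending = append(pending, c)
		}
	}
	sort.SliceStable(pending, func(i, j int) bool { return len(pending[i]) > len(pending[j]) })
	if len(pending) > budget {
		pending = pending[:budget]
	}
	return pending
}

func main() {
	cache := map[cacheKey]int{}
	model := "example-model" // hypothetical model id
	n, measured := measuredOrEstimate(cache, model, "abcdefgh")
	fmt.Println(n, measured) // 2 false

	cache[keyFor(model, "abcdefgh")] = 5 // pretend the endpoint returned 5
	n, measured = measuredOrEstimate(cache, model, "abcdefgh")
	fmt.Println(n, measured) // 5 true

	fmt.Println(pickForMeasurement(cache, model, []string{"aa", "aaaaaa", "aaaa", "abcdefgh"}, 2)) // [aaaaaa aaaa]
}
```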
The residual (call-1 usage minus attributable pieces) absorbs ALL unobserved request content (system reminders, deferred-tool name listings) plus the estimation drift of everything else at call 1. Validated against /context ground truth: it stored 24.4k for a bundled block that actually costs 6.8k, inflating static_overhead ~3.6x and cannibalizing the `unattributed` lane entirely. The version table + count_tokens anchors remain the source for the bundled block; unseen mass stays in `unattributed` where it's visible.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
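Keeping unseen mass in `unattributed` rather than folding it into a named piece can be illustrated with a reconciliation sketch. It follows the rules stated in this PR (usage-derived pieces are never rescaled, undershoot goes to a tail `unattributed` piece, estimates shrink proportionally on overshoot), but the `piece` type, the rounding scheme and the numbers are assumptions, not the plugin's implementation.

```go
package main

import "fmt"

// piece is one chunk of the request; Estimated marks sizes that came from an
// estimate rather than from measured usage.
type piece struct {
	Lane      string
	Tokens    int
	Estimated bool
}

// reconcile returns a copy of pieces whose sizes sum exactly to target.
// Undershoot is appended as a tail "unattributed" piece; on overshoot only
// estimated pieces shrink, proportionally, with leftover rounding tokens
// handed out one at a time (the rounding rule is an assumption).
// The sketch does not handle measured pieces alone exceeding target.
func reconcile(pieces []piece, target int) []piece {
	out := append([]piece(nil), pieces...)
	exact, est := 0, 0
	for _, p := range out {
		if p.Estimated {
			est += p.Tokens
		} else {
			exact += p.Tokens
		}
	}
	switch {
	case exact+est < target:
		out = append(out, piece{Lane: "unattributed", Tokens: target - exact - est})
	case exact+est > target && est > 0:
		budget := target - exact
		if budget < 0 {
			budget = 0
		}
		assigned := 0
		for i := range out {
			if out[i].Estimated {
				out[i].Tokens = out[i].Tokens * budget / est
				assigned += out[i].Tokens
			}
		}
		for i := 0; assigned < budget; i = (i + 1) % len(out) {
			if out[i].Estimated {
				out[i].Tokens++
				assigned++
			}
		}
	}
	return out
}

func main() {
	pieces := []piece{
		{Lane: "static_overhead", Tokens: 1000, Estimated: true},
		{Lane: "user_prompts", Tokens: 300},
		{Lane: "tool_results", Tokens: 500, Estimated: true},
	}
	fmt.Println(reconcile(pieces, 2000)) // undershoot: 200 tokens land in unattributed
	fmt.Println(reconcile(pieces, 1500)) // overshoot: estimates shrink to 800 and 400
}
```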
…split, components table
Tier 1 (runtime): the bundled prompt's dynamic environment block (cwd, platform, git status snapshot) is locally reconstructible: rebuild it, size it via measuredOrEstimate (anchored when stable), and emit it as its own static_overhead item, with the per-version remainder as `core_prompt`. Repos with long git status visibly pay for it on every request. The deferred-tools catalog deltas now split per name: built-in names land in static_overhead/deferred_tool_names, mcp__ names stay in mcp_servers/catalog_deltas.
Tier 2 (per-release): ccBuiltinConstants gains an optional Components map: a named itemization of the bundled block (identity/harness rules, memory instructions, session guidance, per-tool schemas) produced by the capture procedure in docs/builtin-calibration.md. When present it replaces the two-entity split. Invariant: Σ components == prompt + schemas constants.
Verified end-to-end: BE breakdown endpoint serves the new items through the generic cc.billing items query with no backend changes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
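Two of the rules above are simple enough to sketch: routing deferred tool names by the `mcp__` prefix, and checking the Components invariant. The function names, tool names and token values below are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// splitDeferredNames routes deferred tool names: mcp__-prefixed names go to the
// MCP catalog deltas, everything else counts as a built-in deferred tool name.
func splitDeferredNames(names []string) (builtin, mcp []string) {
	for _, n := range names {
		if strings.HasPrefix(n, "mcp__") {
			mcp = append(mcp, n)
		} else {
			builtin = append(builtin, n)
		}
	}
	return builtin, mcp
}

// componentsConsistent checks the invariant Σ components == prompt + schemas.
func componentsConsistent(components map[string]int, prompt, schemas int) bool {
	total := 0
	for _, tokens := range components {
		total += tokens
	}
	return total == prompt+schemas
}

func main() {
	// Hypothetical tool names, for illustration only.
	builtin, mcp := splitDeferredNames([]string{"ToolA", "mcp__server__tool", "ToolB"})
	fmt.Println(builtin) // [ToolA ToolB]
	fmt.Println(mcp)     // [mcp__server__tool]

	// Hypothetical itemization; real values come from the capture procedure.
	components := map[string]int{"identity_rules": 3000, "memory_instructions": 1200, "tool_schemas": 1100}
	fmt.Println(componentsConsistent(components, 4200, 1100)) // true
}
```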
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
Rebuilds the plugin's spend attribution around one product schema, cc.billing: per-LLM-call, cache-tier-aware token attribution that is exact against API usage at every level (span → trace → any SUM across traces). This is the data contract for Opik's AI Spend dashboard (OPIK-6870): lane values, breakdowns, stacked definition/usage bars, counts and FE-side pricing all come from this block.
How it works
Anthropic's prompt caching is prefix-based, so each call's usage splits the request positionally: `[0,R)` cache_read, `[R,R+W)` cache_creation, tail fresh input. Per call, the request is laid out as ordered pieces (static config prefix, then the conversation in transcript order), reconciled to the call's measured usage (usage-derived pieces are never rescaled; undershoot lands in `unattributed`, placed at the tail where unobserved content actually sits), and cut by position. Output is booked from per-block attributed tokens. Tier tokens are therefore per-call billing events: plain SUM aggregation reproduces the API bill exactly, with no cumulative weighting and no scale factors.
Also in this PR
- … `message.usage` on every entry of a multi-block message; two summary paths summed per entry. Found via a three-way audit (transcript vs span aggregation vs /context); reproduced on fresh production data (logged 202 for a true 101, exactly 2x). Fixed by counting once per `message.id`.
- `count_tokens` anchoring: exact piece sizes via the free endpoint, authenticated with Claude Code's own OAuth credential (keychain / `.credentials.json`). Persistent `(model, sha256)` cache → static config is measured once ever; budgeted, detached at turn end. Live measurement showed ratio estimates running 25–55% under on table/code-heavy content, exactly the mass previously stuck in `unattributed`.
- … `Components` table + capture procedure in `docs/builtin-calibration.md`.
- … `cc.billing`. Net −900 lines. Trace metadata is `identity + git + billing + cc_builtin (+ context_runtime)`.
- … `loaded[]` event per load with stable unique ids (`slash:<idx>` for slash commands) so repeat loads count correctly.
- … `/context` ground truth (commit history documents the dead end).
Validation
- `Σ lanes (incl. unattributed) == input/cache_read/cache_creation/output` per call & per trace: `billing_test.go` + two real sessions, exact to the token
- `read(k+1) == read(k) + write(k)` …
- … `message.id` (the 2x regression): `cumulative_test.go`
- …: `cumulative_test.go`
- `by_size`/`by_content`-style partitions sum to lane totals: `billing_test.go`
- …: `count_tokens_test.go`
Full suite green. Consumer-side queries (Opik BE) are documented on OPIK-6870.
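The cache-continuity relation `read(k+1) == read(k) + write(k)` can be checked over a session's calls with a short sketch. It assumes consecutive calls in one warm-cache session with no cache expiry in between; the `usage` type, function name and numbers are illustrative.

```go
package main

import "fmt"

// usage holds the cache columns of one call's measured usage.
type usage struct {
	CacheRead, CacheCreation int
}

// firstContinuityBreak returns the index k of the first call for which
// read(k+1) != read(k) + write(k), or -1 when every adjacent pair agrees.
func firstContinuityBreak(calls []usage) int {
	for k := 0; k+1 < len(calls); k++ {
		if calls[k+1].CacheRead != calls[k].CacheRead+calls[k].CacheCreation {
			return k
		}
	}
	return -1
}

func main() {
	calls := []usage{
		{CacheRead: 0, CacheCreation: 1500}, // cache-cold first call
		{CacheRead: 1500, CacheCreation: 400},
		{CacheRead: 1900, CacheCreation: 250},
	}
	fmt.Println(firstContinuityBreak(calls)) // -1
}
```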
🤖 Generated with Claude Code