feat: cc.billing — exact, cache-tier, per-call token attribution (OPIK-6873)#21

Merged: jverre merged 10 commits into main from jacques/OPIK-6873-cumulative-lane-accounting on Jun 11, 2026

@jverre (Collaborator) commented Jun 11, 2026

Summary

Rebuilds the plugin's spend attribution around one product schema, cc.billing: per-LLM-call, cache-tier-aware token attribution that is exact against API usage at every level (span → trace → any SUM across traces). This is the data contract for Opik's AI Spend dashboard (OPIK-6870): lane values, breakdowns, stacked definition/usage bars, counts and FE-side pricing all come from this block.

cc.billing:
  llm_calls: 3
  model: claude-fable-5
  totals: {total, input, cache_read, cache_creation, output}   # == Σ original_usage.* exactly
  lanes:
    <laneKey>:                                                 # fixed JSON path per lane
      total, input, cache_read, cache_creation, output
      items: [{name, kind: definition|usage, count, total, input, cache_read, cache_creation, output}]
    unattributed: {...}                                        # explicit remainder — what makes sums exact

How it works

Anthropic's prompt caching is prefix-based, so each call's usage splits the request by position: [0,R) is cache_read, [R,R+W) is cache_creation, and the tail is fresh input (R and W being the call's cache_read and cache_creation token counts). For each call, the request is laid out as ordered pieces (the static config prefix, then the conversation in transcript order) and reconciled to the call's measured usage: usage-derived pieces are never rescaled, and any undershoot lands in unattributed, placed at the tail where unobserved content actually sits. The reconciled layout is then cut by position. Output is booked from per-block attributed tokens. Tier tokens are therefore per-call billing events, so plain SUM aggregation reproduces the API bill exactly, with no cumulative weighting and no scale factors.
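
The positional cut described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: `Piece`, `Tiers`, `overlap` and `cutByPosition` are hypothetical names, and the sketch assumes the pieces have already been reconciled so that they sum to the call's total input-side tokens.

```go
package main

import "fmt"

// Piece is one contiguous span of the request, in request order.
// Hypothetical type for illustration only.
type Piece struct {
	Lane   string
	Tokens int
}

// Tiers is the input-side split booked to one lane.
type Tiers struct {
	CacheRead, CacheCreation, Input int
}

// overlap returns the length of the intersection of [a0,a1) and [b0,b1).
func overlap(a0, a1, b0, b1 int) int {
	lo, hi := a0, a1
	if b0 > lo {
		lo = b0
	}
	if b1 < hi {
		hi = b1
	}
	if hi <= lo {
		return 0
	}
	return hi - lo
}

// cutByPosition walks the pieces in order and books each one against the
// call's usage: [0,read) is cache_read, [read,read+write) is cache_creation,
// and whatever lies beyond that is fresh input.
func cutByPosition(pieces []Piece, read, write int) map[string]Tiers {
	out := map[string]Tiers{}
	pos := 0
	for _, p := range pieces {
		start, end := pos, pos+p.Tokens
		r := overlap(start, end, 0, read)
		w := overlap(start, end, read, read+write)
		tier := out[p.Lane]
		tier.CacheRead += r
		tier.CacheCreation += w
		tier.Input += p.Tokens - r - w
		out[p.Lane] = tier
		pos = end
	}
	return out
}

func main() {
	// Made-up sizes: 2000 request tokens, of which 1100 were read from
	// cache, 600 were written to cache, and 300 were fresh input.
	pieces := []Piece{
		{Lane: "static_overhead", Tokens: 1000},
		{Lane: "user_prompts", Tokens: 300},
		{Lane: "tool_results", Tokens: 500},
		{Lane: "unattributed", Tokens: 200},
	}
	cut := cutByPosition(pieces, 1100, 600)
	for _, p := range pieces {
		fmt.Printf("%-16s %+v\n", p.Lane, cut[p.Lane])
	}
}
```

Because every token of every piece lands in exactly one tier, the per-lane columns sum back to the call's usage by construction, which is why no scale factor is needed afterwards.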

Also in this PR

  • Bug fix — 2x usage double-count: the transcript repeats message.usage on every entry of a multi-block message; two summary paths summed per entry. Found via a three-way audit (transcript vs span aggregation vs /context); reproduced on fresh production data (logged 202 for a true 101, exactly 2x). Fixed by counting once per message.id.
  • count_tokens anchoring: exact piece sizes via the free endpoint, authenticated with Claude Code's own OAuth credential (keychain / .credentials.json). A persistent cache keyed by (model, sha256) means static config is measured only once; measurement is budgeted and runs detached at turn end. Live measurement showed ratio estimates running 25–55% under the measured counts on table/code-heavy content, which is exactly the mass previously stuck in unattributed.
  • Static overhead itemization: the dynamic environment block (cwd, platform, git status) is reconstructed locally and carved out as its own item — repos with long git status visibly pay for it per request; deferred-catalog names split built-in vs MCP; optional per-release Components table + capture procedure in docs/builtin-calibration.md.
  • Payload consolidation: the per-lane composition emitters (skills/tools/memory/agents summaries, user_prompts, tool_results, file_attachments, prior_assistant, thinking, assistant_text, output_tokens) are removed — every UI surface they fed now reads cc.billing. Net −900 lines. Trace metadata is identity + git + billing + cc_builtin (+ context_runtime).
  • Skill load identity: one loaded[] event per load with stable unique ids (slash:<idx> for slash commands) so repeat loads count correctly.
  • Removed after validation: a call-1 residual self-calibration for the bundled system prompt — it absorbed all unobserved request content and inflated static overhead 3.6x vs /context ground truth (commit history documents the dead end).

Validation

| Invariant | Where |
| --- | --- |
| Σ lanes (incl. unattributed) == input/cache_read/cache_creation/output, per call and per trace | billing_test.go + two real sessions, exact to the token |
| Cache-prefix chain: read(k+1) == read(k) + write(k) | real-session audit (850k-context session, token-perfect) |
| Usage booked once per message.id (the 2x regression) | cumulative_test.go |
| Skill bodies never leak into user_prompts; repeat loads keep distinct ids | cumulative_test.go |
| by_size/by_content-style partitions sum to lane totals | billing_test.go |
| Anchoring budget/baseline mechanics, cache merge, anchor preference | count_tokens_test.go |
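
As an illustration of the cache-prefix chain invariant, a check over a session's consecutive calls might look like the sketch below. `CallUsage` and `firstChainBreak` are hypothetical names, not the audit's actual code. The sketch also assumes the cache stays warm and the prefix is unchanged between calls; a break can be legitimate (for example after a cache expiry) rather than a collection bug.

```go
package main

import "fmt"

// CallUsage holds the two cache columns of one call's usage.
// Hypothetical type for illustration only.
type CallUsage struct {
	CacheRead     int
	CacheCreation int
}

// firstChainBreak returns the index k of the first call for which
// read(k+1) != read(k) + write(k), or -1 if the whole chain holds.
func firstChainBreak(calls []CallUsage) int {
	for k := 0; k+1 < len(calls); k++ {
		want := calls[k].CacheRead + calls[k].CacheCreation
		if calls[k+1].CacheRead != want {
			return k
		}
	}
	return -1
}

func main() {
	// Made-up session: a cache-cold first call, then two warm calls.
	session := []CallUsage{
		{CacheRead: 0, CacheCreation: 1000},
		{CacheRead: 1000, CacheCreation: 200},
		{CacheRead: 1200, CacheCreation: 50},
	}
	fmt.Println("first break at:", firstChainBreak(session))
}
```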

Full suite green. Consumer-side queries (Opik BE) are documented on OPIK-6870.

🤖 Generated with Claude Code

jverre and others added 10 commits June 11, 2026 15:35
…counts (OPIK-6873)

Token sums for user_prompts, tool_results and file_attachments switch from
turn-only to cumulative-to-date: every prior prompt/result/attachment is
replayed in each request, so SUM of the per-trace value across a session's
traces now yields billing-weighted attribution (size × turns it rides in).
Counts stay new-this-turn so the same SUM yields true item counts instead
of a quadratic blow-up.

- user_prompts: add by_size[] {bucket, tokens (cumulative), count
  (new-this-turn)}, bucketing each prompt individually.
- prior_assistant: per-block attribution — text + thinking only, computed
  from AttributedOutputTokens; tool_use share excluded (it belongs to the
  tool lanes; counting it here double-attributed). New by_content
  {assistant_text, thinking}.
- skills.loaded: one event per load (latest-wins collapsed repeat loads,
  under-counting tokens and loads); slash-command loads get a stable
  synthetic "slash:<entry idx>" tool_use_id so consumers can dedupe load
  events across the cumulative per-trace arrays.

Validation (cumulative_test.go): replay formula Σ size × (N − i + 1),
count conservation (Σ per-turn == whole-session), by_size/by_content
partition their lane totals, tool_use share + lane total == measured
usage.output_tokens. Also verified against two real transcripts (805
entries / 1 turn and 1,111 entries / 29 turns): monotone cumulative
tokens, exact count conservation (29 prompts, 200 tool calls).
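
A small sketch of the replay formula, under the simplifying assumption of exactly one new item per turn (the function names are hypothetical): summing the per-trace cumulative values across a session gives the same number as Σ size × (N − i + 1).

```go
package main

import "fmt"

// perTurnCumulative returns, for each turn, the cumulative-to-date token
// sum, i.e. the value each trace would carry under this scheme.
func perTurnCumulative(sizes []int) []int {
	out := make([]int, len(sizes))
	running := 0
	for i, s := range sizes {
		running += s
		out[i] = running
	}
	return out
}

// replayWeighted evaluates Σ size_i × (N − i + 1) with 1-based i.
func replayWeighted(sizes []int) int {
	n := len(sizes)
	total := 0
	for i, s := range sizes {
		// i is 0-based here, so the item rides in n-i turns.
		total += s * (n - i)
	}
	return total
}

func main() {
	sizes := []int{100, 50, 200} // made-up prompt sizes for turns 1..3
	sum := 0
	for _, v := range perTurnCumulative(sizes) {
		sum += v
	}
	fmt.Println(sum, replayWeighted(sizes)) // both are 600
}
```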

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…in row

Audit of a real 3-turn session (OPIK-6873) against per-call API usage
found two collection errors:

- The transcript repeats the same message.usage on every entry of a
  multi-block message (one entry per content block, shared message.id).
  assistantOutputTotals and cumulativeMessagesTokens summed per ENTRY,
  inflating prior-assistant totals and the context snapshot's messages
  category ~2x (observed: 2,302 reported vs 1,151 real). Both now count
  once per message id. The per-block attribution path (DeduplicateUsage)
  was already correct.

- cc_builtin matched 2.1.173 sessions against the 2.1.150 row, where
  built-in tool schemas were still always-on (17.6k). 2.1.173 defers most
  of the catalog (1.1k always-on), so static overhead was overstated 4.3x.
  Added the 2.1.173 row captured from /context.

After both fixes the audited session's lane composition reconciles with
its /context ground truth per category, not just in aggregate.
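
The first fix can be sketched as below. The commit does not show assistantOutputTotals or cumulativeMessagesTokens, so `Entry`, `sumPerEntry` and `sumPerMessage` are hypothetical stand-ins, and the token values are made up.

```go
package main

import "fmt"

// Entry is one transcript entry: one content block of an assistant message.
// The message-level usage is repeated on every entry of the same message.
type Entry struct {
	MessageID    string
	OutputTokens int
}

// sumPerEntry reproduces the bug: usage is added once per entry, so a
// message with several content blocks is counted several times.
func sumPerEntry(entries []Entry) int {
	total := 0
	for _, e := range entries {
		total += e.OutputTokens
	}
	return total
}

// sumPerMessage books usage once per message id.
func sumPerMessage(entries []Entry) int {
	seen := map[string]bool{}
	total := 0
	for _, e := range entries {
		if seen[e.MessageID] {
			continue
		}
		seen[e.MessageID] = true
		total += e.OutputTokens
	}
	return total
}

func main() {
	entries := []Entry{
		{MessageID: "msg_a", OutputTokens: 40}, // text block
		{MessageID: "msg_a", OutputTokens: 40}, // thinking block, same usage
		{MessageID: "msg_a", OutputTokens: 40}, // tool_use block, same usage
		{MessageID: "msg_b", OutputTokens: 25},
	}
	fmt.Println(sumPerEntry(entries), sumPerMessage(entries))
}
```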

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Lays each LLM call's request out as ordered pieces (static config prefix,
then the conversation in transcript order) and cuts by position against the
call's usage: [0,R) cache_read, [R,R+W) cache_creation, tail fresh — prompt
caching is prefix-based, so position determines the billing tier. Output is
booked from per-block attributed tokens.

Exactness contract, validated per call and therefore per trace (trace usage
is the sum of its spans): by_lane columns — including an explicit
`unattributed` row that absorbs unparsed content (system reminders, request
envelope) and estimation drift — sum exactly to input_tokens, cache_read,
cache_creation and output_tokens. Usage-derived pieces are never rescaled;
estimates shrink proportionally only on overshoot.
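
One way to read the reconciliation rule is sketched below. It is an assumption-laden illustration: `Piece`, its `Estimated` flag and `reconcile` are hypothetical, rounding loss from integer scaling is sent to `unattributed`, and the case where usage-derived pieces alone exceed the target is not handled.

```go
package main

import "fmt"

// Piece is one span of the request. Estimated marks sizes that come from
// an estimate rather than from measured usage. Hypothetical type.
type Piece struct {
	Lane      string
	Tokens    int
	Estimated bool
}

// reconcile makes the pieces sum to target (the call's measured input-side
// tokens). Usage-derived pieces are kept as-is; estimates shrink
// proportionally only on overshoot; any shortfall becomes an explicit
// unattributed piece appended at the tail.
func reconcile(pieces []Piece, target int) []Piece {
	fixed, est := 0, 0
	for _, p := range pieces {
		if p.Estimated {
			est += p.Tokens
		} else {
			fixed += p.Tokens
		}
	}
	room := target - fixed // tokens left over for the estimated pieces
	if room < 0 {
		room = 0
	}
	out := make([]Piece, 0, len(pieces)+1)
	sum := 0
	for _, p := range pieces {
		if p.Estimated && est > room {
			p.Tokens = p.Tokens * room / est // integer floor
		}
		sum += p.Tokens
		out = append(out, p)
	}
	if rest := target - sum; rest > 0 {
		out = append(out, Piece{Lane: "unattributed", Tokens: rest})
	}
	return out
}

func main() {
	pieces := []Piece{
		{Lane: "tool_results", Tokens: 400},
		{Lane: "skills", Tokens: 300, Estimated: true},
		{Lane: "memory", Tokens: 100, Estimated: true},
	}
	fmt.Println(reconcile(pieces, 700))  // overshoot: estimates shrink
	fmt.Println(reconcile(pieces, 1000)) // undershoot: unattributed added
}
```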

Tier tokens are per-call billing events: additive across calls, traces and
periods, so plain SUM aggregation reproduces the API bill exactly — no
cumulative weighting needed for the billed view.

Verified against two real sessions (5 calls each): totals and lane sums
match measured usage to the token on all four columns.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…paths

Replace the by_lane/by_entity arrays with cc.billing.lanes.<laneKey>, so the
backend's composition query stays one fixed JSON path per cell:

  SUM(JSONExtractInt(metadata,'cc','billing','lanes','skills','cache_read'))

and breakdowns reuse the existing generic ARRAY JOIN pattern over
...,'lanes','<lane>','items' with label field `name`. `total` is precomputed
per lane and per item (sum of the four columns) so the lane-card/Sankey value
is also a single path. Exactness contract and tests unchanged.
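
On the producer side, the fixed-path layout could be modelled along these lines. This is a sketch only: the struct names, `newTiers`, `exampleMetadata` and `lookup` are hypothetical, and just the JSON keys (`cc`, `billing`, `lanes`, `items`, `name`, `kind`, `count` and the tier columns) come from the schema in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Tiers carries the four billing columns plus their precomputed total.
type Tiers struct {
	Total         int `json:"total"`
	Input         int `json:"input"`
	CacheRead     int `json:"cache_read"`
	CacheCreation int `json:"cache_creation"`
	Output        int `json:"output"`
}

func newTiers(input, read, creation, output int) Tiers {
	return Tiers{input + read + creation + output, input, read, creation, output}
}

type Item struct {
	Name  string `json:"name"`
	Kind  string `json:"kind"` // "definition" or "usage"
	Count int    `json:"count"`
	Tiers        // embedded: columns are flattened into the item object
}

type Lane struct {
	Tiers
	Items []Item `json:"items"`
}

type Billing struct {
	LLMCalls int             `json:"llm_calls"`
	Model    string          `json:"model"`
	Totals   Tiers           `json:"totals"`
	Lanes    map[string]Lane `json:"lanes"`
}

// exampleMetadata builds a made-up trace metadata document.
func exampleMetadata() ([]byte, error) {
	skills := Lane{
		Tiers: newTiers(50, 800, 120, 0),
		Items: []Item{{Name: "pdf", Kind: "definition", Tiers: newTiers(50, 800, 120, 0)}},
	}
	b := Billing{LLMCalls: 1, Model: "example-model", Totals: skills.Tiers,
		Lanes: map[string]Lane{"skills": skills}}
	return json.Marshal(map[string]any{"cc": map[string]any{"billing": b}})
}

// lookup walks one fixed key path, as the backend query does per cell.
func lookup(raw []byte, path ...string) (float64, error) {
	var node any
	if err := json.Unmarshal(raw, &node); err != nil {
		return 0, err
	}
	for _, key := range path {
		obj, ok := node.(map[string]any)
		if !ok {
			return 0, fmt.Errorf("cannot descend into %q: parent is not an object", key)
		}
		if node, ok = obj[key]; !ok {
			return 0, fmt.Errorf("key %q not found", key)
		}
	}
	n, ok := node.(float64)
	if !ok {
		return 0, fmt.Errorf("value at path is not a number")
	}
	return n, nil
}

func main() {
	raw, err := exampleMetadata()
	if err != nil {
		panic(err)
	}
	v, err := lookup(raw, "cc", "billing", "lanes", "skills", "cache_read")
	fmt.Println(v, err)
}
```

Keying lanes by name in a map, rather than emitting an array of lane objects, is what lets each cell resolve through one constant path without scanning.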

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Per-lane composition emitters (skills/tools/memory/agents summaries,
user_prompts, tool_results, file_attachments, prior_assistant, thinking,
assistant_text, output_tokens) are removed from the trace payload: every UI
surface they fed now comes from cc.billing, which additionally gives the
cache-tier split and is exact against API usage. The trace metadata is now
identity + git + billing + cc_builtin (+ context_runtime patched later).

cc.billing items gain what the UI panels need:
- kind: "definition" (always-on config) vs "usage" (conversation-driven) —
  drives the stacked bars and the unused badge (definition > 0, usage = 0)
- count: NEW events this turn per entity (prompts, calls, loads, files) —
  additive, so plain SUM yields true counts
- the skill menu attachment is split into per-skill definition pieces
  (parseSkillListingMenu), and MCP instruction deltas split per server, so
  per-entity definition cost is available without the composition view

Extractors that only fed the removed sections are deleted; the ones billing
itself uses (memory, agents, tools, skill identity machinery, the per-span
context snapshot) stay. Regressions ported to billing level: usage booked
once per message.id (the 2x double-count audit), skill bodies excluded from
user_prompts, distinct slash-load ids.

Verified on a real session: totals exact to usage on all four columns;
skills lane = 129 per-skill definition items + load usage items.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t_tokens)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…point

Measures the exact token cost of content we possess (skill menu blocks,
skill bodies, memory files, agent frontmatter, MCP instructions, prompts,
string tool results) against /v1/messages/count_tokens, authenticated with
Claude Code's own OAuth credential (keychain / .credentials.json; falls
back to ANTHROPIC_API_KEY). Free endpoint, no user setup.

- Persistent cache keyed (model, sha256) at ~/.opik-token-counts.json:
  hash-stable config content is measured once ever; steady-state API
  traffic ~0. Budgeted (10 calls/turn, biggest-first), runs in a detached
  child at turn end (same pattern as the /context fetcher); the next flush
  picks anchors up via measuredOrEstimate.
- cc_builtin self-calibration: on a session's cache-cold first call the
  bundled system prompt + builtin schemas are derived as the residual of
  call-1 usage minus everything attributable, stored per CC version, and
  preferred over the hand-maintained table.
- Anchors only improve the SPLIT — per-call usage stays the exact total —
  shrinking `unattributed` toward genuinely unobserved content.
- Opt-out: OPIK_CC_DISABLE_TOKEN_COUNT=true.
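
The cache lookup with estimate fallback might be sketched as follows. Only the name measuredOrEstimate and the (model, sha256) key come from the commit message; the signature, the in-memory `countCache` type, the key format and the 4-characters-per-token fallback ratio are all assumptions made for illustration, and the on-disk format of ~/.opik-token-counts.json is not modelled.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// countCache maps a (model, content hash) key to a measured token count.
// Hypothetical in-memory stand-in for the persistent cache.
type countCache map[string]int

// cacheKey derives a key from the model name and the SHA-256 of the content,
// so unchanged content is never measured twice for the same model.
func cacheKey(model, content string) string {
	sum := sha256.Sum256([]byte(content))
	return model + ":" + hex.EncodeToString(sum[:])
}

// estimateTokens is a placeholder ratio estimate (about 4 chars per token).
// The real estimator is not shown in the PR; this ratio is an assumption.
func estimateTokens(content string) int {
	return (len(content) + 3) / 4
}

// measuredOrEstimate prefers a cached measurement (an anchor) and falls
// back to the estimate when the content has not been measured yet.
func measuredOrEstimate(c countCache, model, content string) (tokens int, measured bool) {
	if n, ok := c[cacheKey(model, content)]; ok {
		return n, true
	}
	return estimateTokens(content), false
}

func main() {
	cache := countCache{}
	body := "| col a | col b |\n| --- | --- |\n| 1 | 2 |"

	n, measured := measuredOrEstimate(cache, "example-model", body)
	fmt.Println(n, measured) // estimate, not yet anchored

	// Pretend a count_tokens call returned 21 for this content.
	cache[cacheKey("example-model", body)] = 21
	n, measured = measuredOrEstimate(cache, "example-model", body)
	fmt.Println(n, measured) // anchored measurement
}
```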

Live validation on a real session: ratio estimates ran 25-55% UNDER
measured on table/code-heavy content (a 17.9k-char /context dump:
estimated 4,173 vs measured 9,373) — exactly the mass previously stuck in
unattributed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The residual (call-1 usage minus attributable pieces) absorbs ALL
unobserved request content — system reminders, deferred-tool name
listings — plus the estimation drift of everything else at call 1.
Validated against /context ground truth: it stored 24.4k for a bundled
block that actually costs 6.8k, inflating static_overhead ~3.6x and
cannibalizing the `unattributed` lane entirely.

The version table + count_tokens anchors remain the source for the
bundled block; unseen mass stays in `unattributed` where it's visible.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…split, components table

Tier 1 (runtime): the bundled prompt's dynamic environment block (cwd,
platform, git status snapshot) is locally reconstructible — rebuild it,
size it via measuredOrEstimate (anchored when stable), and emit it as its
own static_overhead item, with the per-version remainder as `core_prompt`.
Repos with long git status visibly pay for it on every request. The
deferred-tools catalog deltas now split per name: built-in names land in
static_overhead/deferred_tool_names, mcp__ names stay in
mcp_servers/catalog_deltas.

Tier 2 (per-release): ccBuiltinConstants gains an optional Components map —
a named itemization of the bundled block (identity/harness rules, memory
instructions, session guidance, per-tool schemas) produced by the capture
procedure in docs/builtin-calibration.md. When present it replaces the
two-entity split. Invariant: Σ components == prompt + schemas constants.

Verified end-to-end: BE breakdown endpoint serves the new items through the
generic cc.billing items query with no backend changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jverre changed the title from "feat: billing-weighted lane accounting — cumulative tokens, per-turn counts (OPIK-6873)" to "feat: cc.billing — exact, cache-tier, per-call token attribution (OPIK-6873)" on Jun 11, 2026
@jverre jverre merged commit e25635a into main Jun 11, 2026
1 check passed