feat: trace-stats meter + cross-repo cold-vs-trace benchmark (v0.8.0) by kgohil · Pull Request #26 · kgohil/trace-my-code

kgohil · 2026-06-26T20:11:07Z

What

Adds trace-stats — a built-in effectiveness meter (the /ctx-stats analog) — and grounds the skill's value in a cross-repo cold-vs-trace benchmark (5 planning tasks, 2 repos, full claude -p telemetry). Refreshes the benefits, benchmark, and agent-loop image. Bumps the skill to v0.8.0.

Why

The skill claimed value ("read the map, not the territory; reuse instead of reinvent") but had no built-in way to measure whether a repo's trace is earning its keep, and its public proof leaned on a single n=1 A/B + private anecdotes. This makes the value measurable (the meter) and grounds it in real, reproducible numbers (the benchmark).

Changes

skills/trace-my-code/hooks/trace-stats.sh (new) — the meter:

Coverage — areas with an ARCHITECTURE.md vs significant source dirs
Map compression — trace tokens vs codebase tokens
Citation health — how many `path › symbol` citations still resolve (reuses the drift-hook check)
Freshness — Mode-B auto-refresh commits + open _TODO: confirm_ debt
Quality grade — claude-md-style A–F (citation accuracy, currency, conciseness, gotcha coverage)
Context footprint — how much smaller an area's doc is than its code
--json (CI) / --citations (list broken). Pure bash (3.2+), reads only, degrades cleanly on a trace-less repo. It scores the map; the cold-vs-trace savings come from the A/B below.

Cross-repo A/B benchmark (new) — 5 planning tasks, cold vs trace, across multi-tool-app and honojs/hono (trace bootstrapped on just the touched areas):

median −64% input · −33% cost · −59% wall time, ~⅓ the files/turns, same correct plan (5/5)
input drops more than cost (cache economics) — stated, not hidden; the opaque-domain run keeps the correctness win
raw per-task telemetry committed: benchmarks/runs/2026-06-cross-repo.csv

SKILL.md → v0.8.0 — "Measuring effectiveness" section; cross-repo proof-point; claude-md learnings folded in (terse / document-only-the-non-obvious Guardrail; Mode-A "harvest what was missing" reflection).

Benchmarks + README — lead with trace-stats + the cross-repo A/B; trace-stats reframed as a map-scorer, not a savings-estimator; real-repo session numbers refreshed (1:55 compression, 98% citation accuracy, grade 77).

Image — assets/agentic-loop.jpg stats strip → −64% input · −59% wall time · ⅓ the files · same correct plan (diagram otherwise unchanged).

Housekeeping — install.md §5, plugin manifests + CHANGELOG → 0.8.0.

Reviewer notes

trace-stats.sh is portable bash 3.2 (macOS), no deps, read-only, no network.
Benchmark method: each arm a real claude -p --output-format json run; n=1/task (the 5-task spread is the variance, not error bars); planning-phase proxy; every task had a real reuse target. Honest limits are in benchmarks/README.md.
A–F weighting: citation accuracy 35% / currency 25% / conciseness 15% / gotcha 15% / coverage 10%.
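As a rough illustration of how the weights above could combine into one score, here is a minimal Python sketch. It is not the bash implementation in trace-stats.sh: the function name, the 0-100 sub-score scale, and the sample sub-scores are assumptions made for the example, and the letter-grade cutoffs are deliberately left out because the PR does not state them.

```python
# Rubric weights as listed in the reviewer notes (percent of the total score).
WEIGHTS = {
    "citation_accuracy": 35,
    "currency": 25,
    "conciseness": 15,
    "gotcha": 15,
    "coverage": 10,
}


def weighted_score(subscores):
    """Combine 0-100 sub-scores into a single 0-100 score using WEIGHTS.

    `subscores` is a hypothetical dict keyed like WEIGHTS; the real meter
    may compute and name its criteria differently.
    """
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must sum to 100")
    total = sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)
    return total / 100


# Hypothetical sub-scores for illustration only; these are not measured values.
EXAMPLE = {
    "citation_accuracy": 100,
    "currency": 80,
    "conciseness": 60,
    "gotcha": 40,
    "coverage": 50,
}

if __name__ == "__main__":
    print(weighted_score(EXAMPLE))
```

Because citation accuracy carries 35% of the weight, a repo with many broken citations loses more points than one that is merely verbose.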

…-md authoring discipline

- trace-stats.sh (the /ctx-stats analog): coverage, map compression (trace tokens vs codebase tokens), citation health (resolving path-symbol citations), freshness (auto-refresh commits + open _TODO_ debt), a claude-md-style A-F quality grade, and estimated tokens saved per task. --json / --citations. Pure bash 3.2+, reads only, degrades cleanly on a trace-less repo.
- SKILL.md (v0.8.0): "Measuring effectiveness" section, proven-on-a-real-repo benefits, and claude-md learnings folded into Guardrails (terse / document only the non-obvious) + a Mode-A "harvest what was missing" reflection trigger.
- benchmarks/README: lead with the trace-stats meter + a real ~100k-line-repo session (1:55 compression, 92% citations, ~22x cheaper context per area); drop opaque private-project anecdotes; keep the controlled cold-vs-trace A/B.
- install.md: section 5 (run trace-stats); plugin manifests + CHANGELOG -> 0.8.0.

…al use-case

- agentic-loop.jpg: regenerated via gemini-3-pro-image-preview; example box now "add a hash-generator tool to the app"; stats bar now "-56% files read · ~22x cheaper context · 92% citations resolve · reuse, not reinvent".
- README: image alt matches; Early-signal section leads with the trace-stats meter + the real ~100k-line-repo session (1:55 compression, 92% citations, ~22x cheaper), then the controlled cold-vs-trace A/B (jargon generalized).

…chmark consistency

The "real result" (trace-stats) and the "controlled A/B" now use the SAME repo: ran a paired cold-vs-trace planning probe on the multi-tool-app ("plan adding a UUID Generator tool"):

- cold (no trace): 17 files, 99,402 tok, 70s
- trace: 4 files, 84,148 tok, 38s

→ -76% files, -15% tokens, -45% time

Both reached the same correct plan, so the win here is pure efficiency; the private-repo run is kept as the correctness data point (cold built a parallel gate; map-only hallucinated a vendor).

- README + benchmarks/README: A/B table swapped to the same-repo numbers.
- agentic-loop.jpg: stat bar -56% -> -76% (regenerated), restored to 2752x1536.
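The three percentages in this commit follow directly from the cold and trace absolutes it quotes. The sketch below recomputes them; the function name is an assumption, and truncation toward zero is used only because it reproduces the quoted figures (rounding to nearest would give -46% for time).

```python
def pct_delta(cold, trace):
    """Percent change from the cold arm to the trace arm, truncated toward zero."""
    return int((trace - cold) * 100 / cold)


# Absolutes quoted in the commit message for the paired UUID Generator probe.
PROBE = {
    "files": (17, 4),
    "tokens": (99_402, 84_148),
    "seconds": (70, 38),
}

if __name__ == "__main__":
    for metric, (cold, trace) in PROBE.items():
        print(f"{metric}: {pct_delta(cold, trace)}%")
```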

Reviewer (correctly) asked why the A/B token delta is only -15% when files drop -76%: fixed per-agent overhead dominates, and a capable cold agent greps rather than bulk-reading, so its extra files are cheap. The "~22x cheaper context" / "~77k tokens saved per task" framing overclaimed: that's the map's COMPRESSION ratio (worst-case full-crawl ceiling), not a measured per-task saving.

- trace-stats: "ESTIMATED SAVINGS" -> "CONTEXT FOOTPRINT" with an explicit ceiling-not-saving caveat (measured token delta ~-15%; files & time are the robust wins).
- SKILL.md / README / benchmarks: relabel ~22x as "map ~22x smaller" (compression) and surface the measured -15% tokens / -76% files / -45% time.

…rger

Reviewer point: reading more files = more INPUT tokens, but output is ~constant. The -15% is TOTAL subagent_tokens, dominated by an arm-independent fixed input (system prompt + tool schemas) + a ~constant output (the plan). The trace cuts the *variable input* (files read); that delta is real but diluted in the total, and the harness reports only total, so -76% files is the cleaner proxy for the context saving. Noted in README, benchmarks, and the trace-stats caveat.
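The dilution argument can be made concrete with a small sketch. The totals are the two figures from the paired probe (99,402 and 84,148 tokens); the fixed overhead is NOT reported by the harness, so the 60,000-token value used below is purely hypothetical and only shows the direction of the effect.

```python
def variable_cut_pct(total_cold, total_trace, fixed_tokens):
    """Percent cut in the non-fixed tokens for an assumed arm-independent fixed part.

    `fixed_tokens` stands for system prompt + tool schemas + the ~constant plan
    output. It is an assumption: the harness reports only the totals.
    """
    variable_cold = total_cold - fixed_tokens
    variable_trace = total_trace - fixed_tokens
    if variable_cold <= 0:
        raise ValueError("fixed_tokens must be smaller than the cold total")
    return (variable_trace - variable_cold) * 100 / variable_cold


if __name__ == "__main__":
    cold, trace = 99_402, 84_148           # totals from the paired probe
    for fixed in (0, 60_000):              # 60_000 is a hypothetical overhead
        cut = variable_cut_pct(cold, trace, fixed)
        print(f"assumed fixed={fixed}: variable part changes by {cut:.1f}%")
```

With no fixed part, the cut equals the reported total delta; the larger the assumed fixed share, the larger the implied cut in the variable input, which is why the total understates the context saving.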

`trace-stats --citations` on a real trace flagged 18 "broken" — but 14 were false: dotted member-access citations (`RootState.selectedModel`, `scripts.build`, `AiService.generateImage`) were squashed to a single token (`RootStateselectedModel`) that never greps. Match the LAST identifier instead (`selectedModel`), the bare member that's actually grep-able. On the test repo: 92% → 98% resolved (4 genuinely broken left). Same fix applied to doc-drift.sh so the drift hook stops warning on valid member-access citations.
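The before/after matching behaviour described here can be sketched as follows. This is an illustration in Python, not the bash code in trace-stats.sh or doc-drift.sh: the function names, the regexes, and the word-boundary search are all assumptions.

```python
import re


def squashed_token(symbol):
    """Old behaviour as described: non-identifier characters dropped, parts fused."""
    return re.sub(r"[^A-Za-z0-9_]", "", symbol)


def last_identifier(symbol):
    """Fixed behaviour: keep only the last identifier of a dotted citation."""
    parts = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", symbol)
    return parts[-1] if parts else ""


def resolves(token, source_text):
    """True if the token appears as a whole word in the source text."""
    if not token:
        return False
    return re.search(r"\b" + re.escape(token) + r"\b", source_text) is not None


if __name__ == "__main__":
    # Hypothetical source snippet standing in for the cited file.
    source = "interface RootState { selectedModel: string }"
    cited = "RootState.selectedModel"
    print(resolves(squashed_token(cited), source))   # fused token never matches
    print(resolves(last_identifier(cited), source))  # bare member matches
```

Matching only the last identifier trades some precision for recall: a member name that also exists elsewhere in the file would resolve even if the cited owner was renamed.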

Replace the single-task n=1 A/B with 5 planning tasks across multi-tool-app and honojs/hono. Median -64% input / -33% cost / -59% time, same correct plan; opaque-domain run keeps the correctness win. Refresh trace-stats numbers (98% citations, grade 77), reframe it as a map-scorer (savings come from the A/B), update the agent-loop stats strip, add raw per-task CSV.

trace-stats.sh: footprint footnote now cites the measured -64% input / -33% cost / -59% time (was the stale -15% / -76% / -45%); JSON exposes area_doc_tokens + area_code_tokens instead of the misleading est_tokens_saved_per_task key. CHANGELOG 0.8.0: 98% citations, "context footprint" not "estimated savings", and the cross-repo A/B summary.
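A consumer of the --json output could derive the footprint ratio from the two keys named above. The sketch assumes a flat JSON object with numeric values; the actual payload shape is not shown in this PR, and the sample numbers are hypothetical.

```python
import json


def footprint_ratio(report_json):
    """Return how many times smaller the area doc is than its code, or None.

    Assumes a flat object with `area_doc_tokens` and `area_code_tokens`;
    the real trace-stats payload may nest or name things differently.
    """
    report = json.loads(report_json)
    doc_tokens = report["area_doc_tokens"]
    code_tokens = report["area_code_tokens"]
    if doc_tokens <= 0:
        return None
    return code_tokens / doc_tokens


if __name__ == "__main__":
    # Hypothetical payload for illustration only.
    sample = '{"area_doc_tokens": 500, "area_code_tokens": 10000}'
    print(footprint_ratio(sample))
```

The ratio is a compression ceiling for the map, in line with the "context footprint" framing, and should not be read as a per-task saving.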

…n image

Reorder to lead with the crux: TL;DR -> What it solves -> Why this vs an auto-generated code graph (the differentiation: read a map, don't query a graph; curated+grounded; domain+why; cheap to refresh) -> Setup, then the rest. Pull trace-stats out of the README (lives in benchmarks/ now). Drop the stale -12% tokens line from before/after (relabel it the opaque-domain case). Regenerate the agent-loop stats strip to include -33% cost.

README + benchmarks + SKILL: cut fluff, hedging, and flourish; punchier sentences; all technical substance and numbers kept. benchmarks/README reordered crux-first — the cold-vs-trace result leads, methodology and trace-stats follow. No claim or figure changed.

…table

README "The numbers": define the two arms explicitly and expand the median-only table into a full trace-disabled | trace-enabled | Δ table with real UUID-task absolutes (355,115 -> 143,329 input, $1.68 -> $1.15, 6 -> 1 files, 9 -> 2 turns), then the 5-task medians. Align benchmarks/README wording to the same terms; colloquial "cold agent" in the problem narrative -> "fresh agent".

…ipline

Rename the meter trace-stats -> trace-eval (it grades quality, not just stats). Enhancements from /claude-md-improver + revise-claude-md:

- Grade Patterns & extension-points coverage (15% of the rubric; the reuse-first section that stops reinvention); rebalanced weights. multi-tool-app: C/77 -> B/80.
- "What to curate" worklist: weakest ARCHITECTURE docs worst-first, each tagged with the exact missing criterion (no Patterns section / broken citation / open _TODO_); the report-before-edit reflex.
- --gaps: significant dirs (>=3 src files) with no ARCHITECTURE.md, most-code-first (the bootstrap worklist).
- Structured Mode-A reflection routing each harvested learning to its doc section (pattern->Patterns, gotcha->Gotchas, invariant->Invariants, vendor->External).
- Explicit "don't capture the obvious" avoid-list in SKILL Guardrails + the ARCHITECTURE template.
- Curate-worst-first wired into Mode 0 / bootstrap.

JSON adds patterns_ok_pct + gotcha_ok_pct. Docs/CHANGELOG/install updated; grade refs refreshed to 80/B.
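The --gaps rule (directories with at least 3 source files and no ARCHITECTURE.md, most code first) can be sketched roughly as below. This is a Python illustration rather than the bash implementation; the extension list, the use of byte size as the "most code" measure, and counting only a directory's direct children are all assumptions.

```python
import os

# Assumption: which extensions count as source files is not stated in the PR.
SRC_EXTS = {".py", ".ts", ".js", ".go", ".sh"}


def find_gaps(root, min_files=3):
    """List dirs with >= min_files source files and no ARCHITECTURE.md, biggest first."""
    gaps = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden directories such as .git.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        sources = [f for f in filenames if os.path.splitext(f)[1] in SRC_EXTS]
        if len(sources) >= min_files and "ARCHITECTURE.md" not in filenames:
            size = sum(os.path.getsize(os.path.join(dirpath, f)) for f in sources)
            gaps.append((size, os.path.relpath(dirpath, root)))
    gaps.sort(reverse=True)  # most code (by bytes) first
    return [path for _, path in gaps]


if __name__ == "__main__":
    for path in find_gaps("."):
        print(path)
```

Ordering by code volume puts the areas where a missing map costs the most reading at the top of the bootstrap worklist.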

…SC2034)

…ot precision)

Borrow from /caveman + /caveman-compress for the doc creation/maintenance piece:

- word-level compression standard for doc prose (drop articles/filler/hedging, fragments fine, short synonyms); makes trace-eval's conciseness score actionable.
- the carve-out: never compress away invariants/absences, branch conditions, magic numbers + source, security gotchas, or citations/values/versions ("read-only").

Lands in SKILL Guardrails + the ARCHITECTURE template authoring rule. Skipped wiring the caveman-compress CLI: wrong tool for human-curated, drift-maintained docs.

kgohil added 7 commits June 26, 2026 15:27

kgohil changed the title ~~feat: trace-stats effectiveness meter + real-repo benchmark (v0.8.0)~~ feat: trace-stats meter + cross-repo cold-vs-trace benchmark (v0.8.0) Jun 27, 2026

kgohil added 7 commits June 26, 2026 20:17

fix(trace-eval): drop unused severity var in curate loop (shellcheck …

41aa8d4

…SC2034)

kgohil wants to merge 14 commits into main from feat/trace-stats-effectiveness
