feat: trace-stats meter + cross-repo cold-vs-trace benchmark (v0.8.0) #26
Open
kgohil wants to merge 14 commits into
…-md authoring discipline
- trace-stats.sh, the /ctx-stats analog: coverage, map compression (trace tokens vs codebase tokens), citation health (resolving path-symbol citations), freshness (auto-refresh commits + open _TODO_ debt), a claude-md-style A-F quality grade, and estimated tokens saved per task. --json / --citations. Pure bash 3.2+, reads only, degrades cleanly on a trace-less repo.
- SKILL.md (v0.8.0): "Measuring effectiveness" section, proven-on-a-real-repo benefits, and claude-md learnings folded into Guardrails (terse / document only the non-obvious) + a Mode-A "harvest what was missing" reflection trigger.
- benchmarks/README: lead with the trace-stats meter + a real ~100k-line-repo session (1:55 compression, 92% citations, ~22x cheaper context per area); drop opaque private-project anecdotes; keep the controlled cold-vs-trace A/B.
- install.md: section 5 (run trace-stats); plugin manifests + CHANGELOG -> 0.8.0.
…al use-case
- agentic-loop.jpg: regenerated via gemini-3-pro-image-preview; example box now "add a hash-generator tool to the app"; stats bar now "-56% files read · ~22x cheaper context · 92% citations resolve · reuse, not reinvent".
- README: image alt matches; Early-signal section leads with the trace-stats meter + the real ~100k-line-repo session (1:55 compression, 92% citations, ~22x cheaper), then the controlled cold-vs-trace A/B (jargon generalized).
…chmark consistency
The "real result" (trace-stats) and the "controlled A/B" now use the SAME repo —
ran a paired cold-vs-trace planning probe on the multi-tool-app ("plan adding a
UUID Generator tool"):
- cold (no trace): 17 files, 99,402 tok, 70s
- trace: 4 files, 84,148 tok, 38s → -76% files, -15% tokens, -45% time
Both reached the same correct plan, so the win here is pure efficiency; the
private-repo run is kept as the correctness data point (cold built a parallel
gate; map-only hallucinated a vendor).
- README + benchmarks/README: A/B table swapped to the same-repo numbers.
- agentic-loop.jpg: stat bar -56% -> -76% (regenerated), restored to 2752x1536.
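The deltas quoted above can be reproduced from the raw cold and trace figures. The following is a minimal sketch, not code from this PR: `pct_delta` is a hypothetical helper name, and it assumes the percentages are truncated toward zero (which is how bash integer division behaves, and which matches the -45% reported for time where rounding would give -46%).

```shell
#!/usr/bin/env bash
# pct_delta BASE NEW: signed percent change from BASE to NEW.
# Hypothetical helper for illustration; integer math, truncated toward zero.
pct_delta() {
  local base=$1 new=$2
  if [ "$base" -eq 0 ]; then
    echo "n/a"   # percent change is undefined for a zero baseline
    return 1
  fi
  echo $(( (new - base) * 100 / base ))
}

# Raw figures from the paired probe above (cold first, trace second).
printf 'files:  %s%%\n' "$(pct_delta 17 4)"
printf 'tokens: %s%%\n' "$(pct_delta 99402 84148)"
printf 'time:   %s%%\n' "$(pct_delta 70 38)"
```

Integer arithmetic keeps the sketch dependency-free on bash 3.2; a report that needs rounding or decimals would have to hand the division to `awk` or `bc` instead.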
Reviewer (correctly) asked why the A/B token delta is only -15% when files drop -76%: fixed per-agent overhead dominates, and a capable cold agent greps rather than bulk-reading, so its extra files are cheap. The "~22x cheaper context" / "~77k tokens saved per task" framing overclaimed: that's the map's COMPRESSION ratio (worst-case full-crawl ceiling), not a measured per-task saving.
- trace-stats: "ESTIMATED SAVINGS" -> "CONTEXT FOOTPRINT" with an explicit ceiling-not-saving caveat (measured token delta ~-15%; files & time are the robust wins).
- SKILL.md / README / benchmarks: relabel ~22x as "map ~22x smaller" (compression) and surface the measured -15% tokens / -76% files / -45% time.
…rger
Reviewer point: reading more files = more INPUT tokens, but output is ~constant. The -15% is TOTAL subagent_tokens, dominated by an arm-independent fixed input (system prompt + tool schemas) + a ~constant output (the plan). The trace cuts the *variable input* (files read); that delta is real but diluted in the total, and the harness reports only total, so -76% files is the cleaner proxy for the context saving. Noted in README, benchmarks, and the trace-stats caveat.
`trace-stats --citations` on a real trace flagged 18 "broken" — but 14 were false: dotted member-access citations (`RootState.selectedModel`, `scripts.build`, `AiService.generateImage`) were squashed to a single token (`RootStateselectedModel`) that never greps. Match the LAST identifier instead (`selectedModel`), the bare member that's actually grep-able. On the test repo: 92% → 98% resolved (4 genuinely broken left). Same fix applied to doc-drift.sh so the drift hook stops warning on valid member-access citations.
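The last-identifier rule described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `trace-stats.sh` or `doc-drift.sh`: `last_ident` and `citation_resolves` are hypothetical names, and a whole-word fixed-string `grep` is assumed to be an acceptable notion of "resolves".

```shell
#!/usr/bin/env bash
# last_ident SYMBOL: keep only the text after the final dot, so
# "RootState.selectedModel" becomes "selectedModel". A symbol with no
# dot is returned unchanged. Hypothetical helper for illustration.
last_ident() {
  printf '%s\n' "${1##*.}"
}

# citation_resolves FILE SYMBOL: succeed when FILE exists and contains the
# last identifier of SYMBOL as a whole word (fixed string, not a regex).
citation_resolves() {
  local file=$1 needle
  needle=$(last_ident "$2")
  [ -f "$file" ] && grep -qwF -- "$needle" "$file"
}
```

Called as `citation_resolves src/store.ts RootState.selectedModel` (a hypothetical path), the check greps for `selectedModel` rather than the squashed `RootStateselectedModel`. The trade-off is that a bare member name is a looser match than the fully qualified symbol, so a common name such as `build` could resolve against an unrelated occurrence.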
Replace the single-task n=1 A/B with 5 planning tasks across multi-tool-app and honojs/hono. Median -64% input / -33% cost / -59% time, same correct plan; opaque-domain run keeps the correctness win. Refresh trace-stats numbers (98% citations, grade 77), reframe it as a map-scorer (savings come from the A/B), update the agent-loop stats strip, add raw per-task CSV.
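Because each task is a single run, the summary statistic is a median over the five per-task deltas. A minimal sketch of that reduction is shown here; `median` is a hypothetical helper, the column layout of the raw CSV is not assumed, and the sample values in the usage note are placeholders rather than benchmark data.

```shell
#!/usr/bin/env bash
# median: read one number per line on stdin and print the median.
# Odd counts pick the middle value; even counts average the two middle values.
# Hypothetical helper for illustration.
median() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                       # no input: fail
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}
```

For example, `printf '%s\n' 5 1 4 2 3 | median` prints `3`. With only five tasks, the median is less sensitive to one outlier task than a mean would be.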
trace-stats.sh: footprint footnote now cites the measured -64% input / -33% cost / -59% time (was the stale -15% / -76% / -45%); JSON exposes area_doc_tokens + area_code_tokens instead of the misleading est_tokens_saved_per_task key. CHANGELOG 0.8.0: 98% citations, "context footprint" not "estimated savings", and the cross-repo A/B summary.
…n image
Reorder to lead with the crux: TL;DR -> What it solves -> Why this vs an auto-generated code graph (the differentiation: read a map, don't query a graph; curated+grounded; domain+why; cheap to refresh) -> Setup, then the rest. Pull trace-stats out of the README (lives in benchmarks/ now). Drop the stale -12% tokens line from before/after (relabel it the opaque-domain case). Regenerate the agent-loop stats strip to include -33% cost.
README + benchmarks + SKILL: cut fluff, hedging, and flourish; punchier sentences; all technical substance and numbers kept. benchmarks/README reordered crux-first — the cold-vs-trace result leads, methodology and trace-stats follow. No claim or figure changed.
…table
README "The numbers": define the two arms explicitly and expand the median-only table into a full trace-disabled | trace-enabled | Δ table with real UUID-task absolutes (355,115 -> 143,329 input, $1.68 -> $1.15, 6 -> 1 files, 9 -> 2 turns), then the 5-task medians. Align benchmarks/README wording to the same terms; colloquial "cold agent" in the problem narrative -> "fresh agent".
…ipline
Rename the meter trace-stats -> trace-eval (it grades quality, not just stats). Enhancements from /claude-md-improver + revise-claude-md:
- Grade Patterns & extension-points coverage (15% of the rubric; the reuse-first section that stops reinvention); rebalanced weights. multi-tool-app: C/77 -> B/80.
- "What to curate" worklist: weakest ARCHITECTURE docs worst-first, each tagged with the exact missing criterion (no Patterns section / broken citation / open _TODO_). This is the report-before-edit reflex.
- --gaps: significant dirs (>=3 src files) with no ARCHITECTURE.md, most-code-first (the bootstrap worklist).
- Structured Mode-A reflection routing each harvested learning to its doc section (pattern->Patterns, gotcha->Gotchas, invariant->Invariants, vendor->External).
- Explicit "don't capture the obvious" avoid-list in SKILL Guardrails + the ARCHITECTURE template.
- Curate-worst-first wired into Mode 0 / bootstrap.
JSON adds patterns_ok_pct + gotcha_ok_pct. Docs/CHANGELOG/install updated; grade refs refreshed to 80/B.
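The `--gaps` rule above (directories with at least 3 source files and no `ARCHITECTURE.md`, largest first) can be approximated as follows. This is a sketch under assumptions, not the trace-eval implementation: `list_gaps` is a hypothetical name, the extension list is an illustrative guess at what counts as a source file, files are counted per immediate directory rather than recursively, and file count stands in for "most code".

```shell
#!/usr/bin/env bash
# list_gaps ROOT [MIN]: print "<count> <dir>" for every directory under ROOT
# that directly holds at least MIN source files (default 3) and has no
# ARCHITECTURE.md, sorted with the largest count first.
# Hypothetical helper; the extension list is an assumption.
list_gaps() {
  local root=${1:-.} min=${2:-3}
  find "$root" -type f \
      \( -name '*.ts' -o -name '*.tsx' -o -name '*.js' \
         -o -name '*.py' -o -name '*.go' -o -name '*.sh' \) \
      ! -path '*/node_modules/*' ! -path '*/.git/*' |
    sed 's|/[^/]*$||' |        # strip the file name, keep its directory
    sort | uniq -c |           # count source files per directory
    while read -r count dir; do
      if [ "$count" -ge "$min" ] && [ ! -f "$dir/ARCHITECTURE.md" ]; then
        printf '%s %s\n' "$count" "$dir"
      fi
    done |
    sort -rn
}
```

Running `list_gaps src` would list candidate directories for a first `ARCHITECTURE.md`, which is the bootstrap worklist the item describes. Counting lines or tokens instead of files would track "most code" more closely, at the cost of reading every file.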
…ot precision)
Borrow from /caveman + /caveman-compress for the doc creation/maintenance piece:
- word-level compression standard for doc prose (drop articles/filler/hedging,
fragments fine, short synonyms) — makes trace-eval's conciseness score actionable.
- the carve-out: never compress away invariants/absences, branch conditions, magic
numbers + source, security gotchas, or citations/values/versions ("read-only").
Lands in SKILL Guardrails + the ARCHITECTURE template authoring rule. Skipped wiring
the caveman-compress CLI — wrong tool for human-curated, drift-maintained docs.
What
Adds `trace-stats`, a built-in effectiveness meter (the `/ctx-stats` analog), and grounds the skill's value in a cross-repo cold-vs-trace benchmark (5 planning tasks, 2 repos, full `claude -p` telemetry). Refreshes the benefits, benchmark, and agent-loop image. Bumps the skill to v0.8.0.
Why
The skill claimed value ("read the map, not the territory; reuse instead of reinvent") but had no built-in way to measure whether a repo's trace is earning its keep, and its public proof leaned on a single n=1 A/B + private anecdotes. This makes the value measurable (the meter) and grounds it in real, reproducible numbers (the benchmark).
Changes
- `skills/trace-my-code/hooks/trace-stats.sh` (new), the meter:
  - `ARCHITECTURE.md` vs significant source dirs
  - `path › symbol` citations still resolve (reuses the drift-hook check)
  - `_TODO: confirm_` debt
  - `--json` (CI) / `--citations` (list broken)
  Pure bash (3.2+), reads only, degrades cleanly on a trace-less repo. It scores the map; the cold-vs-trace savings come from the A/B below.
- Cross-repo A/B benchmark (new): 5 planning tasks, cold vs trace, across multi-tool-app and honojs/hono (trace bootstrapped on just the touched areas):
  - `benchmarks/runs/2026-06-cross-repo.csv`
- SKILL.md → v0.8.0: "Measuring effectiveness" section; cross-repo proof-point; claude-md learnings folded in (terse / document-only-the-non-obvious Guardrail; Mode-A "harvest what was missing" reflection).
- Benchmarks + README: lead with `trace-stats` + the cross-repo A/B; trace-stats reframed as a map-scorer, not a savings-estimator; real-repo session numbers refreshed (1:55 compression, 98% citation accuracy, grade 77).
- Image: `assets/agentic-loop.jpg` stats strip → −64% input · −59% wall time · ⅓ the files · same correct plan (diagram otherwise unchanged).
- Housekeeping: install.md §5, plugin manifests + CHANGELOG → 0.8.0.
Reviewer notes
- `trace-stats.sh` is portable bash 3.2 (macOS), no deps, read-only, no network.
- `claude -p --output-format json` run; n=1/task (the 5-task spread is the variance, not error bars); planning-phase proxy; every task had a real reuse target. Honest limits are in `benchmarks/README.md`.