SOW-0003: claude-code source adapter by ktsaou · Pull Request #28 · netdata/ai-viewer

ktsaou · 2026-05-29T11:47:39Z

Summary

Adds the claude-code adapter, the second of the three Phase-2 source adapters. It ingests ~/.claude/projects transcripts + subagents/ sidechains into the canonical model, mirroring aiagent_v3.

parser.go — pure JSONL decoder, every observed type + one SourceError per unknown type (wired into Scan at scanner.go:398 + Tail at :614).
mapper.go/ops.go — record→canonical per adapter-claude-code.md §5.4: turns, llm/tool/reasoning ops, compaction → first-class OpKind='compaction', sub-agent session synthesis (NativeID=<parent>:agent:<agentId>, ParentNativeID), metadata snapshots → property/log, no new turn on isCompactSummary.
cursor.go — durable per-relative-path byte offset (Bun/Node dup dirs don't collide); restart = zero dup/gap.
scanner.go/tailer.go — tree walk + sidechains + orphan-root; fsnotify tail with partial-line parking + new-dir catch-up. Sessions stay running (no terminal signal).
Registered "claude-code"; auto-discovery probe for ~/.claude/projects (+ $CLAUDE_CONFIG_DIR); sanitize-fixture.sh gains --format=claude_code; 7 golden fixtures.
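
The unknown-type tolerance described for parser.go can be sketched as follows. This is a minimal illustration, not the adapter's code: `envelope`, `knownTypes`, and `decodeTypes` are hypothetical names, and the list of known types is a made-up subset.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// envelope decodes only the discriminator field of a JSONL record.
type envelope struct {
	Type string `json:"type"`
}

// knownTypes is an illustrative subset, not the adapter's actual list.
var knownTypes = map[string]bool{"user": true, "assistant": true, "system": true}

// decodeTypes counts records of known types and returns each unknown type
// name once, in first-seen order, however many records carry it.
func decodeTypes(input string) (known int, unknown []string) {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		var env envelope
		if err := json.Unmarshal(sc.Bytes(), &env); err != nil {
			continue // malformed or empty line; a real parser would report it
		}
		switch {
		case knownTypes[env.Type]:
			known++
		case !seen[env.Type]:
			seen[env.Type] = true
			unknown = append(unknown, env.Type)
		}
	}
	return known, unknown
}

func main() {
	in := "{\"type\":\"user\"}\n{\"type\":\"future-thing\"}\n{\"type\":\"future-thing\"}\n{\"type\":\"assistant\"}\n"
	known, unknown := decodeTypes(in)
	fmt.Println(known, unknown)
}
```

Reporting once per type keeps a transcript that contains thousands of records of one new type from flooding the error channel.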

Judgment call for reviewers: the goldens are synthetic-but-shape-verified (real compaction transcripts are multi-MB; goldens stay small/deterministic). Real-data breadth is covered by a backfill of the operator's ~/.claude/projects: 1,020 sessions, 0 source errors.

Test plan

build/vet/golangci/gosec: 0 findings; govulncheck: 0 called vulnerabilities
go test -race -count=1 ./internal/adapters/claude_code/... ./cmd/... pass; 83.5% coverage
fuzz 30s = 11.3M execs, 0 crashes; cursor-restart no-dup/no-gap; sub-agent + compaction + unknown-type-tolerance + probe tests
scan-secrets.sh PASS (478 files incl. new fixtures); real-data backfill 1,020 sessions / 0 errors

…ts + compaction)

Adds internal/adapters/claude_code/, mirroring aiagent_v3, ingesting ~/.claude/projects transcripts + their subagents/ sidechains:

- parser.go: pure JSONL decoder for the record envelope + every observed type, with one SourceError per unknown type (wired into Scan + Tail).
- mapper.go/ops.go: record -> canonical per adapter-claude-code.md §5.4: turns, llm/tool/reasoning ops, compaction as a first-class OpKind='compaction' (Ts/EndTs/BytesIn=preTokens/BytesOut=postTokens/Extras), sub-agent session synthesis (NativeID=<parent>:agent:<agentId>, ParentNativeID), metadata snapshots as property/log updates, no new turn on isCompactSummary.
- cursor.go: durable per-relative-path byte offset (Bun/Node duplicate dirs do not collide); restart = zero dup, zero gap.
- scanner.go/tailer.go: tree walk + sidechains + orphan-root; fsnotify tail with partial-line parking + new-directory catch-up. Sessions stay 'running' (no native terminal signal).
- Registered as "claude-code"; auto-discovery probe for ~/.claude/projects (+ $CLAUDE_CONFIG_DIR) in cmd/ai-viewer-ingest; counts surfaced at startup.
- scripts/sanitize-fixture.sh gains a --format=claude_code mode; 7 synthetic shape-verified golden fixtures under testdata/claude_code/.
- Spec: adapter-claude-code.md compaction mapping aligned to OpCompaction.

Gates green (build/vet/golangci/gosec 0; race tests 83.5% cover; fuzz 30s 0 crashes; scan-secrets PASS). Real-data backfill: 1,020 sessions, 0 source errors.
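
A rough sketch of how a per-relative-path byte-offset cursor and partial-line parking can combine to give "zero dup, zero gap" on restart. `cursorKey`, `consume`, and the in-memory `offsets` map are hypothetical names; the real cursor is durable and its layout is not shown in this PR.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
)

// cursorKey derives the cursor key from the path relative to the source
// root, so equal file names under different directories stay distinct.
func cursorKey(root, path string) (string, error) {
	rel, err := filepath.Rel(root, path)
	if err != nil {
		return "", err
	}
	return filepath.ToSlash(rel), nil
}

// consume returns the complete lines at or after offset and the offset to
// persist. A trailing fragment without '\n' is parked: it is not returned
// and the offset stays at its first byte.
func consume(data []byte, offset int64) (lines []string, next int64) {
	if offset < 0 || offset > int64(len(data)) {
		return nil, offset // stale cursor; truncation handling is out of scope
	}
	next = offset
	for {
		i := bytes.IndexByte(data[next:], '\n')
		if i < 0 {
			return lines, next
		}
		lines = append(lines, string(data[next:next+int64(i)]))
		next += int64(i) + 1
	}
}

func main() {
	offsets := map[string]int64{} // stand-in for the durable cursor store
	key, _ := cursorKey("/src/projects", "/src/projects/p1/session.jsonl")

	data := []byte("{\"a\":1}\n{\"b\":2}\n{\"c\"")
	lines, next := consume(data, offsets[key])
	offsets[key] = next
	fmt.Println(len(lines), next)

	// The writer finishes the parked line; the next pass picks it up whole.
	data = append(data, []byte(":3}\n")...)
	lines, next = consume(data, offsets[key])
	offsets[key] = next
	fmt.Println(lines, next)
}
```

Because the offset only ever advances past a newline, replaying from the persisted offset neither repeats a consumed line nor drops a half-written one.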

…ounded tail

- ingester re-links ops.child_session_id from the stashed child native id once the referenced sidechain session lands; the parent Agent op is written before its child session exists, so the link is deferred to the resolver pass and the owning parent session is notified so an open detail view refetches.
- Agent ops are finalized at scan/tail EOF instead of being left open.
- compaction records now emit a LogEntry plus the full compaction metadata in Extras (token counts, pre/post summary), not just a status flag.
- PayloadRef emission for tool I/O and file attachments resolved within the source root.
- Tail resumes from the scan cursor and catches up appended lines instead of skipping rows written during the Scan-to-Tail handoff.
- unknown record types are de-duplicated per variant rather than per occurrence.
- scanner and tailer constrain traversal to the source root via symlink-eval containment.

…uniform containment

Addresses defects found reviewing the prior fix round:

- payload refs are now op-scoped so they never reference a non-existent op: the compaction summary attaches to the compaction op; a bare file attachment emits its filename/displayPath/type in the LogEntry extras instead of an orphan ref. The ingester's applyPayloadRef defensively verifies the op exists and, on a miss, surfaces a source parse error and skips the ref rather than letting a foreign-key violation roll back the whole batch.
- Agent-op finalization is child-side (the format has no parent tool_result) and now survives the Scan->Tail boundary: a fully-consumed transcript is replayed to rebuild the per-file Agent-op state with emission suppressed, so a parent op emitted during Scan is finalized when its child completes during Tail. A child is finalized only on a quiescent EOF (a later flush/tick with no new append), never on the transient byte-EOF of a file still being written.
- symlink containment is applied uniformly to the meta-file reads and the Tail transcript reads, not just Scan discovery.
- an oversized line skips only that line and continues, instead of advancing the cursor to EOF and dropping every later record.
- pr-link records accumulate into sessions.extras_json.prLinks[] rather than a singular field that later links overwrote.

Adds an end-to-end test that ingests the compaction fixture through the real writer (the seam the per-package tests could not cover) plus orphan-ref guard tests. Specs updated in lockstep (adapter-claude-code.md, ingester.md).

…t marker

Replaces the quiescent-EOF finalize heuristic (cycle counter + childAtEOF + sweep) with the format's actual completion signal: a subagent sidechain is complete when its last record is an assistant message whose first content block is text (verified against real transcripts: 397/400 sidechains end this way).

- an Agent op finalizes when its child sidechain is fully read AND its terminal record is assistant-text, emitted gated on the resume offset so a catch-up replay re-reading the same terminal record emits no second finalize. Pairing is event-driven: a completed child whose parent op is not yet observed is parked and finalized once the parent op appears. A child terminated by a user or tool_use record stays running. The finalized set is consulted before emitting, so a child is finalized exactly once.
- a late subagents/agent-*.meta.json now repairs the parent Agent op's child linkage: a meta change forces a from-zero re-read of the owning transcript so the op re-emits with the resolved child native id.
- the compaction-summary payload is no longer dropped for a compaction op that occurs before any user turn (guard keys on op seq, not turn seq).
- the projects root is symlink-resolved once per scan and threaded through the meta reads instead of being re-resolved per file.

Corrects the b_subagent_sidechain fixture, whose child transcript ended with a synthetic system record that does not occur in real sidechains. Spec §8.1 / §5.4 / §9.2 updated to the terminal-marker model.
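
The terminal-marker rule (last record is an assistant message whose first content block is text) might look like the sketch below. The JSON field layout (`type`, `message.content[].type`) is an assumption made for illustration, and `isAssistantText` / `sidechainComplete` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record models only the fields this check needs. The field layout is an
// assumption for illustration, not a schema taken from the adapter.
type record struct {
	Type    string `json:"type"`
	Message struct {
		Content json.RawMessage `json:"content"`
	} `json:"message"`
}

type block struct {
	Type string `json:"type"`
}

// isAssistantText reports whether one line is an assistant record whose
// first content block is text.
func isAssistantText(line string) bool {
	var r record
	if err := json.Unmarshal([]byte(line), &r); err != nil || r.Type != "assistant" {
		return false
	}
	var blocks []block
	if err := json.Unmarshal(r.Message.Content, &blocks); err != nil {
		return false
	}
	return len(blocks) > 0 && blocks[0].Type == "text"
}

// sidechainComplete looks at the physical last line only, so a trailing
// malformed or non-assistant line keeps the child running.
func sidechainComplete(lines []string) bool {
	return len(lines) > 0 && isAssistantText(lines[len(lines)-1])
}

func main() {
	text := `{"type":"assistant","message":{"content":[{"type":"text","text":"done"}]}}`
	tool := `{"type":"assistant","message":{"content":[{"type":"tool_use"}]}}`
	fmt.Println(sidechainComplete([]string{tool, text}))
	fmt.Println(sidechainComplete([]string{text, tool}))
}
```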

- the subagent completion marker now reflects the PHYSICAL last record: streamLines clears lastRecordAssistantText on the parse-error and skipped-record paths, so a child whose last line is a skipped no-op or malformed record is no longer treated as complete (it stays running).
- transcript opens and meta reads now use the symlink-resolved path returned by the containment check rather than the original path, closing a check-then-open TOCTOU window.
- a meta-sidecar read or JSON-parse failure on a present file now surfaces a source error instead of being silently skipped, so a malformed late .meta.json no longer drops sub-agent linkage repair without a trace.
- a late .meta.json now also re-reads the affected child sidechain, so the child session's agent name is repaired, not just the parent op's child link.
- parked completions (a child observed complete before its parent op is known) are persisted in the cursor and restored on restart, so a parent that first appears after a restart still finalizes. The finalize-emit gating is unchanged, so a replay still does not double-emit.

Specs updated in lockstep (adapter-claude-code.md §6.1 and §8.1).

Sweeps every remaining instance of three bug classes whose first occurrences were fixed in prior rounds:

- completion-marker flag: streamLines now updates lastRecordAssistantText on the oversized-line path too (not only the parse-error and skipped-record paths), so a child whose physical last line is oversized is not treated as complete.
- parked-completion state is now bidirectional: a sub-agent re-read that is no longer complete (it grew a trailing tool_use/user record after a prior pass parked it) retracts its stale park, so it cannot finalize the parent op. The emit gate stays on the add branch only, so a replay of a still-complete child is neither re-added nor wrongly retracted.
- path containment now covers every read: the tail meta-hash read and the orphan-root earliest-timestamp read open the symlink-resolved path (matching the transcript and meta reads), and a meta-hash read failure surfaces a source error instead of being silently swallowed.

The set of already-finalized child sessions is persisted in the cursor alongside the parked set, so a late .meta.json re-read (or a restart) cannot re-emit a finalize for a child that already finalized. Spec updated in lockstep (adapter-claude-code.md §6.1, §7, §8.1, and the §5.1 sub-agent native-id wording).
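
The parked/finalized bookkeeping described above can be modeled as a small state machine. This is a simplified sketch: the `pairing` type and its methods are hypothetical, children are keyed by a plain string id, and persistence into the cursor is left out.

```go
package main

import "fmt"

// pairing tracks which children are parked (complete, parent op unseen),
// which parent ops have been seen, and which children already finalized.
// All names here are hypothetical.
type pairing struct {
	parked    map[string]bool
	parents   map[string]bool
	finalized map[string]bool
}

func newPairing() *pairing {
	return &pairing{
		parked:    map[string]bool{},
		parents:   map[string]bool{},
		finalized: map[string]bool{},
	}
}

// observeChild records one read of a child sidechain and reports whether a
// finalize should be emitted now.
func (p *pairing) observeChild(id string, complete bool) bool {
	if p.finalized[id] {
		return false // a child finalizes exactly once
	}
	if !complete {
		delete(p.parked, id) // retract a stale park
		return false
	}
	if p.parents[id] {
		p.finalized[id] = true
		return true
	}
	p.parked[id] = true
	return false
}

// observeParent records that the parent Agent op for child id was seen and
// reports whether a parked child can be finalized now.
func (p *pairing) observeParent(id string) bool {
	p.parents[id] = true
	if p.parked[id] && !p.finalized[id] {
		delete(p.parked, id)
		p.finalized[id] = true
		return true
	}
	return false
}

func main() {
	p := newPairing()
	fmt.Println(p.observeChild("a1", true)) // parked, parent not seen yet
	fmt.Println(p.observeParent("a1"))      // parent arrives, finalize
	fmt.Println(p.observeChild("a1", true)) // replay, no second finalize
}
```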

…ning

Replaces the late-meta repair that re-read transcripts from offset 0 with emission enabled; that path re-emitted SessionStarted/OpStarted/OpFinalized and double-counted the catalog rollups (which accumulate on conflict) on a .meta.json rewrite. The link is now resolved without re-emitting any catalog-counted event:

- the parent Agent op stashes the tool_use id it already has (no .meta.json needed); the child sub-agent session stashes its own tool_use id from its meta; a new additive resolver pass links ops.child_session_id by matching the two at the DB layer. It matches nothing for ops without that stash, so the other adapters are unaffected.
- a late child .meta.json now repairs the child session's agent name via a SessionUpdated event (which does not touch the catalog), not a transcript re-read.
- ops.extras_json is merged with json_patch on conflict instead of being replaced wholesale, so a later re-emit of an op cannot erase the stashed join key the resolver still needs.
- meta-sidecar reads are bounded (1 MiB cap; oversize surfaces a source error), and the tail watch + meta-hash walks now use the symlink-resolved projects root so a symlinked root is not silently skipped.

The catalog's increment-on-conflict is an ingester-wide concern (it would also mis-count a defensive truncation rescan); tracked separately in SOW-0020. Spec wording for path containment softened to best-effort for a single-user read-only tool. Specs updated in lockstep (adapter-claude-code.md, ingester.md).

…tras, cursor keys)

Completes the linkage repair so it holds in every ordering and under a symlinked projects root, with no shared-ingester side effects:

- a late .meta.json now repairs the child session's tool_use id (not only its agent name) via the catalog-safe SessionUpdated, so a child whose transcript is read before its own meta still links to its parent op once the meta arrives.
- the aiViewer join-key stash is preserved across a stash-free re-emit on BOTH the session and op upserts, grafted per key with json_set rather than json_patch: json_patch treats a JSON null as a delete, which could drop an adapter's null-valued extras; json_set only adds the named stash key when it is missing and never deletes another key.
- tail cursor keys are derived against the same resolved root the scan cursor uses, so a symlinked projects root no longer produces a key that misses the cursor entry and re-reads a file from the start (which had re-emitted history and inflated the catalog rollups).
- the tool_use-id resolver link is constrained to the op's structural child, so duplicate or forged tool_use ids in one source cannot cross-link.

Removes stale comments referencing the deleted from-zero re-read path and fixes a spec self-contradiction (the parent Agent op carries the child native id only when the meta is present at map time). Specs updated in lockstep.

…xtras graft

- the late .meta.json repair (a catalog-safe SessionUpdated carrying the child's agent name and tool_use id) is now factored into one shared function called by BOTH scan and tail, so a meta first observed during scan (parent and child transcripts already consumed in a prior run) is repaired instead of being silently recorded as seen and then skipped by tail. Unifying the path stops scan and tail from drifting apart.
- the op extras graft, when the re-emitted op carries no extras, now keeps only the preserved aiViewer join-key stash rather than the entire previous extras blob, so a later stash-free re-emit no longer stale-preserves unrelated keys.

Specs updated in lockstep; the b/f goldens each gain one idempotent SessionUpdated line from the unified scan-side repair.
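
The graft semantics can be shown with an in-memory analogue. The real graft is a SQL expression inside the UPSERT, which is not reproduced here; `graft` is a hypothetical function, and the flat key layout is an assumption (the stash may be nested in the actual extras JSON). The two key names are the ones the later spec correction lists.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stashKeys are the join keys carried over from the previous extras.
var stashKeys = []string{"toolUseId", "childNativeId"}

// graft returns the new extras plus any stash key that the new extras lack
// and the previous extras had. It never deletes a key from next, including
// keys whose value is null, and it carries over nothing else from prev.
func graft(prev, next map[string]any) map[string]any {
	out := make(map[string]any, len(next)+len(stashKeys))
	for k, v := range next {
		out[k] = v
	}
	for _, k := range stashKeys {
		if _, present := out[k]; present {
			continue
		}
		if v, ok := prev[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	prev := map[string]any{"toolUseId": "toolu_1", "note": "old"}
	next := map[string]any{"status": nil}
	out, _ := json.Marshal(graft(prev, next))
	fmt.Println(string(out)) // prints {"status":null,"toolUseId":"toolu_1"}
}
```

A merge-patch style merge would have dropped `status` because its value is null, and a wholesale carry-over would have kept the stale `note`; the per-key graft does neither.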

…boundary gap

A test case used a real MCP namespace from the operator's environment that embedded the operator's name; replaced it with a synthetic namespace. The secret scanner had missed it because the operator-name rule matched on \b, and \b treats '_' as a word character, so a name embedded after an underscore (e.g. an MCP namespace "<tool>_<name>") never hit a word boundary. The name rule now uses a non-alphanumeric token boundary on both sides ('_' is a delimiter), so such embeddings are caught while a longer word that merely shares the name as a prefix is not. Adds a regression case to the scanner self-test for the underscore form.
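
The boundary difference is easy to reproduce. The sketch below uses Go's regexp package purely to illustrate the two rules; the scanner itself is a shell script whose exact pattern is not shown in this PR, and "alice" is a made-up stand-in for the operator name.

```go
package main

import (
	"fmt"
	"regexp"
)

// "alice" stands in for the operator name; it is a made-up value.
var (
	// \b sees '_' as a word character, so "_alice" has no boundary before 'a'.
	wordBoundary = regexp.MustCompile(`\balice\b`)
	// Any non-alphanumeric character (including '_') or a string edge
	// delimits the token.
	tokenBoundary = regexp.MustCompile(`(?:^|[^A-Za-z0-9])alice(?:[^A-Za-z0-9]|$)`)
)

func main() {
	for _, s := range []string{"mcp__tool_alice", "alice", "alicewonder"} {
		fmt.Printf("%-16s word=%v token=%v\n",
			s, wordBoundary.MatchString(s), tokenBoundary.MatchString(s))
	}
}
```

The underscore form matches only the token-boundary rule, the bare name matches both, and the longer word that shares the name as a prefix matches neither.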

…alize docs

- the meta/transcript/watch discovery walks (collectMetaPaths, markExistingDirty, addWatchTree) now surface a non-not-exist WalkDir error via onError instead of swallowing it, so an unreadable subtree, meta, or watch is visible in the Sources panel rather than silently missed. An absent directory is still not an error.
- spec: the extras graft preserves only the toolUseId and childNativeId stash keys; parentNativeId is re-derived and written fresh by SessionStarted on every re-emit, not grafted (wording corrected to match the code).
- spec: documented that the late-meta parent-op finalize has one ordering it does not cover (a meta observed only after its already-completed child was consumed), which the format does not produce (the .meta.json is written at subagent spawn and so predates child completion; verified on real data), and the residual is a benign status lag that the next full scan finalizes.

…-10 docs

- the discovery chain no longer aborts the whole source scan on one broken entry: an unreadable project dir, a broken session subtree, or a relpath failure on a single transcript is surfaced via onError (visible in /api/health and the Sources panel) and skipped, so discovery continues with the rest. Only the configured projects root itself being unreadable stays fatal. This matches the fail-soft handling of the other discovery walks and the ingester's advance-past-the-bad-batch model: one bad file can no longer zero out ingestion.
- docs: broadened the late-meta finalize limitation to cover the live-tail spawn-race read-order case (an accepted, cosmetic, self-healing status lag, not data loss); corrected the scanner self-test comments to match the token-boundary rule; and replaced the upstream-source citation placeholders with the examined frozen-mirror commit.

Marks SOW-0003 completed and moves it to done/ after the adapter converged through the external-review rounds (final round unanimous, no blockers). Plus the two documentation nits from the final round:

- corrected the errLineTooLong comment, which still said the scanner "skips to EOF" on an oversized line; it drains just that line and continues (the behavior was already correct; only the comment was stale).
- clarified in the storage-layout section that the tool-results/ spill path is for oversized Bash/PowerShell tool OUTPUT (persistedOutputPath) and is distinct from a compact_file_reference, which points at the original project file (per §3.4).

The catalog-rollup idempotency-under-re-emission concern is tracked separately as SOW-0020 (pending).

The shared aiViewer extras-graft expression is spliced into the sessions and ops UPSERT SQL via string interpolation, which gosec flags as G202 (SQL string concatenation). The only interpolated value is graftAiViewerExtras's output, built solely from a compile-time-constant column literal and the package-const aiViewerStashKeys slice; no caller or source-derived input ever reaches the SQL, so the finding is a false positive. Scope the suppression to G202 with a justification so the whole-repo `gosec -severity medium -confidence medium ./...` lint gate passes.

ktsaou added 2 commits May 29, 2026 14:46

ktsaou force-pushed the sow-0003-claude-code-adapter branch from 7cf0393 to d1399b6 Compare May 29, 2026 12:45

ktsaou added 12 commits May 29, 2026 16:40

ktsaou merged commit 7ead41a into master May 30, 2026
6 checks passed

ktsaou deleted the sow-0003-claude-code-adapter branch May 30, 2026 08:43

ktsaou merged 14 commits into master from sow-0003-claude-code-adapter
