SOW-0003: claude-code source adapter#28

Merged
ktsaou merged 14 commits into master from sow-0003-claude-code-adapter
May 30, 2026

@ktsaou ktsaou commented May 29, 2026

Summary

The claude-code adapter — second of the three Phase-2 source adapters. Ingests ~/.claude/projects transcripts + subagents/ sidechains into the canonical model, mirroring aiagent_v3.

  • parser.go — pure JSONL decoder, every observed type + one SourceError per unknown type (wired into Scan at scanner.go:398 + Tail at :614).
  • mapper.go/ops.go — record→canonical per adapter-claude-code.md §5.4: turns, llm/tool/reasoning ops, compaction → first-class OpKind='compaction', sub-agent session synthesis (NativeID=<parent>:agent:<agentId>, ParentNativeID), metadata snapshots → property/log, no new turn on isCompactSummary.
  • cursor.go — durable per-relative-path byte offset (Bun/Node dup dirs don't collide); restart = zero dup/gap.
  • scanner.go/tailer.go — tree walk + sidechains + orphan-root; fsnotify tail with partial-line parking + new-dir catch-up. Sessions stay running (no terminal signal).
  • Registered "claude-code"; auto-discovery probe for ~/.claude/projects (+ $CLAUDE_CONFIG_DIR); sanitize-fixture.sh gains --format=claude_code; 7 golden fixtures.

Judgment call for reviewers: the goldens are synthetic-but-shape-verified (real compaction transcripts are multi-MB; goldens stay small/deterministic). Real-data breadth is covered by a backfill of the operator's ~/.claude/projects: 1,020 sessions, 0 source errors.

Test plan

  • build/vet/golangci/gosec: 0 findings; govulncheck: 0 called vulnerabilities
  • go test -race -count=1 ./internal/adapters/claude_code/... ./cmd/... pass; 83.5% coverage
  • fuzz 30s = 11.3M execs, 0 crashes; cursor-restart no-dup/no-gap; sub-agent + compaction + unknown-type-tolerance + probe tests
  • scan-secrets.sh PASS (478 files incl. new fixtures); real-data backfill 1,020 sessions / 0 errors

ktsaou added 2 commits May 29, 2026 14:46
…ts + compaction)

Adds internal/adapters/claude_code/, mirroring aiagent_v3, ingesting
~/.claude/projects transcripts + their subagents/ sidechains:

- parser.go: pure JSONL decoder for the record envelope + every observed type,
  with one SourceError per unknown type (wired into Scan + Tail).
- mapper.go/ops.go: record -> canonical per adapter-claude-code.md §5.4 — turns,
  llm/tool/reasoning ops, compaction as a first-class OpKind='compaction'
  (Ts/EndTs/BytesIn=preTokens/BytesOut=postTokens/Extras), sub-agent session
  synthesis (NativeID=<parent>:agent:<agentId>, ParentNativeID), metadata
  snapshots as property/log updates, no new turn on isCompactSummary.
- cursor.go: durable per-relative-path byte offset (Bun/Node duplicate dirs do
  not collide); restart = zero dup, zero gap.
- scanner.go/tailer.go: tree walk + sidechains + orphan-root; fsnotify tail with
  partial-line parking + new-directory catch-up. Sessions stay 'running' (no
  native terminal signal).
- Registered as "claude-code"; auto-discovery probe for ~/.claude/projects (+
  $CLAUDE_CONFIG_DIR) in cmd/ai-viewer-ingest; counts surfaced at startup.
- scripts/sanitize-fixture.sh gains a --format=claude_code mode; 7 synthetic
  shape-verified golden fixtures under testdata/claude_code/.
- Spec: adapter-claude-code.md compaction mapping aligned to OpCompaction.

Gates green (build/vet/golangci/gosec 0; race tests 83.5% cover; fuzz 30s 0
crashes; scan-secrets PASS). Real-data backfill: 1,020 sessions, 0 source errors.
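
The per-relative-path byte offset and the partial-line parking described above can be pictured with a small sketch. It is not the adapter's code: `splitComplete`, the `cursor` map, and the example path are hypothetical names chosen for illustration. The idea it assumes is that the cursor only ever advances past newline-terminated lines, so a line the writer has not finished yet is re-read on the next pass.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitComplete returns the newline-terminated lines in buf and the number of
// bytes they occupy. A trailing fragment with no '\n' is neither returned nor
// counted, so the caller's cursor stays parked at the start of that fragment.
func splitComplete(buf []byte) (lines []string, consumed int64) {
	for {
		i := bytes.IndexByte(buf[consumed:], '\n')
		if i < 0 {
			return lines, consumed
		}
		lines = append(lines, string(buf[consumed:consumed+int64(i)]))
		consumed += int64(i) + 1
	}
}

func main() {
	// cursor maps a transcript path, relative to the projects root, to the
	// offset of the first unconsumed byte (hypothetical shape).
	cursor := map[string]int64{}
	const rel = "proj-a/session-1.jsonl"

	file := []byte("{\"n\":1}\n{\"n\":2}\n{\"n\":") // the writer is mid-line
	lines, n := splitComplete(file[cursor[rel]:])
	cursor[rel] += n
	fmt.Println(len(lines), cursor[rel])

	file = append(file, []byte("3}\n")...) // the writer finishes the line
	lines, n = splitComplete(file[cursor[rel]:])
	cursor[rel] += n
	fmt.Println(lines, cursor[rel])
}
```

Keying by relative path rather than by file name is what would keep two directories that contain the same session file name from sharing one offset.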
…ounded tail

- ingester re-links ops.child_session_id from the stashed child native id once
  the referenced sidechain session lands; the parent Agent op is written before
  its child session exists, so the link is deferred to the resolver pass and the
  owning parent session is notified so an open detail view refetches.
- Agent ops are finalized at scan/tail EOF instead of being left open.
- compaction records now emit a LogEntry plus the full compaction metadata in
  Extras (token counts, pre/post summary), not just a status flag.
- PayloadRef emission for tool I/O and file attachments resolved within the
  source root.
- Tail resumes from the scan cursor and catches up appended lines instead of
  skipping rows written during the Scan-to-Tail handoff.
- unknown record types are de-duplicated per variant rather than per occurrence.
- scanner and tailer constrain traversal to the source root via symlink-eval
  containment.
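
A minimal sketch of per-variant de-duplication for unknown record types, as described in the list above. The `envelope` struct, the `knownTypes` set, and `unknownTypes` are assumptions made for illustration; the adapter's real type list and error type are not shown in this PR text.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope decodes only the discriminator field of a JSONL record.
type envelope struct {
	Type string `json:"type"`
}

// knownTypes is an illustrative subset, not the adapter's actual list.
var knownTypes = map[string]bool{"user": true, "assistant": true, "system": true}

// unknownTypes returns each unknown record type once, in first-seen order,
// no matter how many records carry it.
func unknownTypes(lines []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, ln := range lines {
		var e envelope
		if err := json.Unmarshal([]byte(ln), &e); err != nil {
			continue // malformed lines are a separate error class in this sketch
		}
		if knownTypes[e.Type] || seen[e.Type] {
			continue
		}
		seen[e.Type] = true
		out = append(out, e.Type)
	}
	return out
}

func main() {
	lines := []string{
		`{"type":"user"}`,
		`{"type":"future-thing"}`,
		`{"type":"future-thing"}`,
		`{"type":"other-thing"}`,
	}
	fmt.Println(unknownTypes(lines))
}
```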
@ktsaou ktsaou force-pushed the sow-0003-claude-code-adapter branch from 7cf0393 to d1399b6 Compare May 29, 2026 12:45

ktsaou added 12 commits May 29, 2026 16:40
…uniform containment

Addresses defects found reviewing the prior fix round:

- payload refs are now op-scoped so they never reference a non-existent op: the
  compaction summary attaches to the compaction op; a bare file attachment emits
  its filename/displayPath/type in the LogEntry extras instead of an orphan ref.
  The ingester's applyPayloadRef defensively verifies the op exists and, on a
  miss, surfaces a source parse error and skips the ref rather than letting a
  foreign-key violation roll back the whole batch.
- Agent-op finalization is child-side (the format has no parent tool_result) and
  now survives the Scan->Tail boundary: a fully-consumed transcript is replayed
  to rebuild the per-file Agent-op state with emission suppressed, so a parent op
  emitted during Scan is finalized when its child completes during Tail. A child
  is finalized only on a quiescent EOF (a later flush/tick with no new append),
  never on the transient byte-EOF of a file still being written.
- symlink containment is applied uniformly to the meta-file reads and the Tail
  transcript reads, not just Scan discovery.
- an oversized line skips only that line and continues, instead of advancing the
  cursor to EOF and dropping every later record.
- pr-link records accumulate into sessions.extras_json.prLinks[] rather than a
  singular field that later links overwrote.

Adds an end-to-end test that ingests the compaction fixture through the real
writer (the seam the per-package tests could not cover) plus orphan-ref guard
tests. Specs updated in lockstep (adapter-claude-code.md, ingester.md).
…t marker

Replaces the quiescent-EOF finalize heuristic (cycle counter + childAtEOF +
sweep) with the format's actual completion signal: a subagent sidechain is
complete when its last record is an assistant message whose first content block
is text (verified against real transcripts: 397/400 sidechains end this way).

- an Agent op finalizes when its child sidechain is fully read AND its terminal
  record is assistant-text, emitted gated on the resume offset so a catch-up
  replay re-reading the same terminal record emits no second finalize. Pairing is
  event-driven: a completed child whose parent op is not yet observed is parked
  and finalized once the parent op appears. A child terminated by a user or
  tool_use record stays running. The finalized set is consulted before emitting,
  so a child is finalized exactly once.
- a late subagents/agent-*.meta.json now repairs the parent Agent op's child
  linkage: a meta change forces a from-zero re-read of the owning transcript so
  the op re-emits with the resolved child native id.
- the compaction-summary payload is no longer dropped for a compaction op that
  occurs before any user turn (guard keys on op seq, not turn seq).
- the projects root is symlink-resolved once per scan and threaded through the
  meta reads instead of being re-resolved per file.

Corrects the b_subagent_sidechain fixture, whose child transcript ended with a
synthetic system record that does not occur in real sidechains. Spec §8.1 / §5.4
/ §9.2 updated to the terminal-marker model.
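
The terminal-marker rule above (last record is an assistant message whose first content block is text) can be sketched as follows. The JSON field names (`type`, `message.content[].type`) and the function names are assumptions for illustration. The flag is recomputed for every physical line, so a malformed last line leaves the child incomplete.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record mirrors only the fields this sketch needs; the field names are
// assumed, not taken from the adapter.
type record struct {
	Type    string `json:"type"`
	Message struct {
		Content json.RawMessage `json:"content"`
	} `json:"message"`
}

// isAssistantText reports whether line is an assistant record whose first
// content block has type "text". Any decode failure yields false.
func isAssistantText(line string) bool {
	var rec record
	if err := json.Unmarshal([]byte(line), &rec); err != nil || rec.Type != "assistant" {
		return false
	}
	var blocks []struct {
		Type string `json:"type"`
	}
	if err := json.Unmarshal(rec.Message.Content, &blocks); err != nil || len(blocks) == 0 {
		return false
	}
	return blocks[0].Type == "text"
}

// sidechainComplete applies the marker to the physical last line only.
func sidechainComplete(lines []string) bool {
	last := false
	for _, ln := range lines {
		last = isAssistantText(ln)
	}
	return last
}

func main() {
	text := `{"type":"assistant","message":{"content":[{"type":"text","text":"done"}]}}`
	tool := `{"type":"assistant","message":{"content":[{"type":"tool_use","id":"x"}]}}`
	fmt.Println(sidechainComplete([]string{tool, text}))
	fmt.Println(sidechainComplete([]string{text, tool}))
}
```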

- the subagent completion marker now reflects the PHYSICAL last record: streamLines
  clears lastRecordAssistantText on the parse-error and skipped-record paths, so a
  child whose last line is a skipped no-op or malformed record is no longer treated
  as complete (it stays running).
- transcript opens and meta reads now use the symlink-resolved path returned by the
  containment check rather than the original path, closing a check-then-open
  TOCTOU window.
- a meta-sidecar read or JSON-parse failure on a present file now surfaces a source
  error instead of being silently skipped, so a malformed late .meta.json no longer
  drops sub-agent linkage repair without a trace.
- a late .meta.json now also re-reads the affected child sidechain, so the child
  session's agent name is repaired, not just the parent op's child link.
- parked completions (a child observed complete before its parent op is known) are
  persisted in the cursor and restored on restart, so a parent that first appears
  after a restart still finalizes. The finalize-emit gating is unchanged, so a
  replay still does not double-emit.

Specs updated in lockstep (adapter-claude-code.md §6.1 and §8.1).

Sweeps every remaining instance of three bug classes whose first occurrences
were fixed in prior rounds:

- completion-marker flag: streamLines now updates lastRecordAssistantText on the
  oversized-line path too (not only the parse-error and skipped-record paths), so
  a child whose physical last line is oversized is not treated as complete.
- parked-completion state is now bidirectional: a sub-agent re-read that is no
  longer complete (it grew a trailing tool_use/user record after a prior pass
  parked it) retracts its stale park, so it cannot finalize the parent op. The
  emit gate stays on the add branch only, so a replay of a still-complete child
  is neither re-added nor wrongly retracted.
- path containment now covers every read: the tail meta-hash read and the
  orphan-root earliest-timestamp read open the symlink-resolved path (matching the
  transcript and meta reads), and a meta-hash read failure surfaces a source error
  instead of being silently swallowed.

The set of already-finalized child sessions is persisted in the cursor alongside
the parked set, so a late .meta.json re-read (or a restart) cannot re-emit a
finalize for a child that already finalized.

Spec updated in lockstep (adapter-claude-code.md §6.1, §7, §8.1, and the §5.1
sub-agent native-id wording).
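
The park, retract, and finalize-once behaviour described above can be modelled as a small state machine. This is a simplified sketch with hypothetical names (`pairing`, `observeChild`, `observeParent`); it leaves out the resume-offset emit gate and the cursor persistence, and only shows how the three sets interact.

```go
package main

import "fmt"

// pairing tracks, per child session id, whether its parent op has been seen,
// whether a completed child is parked waiting for it, and whether the parent
// op was already finalized.
type pairing struct {
	parentSeen map[string]bool
	parked     map[string]bool
	finalized  map[string]bool
}

func newPairing() *pairing {
	return &pairing{
		parentSeen: map[string]bool{},
		parked:     map[string]bool{},
		finalized:  map[string]bool{},
	}
}

// observeChild records the latest read of a child sidechain and reports
// whether the parent op should be finalized now.
func (p *pairing) observeChild(child string, complete bool) bool {
	if p.finalized[child] {
		return false // finalize at most once
	}
	if !complete {
		delete(p.parked, child) // retract a stale park
		return false
	}
	if p.parentSeen[child] {
		p.finalized[child] = true
		return true
	}
	p.parked[child] = true
	return false
}

// observeParent records that the parent op for child has appeared and reports
// whether a parked completion can be finalized now.
func (p *pairing) observeParent(child string) bool {
	p.parentSeen[child] = true
	if p.finalized[child] || !p.parked[child] {
		return false
	}
	delete(p.parked, child)
	p.finalized[child] = true
	return true
}

func main() {
	p := newPairing()
	fmt.Println(p.observeChild("c1", true)) // parked, parent not seen yet
	fmt.Println(p.observeParent("c1"))      // parent arrives, finalize
	fmt.Println(p.observeChild("c1", true)) // replay, no second finalize
}
```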
…ning

Replaces the late-meta repair that re-read transcripts from offset 0 with
emission enabled — that path re-emitted SessionStarted/OpStarted/OpFinalized and
double-counted the catalog rollups (which accumulate on conflict) on a .meta.json
rewrite. The link is now resolved without re-emitting any catalog-counted event:

- the parent Agent op stashes the tool_use id it already has (no .meta.json
  needed); the child sub-agent session stashes its own tool_use id from its meta;
  a new additive resolver pass links ops.child_session_id by matching the two at
  the DB layer. It matches nothing for ops without that stash, so the other
  adapters are unaffected.
- a late child .meta.json now repairs the child session's agent name via a
  SessionUpdated event (which does not touch the catalog), not a transcript
  re-read.
- ops.extras_json is merged with json_patch on conflict instead of being replaced
  wholesale, so a later re-emit of an op cannot erase the stashed join key the
  resolver still needs.
- meta-sidecar reads are bounded (1 MiB cap; oversize surfaces a source error),
  and the tail watch + meta-hash walks now use the symlink-resolved projects root
  so a symlinked root is not silently skipped.

The catalog's increment-on-conflict is an ingester-wide concern (it would also
mis-count a defensive truncation rescan); tracked separately in SOW-0020. Spec
wording for path containment softened to best-effort for a single-user read-only
tool. Specs updated in lockstep (adapter-claude-code.md, ingester.md).
…tras, cursor keys)

Completes the linkage repair so it holds in every ordering and under a symlinked
projects root, with no shared-ingester side effects:

- a late .meta.json now repairs the child session's tool_use id (not only its
  agent name) via the catalog-safe SessionUpdated, so a child whose transcript is
  read before its own meta still links to its parent op once the meta arrives.
- the aiViewer join-key stash is preserved across a stash-free re-emit on BOTH
  the session and op upserts, grafted per key with json_set rather than
  json_patch — json_patch treats a JSON null as a delete, which could drop an
  adapter's null-valued extras; json_set only adds the named stash key when it is
  missing and never deletes another key.
- tail cursor keys are derived against the same resolved root the scan cursor
  uses, so a symlinked projects root no longer produces a key that misses the
  cursor entry and re-reads a file from the start (which had re-emitted history
  and inflated the catalog rollups).
- the tool_use-id resolver link is constrained to the op's structural child, so
  duplicate or forged tool_use ids in one source cannot cross-link.

Removes stale comments referencing the deleted from-zero re-read path and fixes a
spec self-contradiction (the parent Agent op carries the child native id only
when the meta is present at map time). Specs updated in lockstep.
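
The difference between the two merge strategies above can be shown on plain Go maps. This is a model of the semantics only, not the ingester's SQL: `mergePatchTop` imitates the null-deletes-key rule of merge patch (RFC 7396, which SQLite's `json_patch` implements) at the top level only, and `graftMissing` imitates "add the stash key only when it is missing". The flat key layout is an assumption.

```go
package main

import "fmt"

// mergePatchTop applies merge-patch semantics at the top level only: a null
// value in patch deletes that key from the result.
func mergePatchTop(target, patch map[string]any) map[string]any {
	out := map[string]any{}
	for k, v := range target {
		out[k] = v
	}
	for k, v := range patch {
		if v == nil {
			delete(out, k)
			continue
		}
		out[k] = v
	}
	return out
}

// graftMissing starts from the re-emitted extras and copies each stash key
// from prev only when next lacks it. No key is ever deleted.
func graftMissing(prev, next map[string]any, stashKeys []string) map[string]any {
	out := map[string]any{}
	for k, v := range next {
		out[k] = v
	}
	for _, k := range stashKeys {
		if _, ok := out[k]; ok {
			continue
		}
		if v, ok := prev[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	prev := map[string]any{"toolUseId": "t1"}
	next := map[string]any{"note": nil, "status": "ok"} // stash-free re-emit

	fmt.Println(mergePatchTop(prev, next))
	fmt.Println(graftMissing(prev, next, []string{"toolUseId", "childNativeId"}))
}
```

In the first result the null-valued `note` key is gone; in the second it survives and the stashed `toolUseId` is carried over.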
…xtras graft

- the late .meta.json repair (a catalog-safe SessionUpdated carrying the child's
  agent name and tool_use id) is now factored into one shared function called by
  BOTH scan and tail, so a meta first observed during scan (parent and child
  transcripts already consumed in a prior run) is repaired instead of being
  silently recorded as seen and then skipped by tail. Unifying the path stops
  scan and tail from drifting apart.
- the op extras graft, when the re-emitted op carries no extras, now keeps only
  the preserved aiViewer join-key stash rather than the entire previous extras
  blob, so a later stash-free re-emit no longer stale-preserves unrelated keys.

Specs updated in lockstep; the b/f goldens each gain one idempotent SessionUpdated
line from the unified scan-side repair.
…boundary gap

A test case used a real MCP namespace from the operator's environment that
embedded the operator's name; replaced it with a synthetic namespace. The secret
scanner had missed it because the operator-name rule matched on \b, and \b treats
'_' as a word character, so a name embedded after an underscore (e.g. an MCP
namespace "<tool>_<name>") never hit a word boundary. The name rule now uses a
non-alphanumeric token boundary on both sides ('_' is a delimiter), so such
embeddings are caught while a longer word that merely shares the name as a prefix
is not. Adds a regression case to the scanner self-test for the underscore form.
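
The boundary gap described above is easy to reproduce. The sketch below uses Go's `regexp` package for illustration only: the real scanner is a shell script whose regex dialect is not shown here, and `alice` is a synthetic stand-in for the operator name.

```go
package main

import (
	"fmt"
	"regexp"
)

const name = "alice" // synthetic stand-in, not a real operator name

var (
	// \b sees '_' as a word character, so "tool_alice" has no boundary
	// between '_' and 'a'.
	wordBoundary = regexp.MustCompile(`\b` + regexp.QuoteMeta(name) + `\b`)

	// Token boundary: start/end of input or any non-alphanumeric character,
	// which makes '_' a delimiter.
	tokenBoundary = regexp.MustCompile(
		`(?:^|[^A-Za-z0-9])` + regexp.QuoteMeta(name) + `(?:$|[^A-Za-z0-9])`)
)

func main() {
	for _, s := range []string{"mcp__tool_alice__run", "alicewonder", "by alice."} {
		fmt.Printf("%-22q word=%-5v token=%v\n",
			s, wordBoundary.MatchString(s), tokenBoundary.MatchString(s))
	}
}
```

Go's RE2 engine has no lookaround, so this version consumes the delimiter characters; that is fine for a yes/no match but would matter if the match span were used for redaction.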
…alize docs

- the meta/transcript/watch discovery walks (collectMetaPaths, markExistingDirty,
  addWatchTree) now surface a non-not-exist WalkDir error via onError instead of
  swallowing it, so an unreadable subtree, meta, or watch is visible in the
  Sources panel rather than silently missed. An absent directory is still not an
  error.
- spec: the extras graft preserves only the toolUseId and childNativeId stash
  keys; parentNativeId is re-derived and written fresh by SessionStarted on every
  re-emit, not grafted (wording corrected to match the code).
- spec: documented that the late-meta parent-op finalize has one ordering it does
  not cover (a meta observed only after its already-completed child was consumed),
  which the format does not produce — the .meta.json is written at subagent spawn
  and so predates child completion (verified on real data) — and the residual is a
  benign status lag that the next full scan finalizes.
…-10 docs

- the discovery chain no longer aborts the whole source scan on one broken entry:
  an unreadable project dir, a broken session subtree, or a relpath failure on a
  single transcript is surfaced via onError (visible in /api/health and the
  Sources panel) and skipped, so discovery continues with the rest. Only the
  configured projects root itself being unreadable stays fatal. This matches the
  fail-soft handling of the other discovery walks and the ingester's
  advance-past-the-bad-batch model — one bad file can no longer zero out
  ingestion.
- docs: broadened the late-meta finalize limitation to cover the live-tail
  spawn-race read-order case (an accepted, cosmetic, self-healing status lag, not
  data loss); corrected the scanner self-test comments to match the token-boundary
  rule; and replaced the upstream-source citation placeholders with the examined
  frozen-mirror commit.

Marks SOW-0003 completed and moves it to done/ after the adapter converged through
the external-review rounds (final round unanimous, no blockers). Plus the two
documentation nits from the final round:

- corrected the errLineTooLong comment, which still said the scanner "skips to
  EOF" on an oversized line — it drains just that line and continues (the behavior
  was already correct; only the comment was stale).
- clarified in the storage-layout section that the tool-results/ spill path is for
  oversized Bash/PowerShell tool OUTPUT (persistedOutputPath) and is distinct from
  a compact_file_reference, which points at the original project file (per §3.4).

The catalog-rollup idempotency-under-re-emission concern is tracked separately as
SOW-0020 (pending).

The shared aiViewer extras-graft expression is spliced into the sessions
and ops UPSERT SQL via string interpolation, which gosec flags as G202
(SQL string concatenation). The only interpolated value is
graftAiViewerExtras's output, built solely from a compile-time-constant
column literal and the package-const aiViewerStashKeys slice; no caller
or source-derived input ever reaches the SQL, so the finding is a
false positive. Scope the suppression to G202 with a justification so
the whole-repo `gosec -severity medium -confidence medium ./...` lint
gate passes.
@ktsaou ktsaou merged commit 7ead41a into master May 30, 2026
6 checks passed
@ktsaou ktsaou deleted the sow-0003-claude-code-adapter branch May 30, 2026 08:43