Skip to content

fix(ingest): op duration_us from persisted start_ts (+ migration 0005 backfill)#31

Merged
ktsaou merged 5 commits into
masterfrom
sow-0026-op-duration
May 31, 2026
Merged

fix(ingest): op duration_us from persisted start_ts (+ migration 0005 backfill)#31
ktsaou merged 5 commits into
masterfrom
sow-0026-op-duration

Conversation

@ktsaou
Copy link
Copy Markdown
Member

@ktsaou ktsaou commented May 31, 2026

Problem

ops.duration_us was persisted as 0 for every op. The ingest writer computed it as OpFinalizedEvent.EndTs - OpFinalizedEvent.Ts, but a finalize event sorts after its OpStarted (events are ordered by Ts), so Finalized.Ts is the op end (== EndTs) and the subtraction collapsed to ~0 for every spec-conformant adapter.

Impact: flattened the session-detail Duration column, the topology metric=duration node sizing, and the catalog_*.total_duration_us rollups. stats.go was unaffected (it already derives end_ts - start_ts directly), which is why two surfaces silently disagreed. data-model.md already specified end_ts - start_ts, so this was code-vs-spec drift.

The bug was masked because test fixtures set Finalized.Ts == OpStarted.Ts (the start), making EndTs - Ts coincidentally equal the real duration in tests only.

Fix

  • internal/ingest/writer.go applyOpFinalized: derive duration_us = EndTs - persisted start_ts; leave NULL for an orphan finalize (no recorded start). The existing op lookup is hoisted out of the pricing branch so start_ts is read once for both pricing and duration.
  • internal/store/migrations/0005_op_duration_backfill.sql (data-only, schema-version-neutral): backfill historical ops.duration_us and recompute catalog_models / catalog_tools.total_duration_us from the corrected values. Migrations are tracked by filename in _schema_migrations, so schema_meta.version stays 4 (no schema shape change) and the serve binary's exact-equality CheckSchema still passes.
  • Test fixtures made spec-conformant (Finalized.Ts = end); new tests pin the start_ts-derived duration, the orphan-finalize NULL case, and the migration backfill + rollup recompute.
  • Specs: data-model.md + ingester.md document the duration source and the migration.

Verification

go test -race (ingest + store) green; coverage 88.5% / 90.9%; gofmt/vet/golangci-lint/gosec clean; secret + attribution scans clean.

SOW: .agents/sow/current/SOW-0026-20260531-op-duration-from-start-ts.md

ktsaou added 5 commits May 31, 2026 06:05
…lize Ts

ops.duration_us was computed as OpFinalizedEvent.EndTs - OpFinalizedEvent.Ts.
A finalize event sorts after its OpStarted (events are ordered by Ts), so
Finalized.Ts is the op end (== EndTs) and the subtraction collapsed to ~0 for
every spec-conformant adapter, persisting duration_us = 0 everywhere. This
flattened the session-detail Duration column, the topology duration metric, and
the catalog_*.total_duration_us rollups. stats.go was unaffected because it
already derives end_ts - start_ts directly.

Derive duration from the persisted start_ts (the matching OpStarted's Ts) at
finalize; leave NULL for an orphan finalize with no recorded start. Hoist the
existing op lookup out of the pricing branch so start_ts is read once for both
pricing and duration.

Add migration 0005 (data-only, schema-version-neutral): backfill
ops.duration_us = end_ts - start_ts for historical rows and recompute
catalog_models / catalog_tools.total_duration_us from the corrected values.
Migrations are tracked by filename in _schema_migrations, so the marker
schema_meta.version stays 4 (no schema shape change), keeping the serve
binary's exact-equality CheckSchema satisfied.

The bug was masked by test fixtures that set Finalized.Ts == OpStarted.Ts (the
start); make them spec-conformant (Finalized.Ts = end) and add tests pinning
the start_ts-derived duration, the orphan-finalize NULL case, and the migration
backfill + rollup recompute.

Specs: data-model.md and ingester.md document the duration source and the
migration. SOW-0026.
…n recompute

Address review findings on the op-duration fix:

- end_ts and duration_us are written from a single validity gate via COALESCE,
  so a corrective re-finalize carrying EndTs=0 (or a clock-skewed EndTs <
  start_ts) preserves the previously recorded good end_ts and duration instead
  of clobbering end_ts to NULL while duration stayed set. Tests now assert
  end_ts preservation and cover the clock-skew re-finalize case.
- Migration 0005 catalog_models recompute mirrors the live path exactly: drop
  the always-true `o.name IS NOT NULL` and add explicit non-empty provider/model
  guards (no behavioral change; the join to non-empty PKs already enforced it).
- Add a migration-vs-live parity test: the 0005 recompute reproduces the totals
  the live incremental path accumulates for the same event set.
- Make the remaining ingest test fixtures spec-conformant (Finalized.Ts = end)
  so the old masking pattern is gone from the suite.
- Docs/comments: record migration 0005 in the data-model migration history;
  clarify the presenter SchemaVersion comment (schema-shape vs data-only
  migrations); correct the statsTotals comment to state it sums session
  end_ts - start_ts, distinct from ops.duration_us.

SOW-0026.
…ated DB

ai-viewer-serve runs no migrations and gates startup solely on
schema_meta.version (CheckSchema, exact-equality). With migration 0005 left
version-neutral, a pre-0005 v4 store passed serve startup and could hand out the
stale duration_us=0 rows 0005 repairs. Bump schema_meta.version to '5' and
presenter.SchemaVersion to 5 so serve refuses an un-migrated store via the
existing mechanism (a version bump has no re-ingest side effect; the ingester
never reads schema_meta). Every migration now bumps the version in lockstep —
no version-neutral special case.

- 0005 also tightens the backfill WHERE to start_ts > 0 AND end_ts > 0, mirroring
  the writer's finalize gate exactly.
- Strengthen the skew migration test to assert the exact unchanged duration
  (skew rows stay 0), not merely non-negative.
- embed-smoke.sh waits for 0005_op_duration_backfill.sql (the version-setting
  migration); sweep all schema-version assertions to 5.
- Document the duration-computed-at-finalize contract: the op-insert
  start_ts = MIN(...) can lower start_ts on a later re-emit without recomputing
  duration, so adapters must emit the authoritative start at/before finalize
  (real adapters do). SOW-0027 tracks a defensive recompute / generated-column.

SOW-0026.
Migration 0005 bumped schema_meta.version to 5, but the public spec examples
still showed 4: the /api/health samples in rest-api.md and observability.md and
the schema_meta example in data-model.md. Update all three to 5 to match
presenter.SchemaVersion and the emitted health payload.

Also narrow the "every migration bumps the version" wording (presenter.go, the
0005 SQL header, the 0004 test comment, data-model.md): only migrations whose
outcome serve reads or validates bump schema_meta.version in lockstep with the
binary; 0002_source_progress.sql is ingester-only and stays version-neutral.
Comment/doc only — no logic change.

SOW-0026.
- presenter.md: the startup schema check refuses ANY version mismatch (!=),
  not only version > supported — SOW-0026 relies on a v5 binary rejecting an
  older pre-0005 (v4) store so it never serves stale durations.
- embed-smoke.sh: drop the "every migration bumps the version" overstatement
  (0002_source_progress is version-neutral); the latest serve-relevant
  migration sets the version this binary expects.
- Close SOW-0026 (op duration_us from persisted start_ts): Status completed,
  four external-review rounds converged, moved to done/.

SOW-0026.
@ktsaou ktsaou merged commit 5d64791 into master May 31, 2026
6 checks passed
@ktsaou ktsaou deleted the sow-0026-op-duration branch May 31, 2026 04:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant