doctor: detect and repair DuckDB spans column-statistics corruption (#56) #58

Merged
anilmurty merged 2 commits into `main` from `fix/spans-stats-corruption-doctor` on May 12, 2026
@anilmurty
Contributor

Closes #56.

What was broken

DuckDB v1.5.x can leave the `spans` table's per-row-group min/max statistics out of sync with the actual data after certain bulk-write patterns. The equality fast-path then trusts the stale stats and skips every row group, so `WHERE trace_id = X` returns 0 rows even when the data is clearly there. `LIKE col || '%'` works because it forces a full scan that bypasses the bad stats.

User-visible symptom: clicking a trace from the dashboard's Traces list hangs on "Loading trace..." indefinitely. The trace-detail endpoint (`GET /api/v1/traces/{trace_id}`) runs `SELECT * FROM spans WHERE trace_id = $1`, gets 0 rows, and returns `{spans: [], span_count: 0}`. `CHECKPOINT` does not refresh the stats; only a full table copy does.

Fix

Two new helpers in `tokenjam/core/db.py`:

  • `check_spans_stats_corruption(conn) -> bool` — samples up to 3 distinct `trace_id` values and compares `=` vs `LIKE col || '%'`. Returns True iff any sample row exists via LIKE but is invisible via `=`. Safe on empty / missing tables.
  • `repair_spans_stats(conn) -> None` — idempotent rebuild (`CREATE TABLE _spans_repair AS SELECT * FROM spans → DROP → RENAME → CHECKPOINT`). Refreshes column statistics; every row preserved.
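The canary comparison described above can be sketched roughly as follows. This is an illustration, not the code in `tokenjam/core/db.py`: the function name `stats_canary_trips` and its parameters are hypothetical, and the demo runs against the standard library's `sqlite3` as a stand-in so it works without DuckDB installed (it assumes only a connection whose `execute()` returns a cursor with `fetchone()`/`fetchall()`, which DuckDB's Python connection also provides). The stand-in cannot reproduce the corruption itself, so only the healthy, empty, and missing-table paths are exercised.

```python
import sqlite3


def stats_canary_trips(conn, table="spans", col="trace_id", samples=3):
    """Return True if a sampled value is found via LIKE but not via '='."""
    try:
        # Sampling uses a plain scan, so it does not depend on the
        # equality fast-path that the stale statistics break.
        sampled = conn.execute(
            f"SELECT DISTINCT {col} FROM {table} "
            f"WHERE {col} IS NOT NULL LIMIT {int(samples)}"
        ).fetchall()
    except Exception:
        # Missing table (e.g. a pre-migration DB): report "not corrupt"
        # instead of crashing the caller.
        return False

    for (value,) in sampled:
        eq_row = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} = ?", (value,)
        ).fetchone()
        like_row = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} LIKE ? || '%'", (value,)
        ).fetchone()
        eq_count = eq_row[0] if eq_row else 0
        like_count = like_row[0] if like_row else 0
        if like_count > 0 and eq_count == 0:
            return True  # the row exists but '=' cannot see it
    return False


# Stand-in demo on sqlite3 (assumption: not the real DuckDB backend).
conn = sqlite3.connect(":memory:")
missing = stats_canary_trips(conn)  # table does not exist yet
conn.execute("CREATE TABLE spans (trace_id TEXT, name TEXT)")
empty = stats_canary_trips(conn)  # table exists, no rows
conn.executemany(
    "INSERT INTO spans VALUES (?, ?)",
    [("abc123", "root"), ("def456", "child")],
)
conn.commit()
healthy = stats_canary_trips(conn)  # '=' and LIKE agree
print(missing, empty, healthy)
```

Sampling only a few values keeps the check cheap enough to run on every `tj doctor` invocation, at the cost of possibly missing corruption that affects only unsampled row groups.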

Wired into `tj doctor`:

  • New check #9: "Spans column statistics". On detected corruption, surfaces a `warning` pointing at `tj doctor --repair`. On non-DuckDB backends / canary errors, emits an `info` line and exits cleanly (does not crash doctor).
  • New flag `--repair`: after running all checks, executes the repair action attached to any check that flagged one. Currently only the spans rebuild. Reports rows preserved before/after.
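One way to picture the `--repair` flow described above is a check result that optionally carries a repair callable, which the command runs only after every check has finished. The sketch below is a guess at the shape, not the code in this PR: `CheckResult`, `run_doctor`, and the message strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    name: str
    status: str  # "ok", "info", or "warning"
    message: str
    # Optional repair action; None means the check has nothing to fix.
    repair: Optional[Callable[[], str]] = None


def run_doctor(checks: List[Callable[[], CheckResult]], repair: bool = False) -> List[str]:
    """Run every check first, then (optionally) every attached repair."""
    results = [check() for check in checks]
    lines = [f"{r.status}: {r.name}: {r.message}" for r in results]
    if repair:
        for r in results:
            if r.repair is not None:
                lines.append(f"repaired: {r.name}: {r.repair()}")
    return lines


# Hypothetical checks standing in for the real doctor checks.
def config_check() -> CheckResult:
    return CheckResult("Config file", "ok", "Found and valid")


def spans_stats_check() -> CheckResult:
    return CheckResult(
        "Spans column statistics",
        "warning",
        "Corruption detected; run `tj doctor --repair`.",
        repair=lambda: "rebuilt spans table",
    )


plain = run_doctor([config_check, spans_stats_check])
with_repair = run_doctor([config_check, spans_stats_check], repair=True)
print("\n".join(with_repair))
```

Running repairs after all checks, rather than inline, keeps the default `tj doctor` run read-only and makes the destructive step something the user opts into explicitly.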

Both the check and the repair share the existing writable connection on `ctx.obj["db"]` rather than opening a second one — DuckDB rejects mixing read-only and read-write connections to the same file from the same process. (This was a real bug I hit while iterating live; first revision opened a second `read_only=True` connection and got `Connection Error: Can't open a connection to same database file with a different configuration`.)

Manual verification

```
$ tj doctor
✓ Config file: Found and valid: .tj/config.toml
✓ DuckDB writable: ...
✓ Ingest secret: ...
✓ Prometheus: ...
✓ Schema vs capture: ...
✓ Drift detection: ...
✓ Webhook security: ...
✓ Spans column statistics: Column statistics are consistent.

$ tj doctor --help
...
--repair Attempt to fix issues that have a known repair path (e.g. rebuild
the spans table when DuckDB column statistics are corrupt — see
issue #56).

$ tj doctor --json | python3 -m json.tool

# emits structured records including the new "Spans column statistics" entry

```

When the issue first appeared in my local DB, the workaround I ran manually was the same SQL that `repair_spans_stats` now wraps. The `trace_id =` query went from returning 0 rows to returning 14 immediately.
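For reference, the rebuild sequence named in the Fix section (`CREATE TABLE ... AS SELECT` → `DROP` → `RENAME` → `CHECKPOINT`) can be sketched as below. This is a minimal illustration under stated assumptions, not the body of `repair_spans_stats`: the function name `rebuild_table` and the `checkpoint_sql` parameter are hypothetical, and the demo uses the standard library's `sqlite3` as a stand-in so it runs without DuckDB. With DuckDB one would presumably pass `checkpoint_sql="CHECKPOINT"`.

```python
import sqlite3


def rebuild_table(conn, table="spans", checkpoint_sql=None):
    """Copy the table, swap the copy in, and return (rows_before, rows_after)."""
    tmp = f"_{table}_repair"
    before_row = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    before = before_row[0] if before_row else 0

    # Clear any leftover copy from an interrupted earlier run, so the
    # rebuild can be repeated safely.
    conn.execute(f"DROP TABLE IF EXISTS {tmp}")
    conn.execute(f"CREATE TABLE {tmp} AS SELECT * FROM {table}")
    conn.execute(f"DROP TABLE {table}")
    conn.execute(f"ALTER TABLE {tmp} RENAME TO {table}")
    if checkpoint_sql:
        conn.execute(checkpoint_sql)  # e.g. "CHECKPOINT" on DuckDB (assumption)

    after_row = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    after = after_row[0] if after_row else 0
    return before, after


# Stand-in demo on sqlite3 (assumption: not the real DuckDB backend).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (trace_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO spans VALUES (?, ?)",
    [("abc123", "root"), ("def456", "child")],
)
conn.commit()

data_before = conn.execute("SELECT * FROM spans ORDER BY trace_id").fetchall()
first = rebuild_table(conn)
second = rebuild_table(conn)  # repeating the rebuild changes nothing
data_after = conn.execute("SELECT * FROM spans ORDER BY trace_id").fetchall()
print(first, second, data_before == data_after)
```

One caveat worth keeping in mind: `CREATE TABLE ... AS SELECT` copies data and column types but not constraints or indexes, so a production rebuild would need to recreate any that `spans` relies on. Whether the helper in this PR does so is not stated in the description.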

Test plan

  • 7 new unit tests in `tests/unit/test_spans_stats_repair.py` covering the contract:
    • `check_spans_stats_corruption`: empty table → False, healthy table → False, missing table → False (no crash on pre-migration DBs)
    • `repair_spans_stats`: preserves row count, preserves exact data, idempotent on empty table, idempotent across repeated calls
  • `pytest tests/unit/ tests/synthetic/ tests/agents/ tests/integration/` → 416 passed (was 409, +7 new)
  • `ruff check tokenjam/` → clean
  • Live `tj doctor` and `tj doctor --repair` against my (now-healthy) DB — both run cleanly

What's intentionally NOT in this PR

  • Auto-repair on `tj serve` startup — destructive operation, prefer explicit `--repair`.
  • DuckDB version pin bump — issue may or may not be fixed upstream; doctor-based detection works regardless.
  • Upstream report to DuckDB — worth doing as a separate follow-up once we can reproduce the corruption synthetically.

🤖 Filed via Claude Code

anilmurty and others added 2 commits May 12, 2026 10:03

DuckDB v1.5.x can leave the spans table's per-row-group min/max statistics
out of sync with the actual data after certain bulk-write patterns. Once
that happens, `WHERE trace_id = X` returns 0 rows (the equality fast-path
trusts the bad stats and skips every row group), even though the data is
clearly there — `WHERE trace_id LIKE X || '%'` still finds it. The user-
visible symptom is the dashboard hanging on "Loading trace..." when you
click a trace from the list. See issue #56.

`CHECKPOINT` does not refresh the stats; only a full table copy does.

This change:

- Adds `check_spans_stats_corruption(conn)` to `tokenjam/core/db.py`. Samples
  up to 3 distinct trace_ids and compares `=` vs `LIKE col || '%'`. Returns
  True iff any sample row exists via LIKE but is invisible via =.

- Adds `repair_spans_stats(conn)` to the same module. Idempotent table
  rebuild (CREATE AS SELECT → DROP → RENAME → CHECKPOINT) that refreshes
  column statistics while preserving every row.

- Adds check #9 to `tj doctor`: "Spans column statistics". When stats are
  corrupt the check emits a warning pointing at `tj doctor --repair`. When
  the DB is non-DuckDB, locked, or the canary errors, the check yields an
  "info" line and exits cleanly (it does not crash doctor).

- Adds `tj doctor --repair`. After running all checks, executes the repair
  action attached to any check that flagged one — currently only the spans
  rebuild. Reports rows preserved before/after so the user can verify.

Both the check and the repair share the existing writable DuckDB connection
from `ctx.obj["db"]` rather than opening a second one — DuckDB rejects
mixing read-only and read-write connections to the same file from the same
process.

7 new unit tests cover the contract: healthy table returns False, empty
table returns False, missing table returns False, repair preserves row
count, repair preserves data, repair is idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI was failing on:
  tokenjam/core/db.py:310: error: Value of type "tuple[Any, ...] | None" is not indexable
  tokenjam/core/db.py:313: error: Value of type "tuple[Any, ...] | None" is not indexable

Same issue in cmd_doctor.py's row-count verification.

COUNT(*) always returns exactly one row in practice, but mypy can't infer
that from DuckDB's typing. Hold the row in a temp and index defensively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
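The mypy fix described in the second commit ("hold the row in a temp and index defensively") follows a common pattern for `fetchone()`, whose return type is optional. A minimal sketch, with a hypothetical helper name `count_rows` and `sqlite3` standing in for DuckDB so the example runs on the standard library alone:

```python
import sqlite3
from typing import Any, Optional, Tuple


def count_rows(conn: Any, table: str) -> int:
    """Return COUNT(*) for a table without indexing an Optional directly."""
    # fetchone() is typed as "tuple | None", so hold the row in a temp
    # and narrow it before indexing; this is what satisfies mypy.
    row: Optional[Tuple[Any, ...]] = conn.execute(
        f"SELECT COUNT(*) FROM {table}"
    ).fetchone()
    if row is None:
        return 0
    return int(row[0])


# Stand-in demo on sqlite3 (assumption: not the real DuckDB backend).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (trace_id TEXT)")
empty_count = count_rows(conn, "spans")
conn.executemany("INSERT INTO spans VALUES (?)", [("a",), ("b",), ("c",)])
conn.commit()
filled_count = count_rows(conn, "spans")
print(empty_count, filled_count)
```

The `None` branch should be unreachable for `COUNT(*)`, but handling it explicitly is cheaper than a `type: ignore` and keeps the row-count report in `--repair` from raising on an unexpected driver response.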
@anilmurty anilmurty merged commit 6d09025 into main May 12, 2026
4 checks passed
@anilmurty anilmurty deleted the fix/spans-stats-corruption-doctor branch May 12, 2026 17:22
Development

Successfully merging this pull request may close these issues.

DuckDB v1.5.2: spans column statistics corrupt after demo/synthetic-span writes — equality predicates return 0 rows
