doctor: detect and repair DuckDB spans column-statistics corruption (#56) by anilmurty · Pull Request #58 · Metabuilder-Labs/tokenjam

anilmurty · 2026-05-12T17:03:29Z

Closes #56.

What was broken

DuckDB v1.5.x can leave the `spans` table's per-row-group min/max statistics out of sync with the actual data after certain bulk-write patterns. The equality fast-path then trusts the stale stats and skips every row group, so `WHERE trace_id = X` returns 0 rows even when the data is clearly there. `LIKE col || '%'` works because it forces a full scan that bypasses the bad stats.

User-visible symptom: clicking a trace from the dashboard's Traces list hangs on "Loading trace..." indefinitely. The trace-detail endpoint (`GET /api/v1/traces/{trace_id}`) runs `SELECT * FROM spans WHERE trace_id = $1`, gets 0 rows, and returns `{spans: [], span_count: 0}`. `CHECKPOINT` does not refresh the stats; only a full table copy does.

Fix

Two new helpers in `tokenjam/core/db.py`:

- `check_spans_stats_corruption(conn) -> bool`: samples up to 3 distinct `trace_id` values and compares `=` vs `LIKE col || '%'`. Returns True iff any sampled row exists via LIKE but is invisible via `=`. Safe on empty / missing tables.
- `repair_spans_stats(conn) -> None`: idempotent rebuild (`CREATE TABLE _spans_repair AS SELECT * FROM spans` → `DROP` → `RENAME` → `CHECKPOINT`). Refreshes column statistics; every row is preserved.
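For reviewers who want the shape of the two helpers without opening the diff, here is a minimal sketch of the contract described above. It is not the code in `tokenjam/core/db.py`: it is written against any DB-API style connection whose `execute()` returns a cursor, the stdlib `sqlite3` module stands in for DuckDB so the snippet runs anywhere, and the `sample_size` and `checkpoint` parameters and the `_count` helper are assumptions added for illustration.

```python
import sqlite3  # stand-in driver for this sketch only; the PR targets DuckDB


def _count(conn, sql, trace_id):
    """Run a COUNT(*) query and return the count, tolerating a None row."""
    row = conn.execute(sql, (trace_id,)).fetchone()
    return int(row[0]) if row is not None else 0


def check_spans_stats_corruption(conn, sample_size=3):
    """Return True if any sampled trace_id is visible via LIKE but not via '='."""
    try:
        samples = conn.execute(
            "SELECT DISTINCT trace_id FROM spans "
            f"WHERE trace_id IS NOT NULL LIMIT {int(sample_size)}"
        ).fetchall()
    except Exception:
        # Assumption: a missing table (pre-migration DB) counts as "not corrupt".
        # The broad except keeps the sketch driver-agnostic.
        return False
    for (trace_id,) in samples:
        eq = _count(conn, "SELECT COUNT(*) FROM spans WHERE trace_id = ?", trace_id)
        like = _count(
            conn, "SELECT COUNT(*) FROM spans WHERE trace_id LIKE ? || '%'", trace_id
        )
        if like > 0 and eq == 0:
            return True
    return False


def repair_spans_stats(conn, checkpoint=True):
    """Rebuild spans through a full table copy: CREATE AS SELECT, DROP, RENAME."""
    # Dropping a leftover scratch table makes repeated calls safe.
    conn.execute("DROP TABLE IF EXISTS _spans_repair")
    conn.execute("CREATE TABLE _spans_repair AS SELECT * FROM spans")
    conn.execute("DROP TABLE spans")
    conn.execute("ALTER TABLE _spans_repair RENAME TO spans")
    if checkpoint:
        # CHECKPOINT is a DuckDB statement; pass checkpoint=False on sqlite3.
        conn.execute("CHECKPOINT")
```

One caveat of any `CREATE TABLE ... AS SELECT` rebuild is that it copies rows and column types but not constraints or indexes, so a schema that relies on those would need to recreate them after the rename.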

Wired into `tj doctor`:

- New check #9: "Spans column statistics". On detected corruption, surfaces a `warning` pointing at `tj doctor --repair`. On non-DuckDB backends or canary errors, emits an `info` line and exits cleanly (does not crash doctor).
- New flag `--repair`: after running all checks, executes the repair action attached to any check that flagged one. Currently that is only the spans rebuild. Reports row counts before and after the rebuild so the user can verify that every row was preserved.
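The "repair action attached to a check" idea can be sketched as follows. This is an illustration of the pattern only, not the code in `cmd_doctor.py`: `CheckResult`, `run_doctor`, the status strings, and the report format are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    """Outcome of one doctor check (hypothetical structure)."""
    name: str
    status: str  # "ok", "info", or "warning"
    message: str
    # Optional repair action; returns a one-line summary for the report.
    repair: Optional[Callable[[], str]] = None


def run_doctor(checks: List[Callable[[], CheckResult]], repair: bool = False) -> List[str]:
    """Run every check first, then run attached repairs only if requested."""
    results = [check() for check in checks]
    report = [f"{r.status}: {r.name}: {r.message}" for r in results]
    if repair:
        for r in results:
            if r.repair is not None:
                report.append(f"repaired: {r.name}: {r.repair()}")
    return report
```

Running all checks before any repair keeps the plain `tj doctor` output identical with and without the flag, and a check that finds nothing wrong simply leaves `repair` unset.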

Both the check and the repair share the existing writable connection on `ctx.obj["db"]` rather than opening a second one — DuckDB rejects mixing read-only and read-write connections to the same file from the same process. (This was a real bug I hit while iterating live; first revision opened a second `read_only=True` connection and got `Connection Error: Can't open a connection to same database file with a different configuration`.)

Manual verification

```
$ tj doctor
✓ Config file: Found and valid: .tj/config.toml
✓ DuckDB writable: ...
✓ Ingest secret: ...
✓ Prometheus: ...
✓ Schema vs capture: ...
✓ Drift detection: ...
✓ Webhook security: ...
✓ Spans column statistics: Column statistics are consistent.

$ tj doctor --help
...
--repair Attempt to fix issues that have a known repair path (e.g. rebuild
the spans table when DuckDB column statistics are corrupt — see
issue #56).

$ tj doctor --json | python3 -m json.tool
# emits structured records including the new "Spans column statistics" entry
```

When the issue first appeared in my local DB, the workaround I ran manually was the same SQL that `repair_spans_stats` now wraps. Immediately afterwards, the `=` query went from returning 0 rows to returning 14.

Test plan

7 new unit tests in `tests/unit/test_spans_stats_repair.py` covering the contract:
- `check_spans_stats_corruption`: empty table → False, healthy table → False, missing table → False (no crash on pre-migration DBs)
- `repair_spans_stats`: preserves row count, preserves exact data, idempotent on empty table, idempotent across repeated calls
- `pytest tests/unit/ tests/synthetic/ tests/agents/ tests/integration/` → 416 passed (was 409, +7 new)
- `ruff check tokenjam/` → clean
- Live `tj doctor` and `tj doctor --repair` against my (now-healthy) DB: both run cleanly

What's intentionally NOT in this PR

- Auto-repair on `tj serve` startup: the rebuild is a destructive operation, so an explicit `--repair` is preferred.
- DuckDB version pin bump: the issue may or may not be fixed upstream; doctor-based detection works regardless.
- Upstream report to DuckDB: worth doing as a separate follow-up once we can reproduce the corruption synthetically.

🤖 Filed via Claude Code

DuckDB v1.5.x can leave the spans table's per-row-group min/max statistics out of sync with the actual data after certain bulk-write patterns. Once that happens, `WHERE trace_id = X` returns 0 rows (the equality fast-path trusts the bad stats and skips every row group), even though the data is clearly there; `WHERE trace_id LIKE X || '%'` still finds it. The user-visible symptom is the dashboard hanging on "Loading trace..." when you click a trace from the list. See issue #56.

`CHECKPOINT` does not refresh the stats; only a full table copy does.

This change:

- Adds `check_spans_stats_corruption(conn)` to `tokenjam/core/db.py`. Samples up to 3 distinct trace_ids and compares `=` vs `LIKE col || '%'`. Returns True iff any sample row exists via LIKE but is invisible via `=`.
- Adds `repair_spans_stats(conn)` to the same module. Idempotent table rebuild (CREATE AS SELECT → DROP → RENAME → CHECKPOINT) that refreshes column statistics while preserving every row.
- Adds check #9 to `tj doctor`: "Spans column statistics". When stats are corrupt the check emits a warning pointing at `tj doctor --repair`. When the DB is non-DuckDB, locked, or the canary errors, the check yields an "info" line and exits cleanly (it does not crash doctor).
- Adds `tj doctor --repair`. After running all checks, executes the repair action attached to any check that flagged one (currently only the spans rebuild). Reports rows preserved before/after so the user can verify.

Both the check and the repair share the existing writable DuckDB connection from `ctx.obj["db"]` rather than opening a second one, because DuckDB rejects mixing read-only and read-write connections to the same file from the same process.

7 new unit tests cover the contract: healthy table returns False, empty table returns False, missing table returns False, repair preserves row count, repair preserves data, repair is idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI was failing on:

    tokenjam/core/db.py:310: error: Value of type "tuple[Any, ...] | None" is not indexable
    tokenjam/core/db.py:313: error: Value of type "tuple[Any, ...] | None" is not indexable

Same issue in cmd_doctor.py's row-count verification. COUNT(*) always returns exactly one row in practice, but mypy can't infer that from DuckDB's typing. Hold the row in a temp and index defensively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
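The "hold the row in a temp and index defensively" pattern looks roughly like this. It is a sketch rather than the exact lines at `db.py:310`: the `count_spans` helper is a hypothetical name, and `sqlite3` is used only so the snippet runs without DuckDB installed (its `fetchone()` is likewise typed as possibly returning `None`).

```python
import sqlite3  # stand-in driver; the same pattern applies to a DuckDB connection
from typing import Any, Optional, Tuple


def count_spans(conn: sqlite3.Connection) -> int:
    """Return the number of rows in spans, narrowing fetchone()'s Optional result."""
    # fetchone() is typed as possibly None, so bind it to a name first.
    row: Optional[Tuple[Any, ...]] = conn.execute("SELECT COUNT(*) FROM spans").fetchone()
    if row is None:
        # Not expected for COUNT(*), but this branch is what lets mypy narrow the type.
        return 0
    return int(row[0])
```

Binding the result to a local name matters because mypy narrows the type of a variable after a `None` check, but it cannot narrow the result of a repeated call expression.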

anilmurty and others added 2 commits May 12, 2026 10:03

anilmurty merged commit 6d09025 into main May 12, 2026
4 checks passed

anilmurty deleted the fix/spans-stats-corruption-doctor branch May 12, 2026 17:22
