doctor: detect and repair DuckDB spans column-statistics corruption (#56)#58
Merged
Conversation
DuckDB v1.5.x can leave the spans table's per-row-group min/max statistics out of sync with the actual data after certain bulk-write patterns. Once that happens, `WHERE trace_id = X` returns 0 rows (the equality fast-path trusts the bad stats and skips every row group), even though the data is clearly there — `WHERE trace_id LIKE X || '%'` still finds it. The user-visible symptom is the dashboard hanging on "Loading trace..." when you click a trace from the list. See issue #56. `CHECKPOINT` does not refresh the stats; only a full table copy does.

This change:

- Adds `check_spans_stats_corruption(conn)` to `tokenjam/core/db.py`. Samples up to 3 distinct trace_ids and compares `=` vs `LIKE col || '%'`. Returns True iff any sampled row exists via LIKE but is invisible via `=`.
- Adds `repair_spans_stats(conn)` to the same module. Idempotent table rebuild (CREATE AS SELECT → DROP → RENAME → CHECKPOINT) that refreshes column statistics while preserving every row.
- Adds check #9 to `tj doctor`: "Spans column statistics". When stats are corrupt the check emits a warning pointing at `tj doctor --repair`. When the DB is non-DuckDB, locked, or the canary errors, the check yields an "info" line and exits cleanly (it does not crash doctor).
- Adds `tj doctor --repair`. After running all checks, executes the repair action attached to any check that flagged one — currently only the spans rebuild. Reports rows preserved before/after so the user can verify.

Both the check and the repair share the existing writable DuckDB connection from `ctx.obj["db"]` rather than opening a second one — DuckDB rejects mixing read-only and read-write connections to the same file from the same process.

7 new unit tests cover the contract: healthy table returns False, empty table returns False, missing table returns False, repair preserves row count, repair preserves data, repair is idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI was failing on:

```
tokenjam/core/db.py:310: error: Value of type "tuple[Any, ...] | None" is not indexable
tokenjam/core/db.py:313: error: Value of type "tuple[Any, ...] | None" is not indexable
```

Same issue in cmd_doctor.py's row-count verification. `COUNT(*)` always returns exactly one row in practice, but mypy can't infer that from DuckDB's typing. Hold the row in a temporary variable and index it defensively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
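The defensive pattern the fix describes can be sketched like this. It is demonstrated against stdlib `sqlite3` for portability (the real code targets DuckDB), and the `spans` table and `count_spans` helper here are illustrative stand-ins, not the repo's actual code:

```python
import sqlite3

def count_spans(conn: sqlite3.Connection) -> int:
    """COUNT(*) yields exactly one row at runtime, but the driver's type
    stubs say fetchone() may return None. Holding the row in a temporary
    and guarding the index satisfies mypy without a cast or ignore."""
    row = conn.execute("SELECT COUNT(*) FROM spans").fetchone()
    return int(row[0]) if row is not None else 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (trace_id TEXT)")
conn.executemany("INSERT INTO spans VALUES (?)", [("t1",), ("t2",)])
print(count_spans(conn))  # 2
```

The alternative, chaining `.fetchone()[0]`, is what tripped the `"tuple[Any, ...] | None" is not indexable` error above.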
Closes #56.
What was broken
DuckDB v1.5.x can leave the `spans` table's per-row-group min/max statistics out of sync with the actual data after certain bulk-write patterns. The equality fast-path then trusts the stale stats and skips every row group, so `WHERE trace_id = X` returns 0 rows even when the data is clearly there. `LIKE col || '%'` works because it forces a full scan that bypasses the bad stats.
User-visible symptom: clicking a trace from the dashboard's Traces list hangs on "Loading trace..." indefinitely. The trace-detail endpoint (`GET /api/v1/traces/{trace_id}`) runs `SELECT * FROM spans WHERE trace_id = $1`, gets 0 rows, and returns `{spans: [], span_count: 0}`. `CHECKPOINT` does not refresh the stats; only a full table copy does.
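The diagnostic at the heart of the check is the `=` vs `LIKE` prefix comparison described above. A minimal sketch of that canary, under the assumption that the real `check_spans_stats_corruption` follows this shape (shown against `sqlite3`, which does not exhibit the DuckDB stats bug, so a healthy table returns False):

```python
import sqlite3

def check_spans_stats_corruption(conn, sample_size: int = 3) -> bool:
    """Return True iff some sampled trace_id is visible via a LIKE prefix
    scan but invisible via equality -- the signature of stale per-row-group
    min/max statistics. Errors (e.g. missing table) report healthy."""
    try:
        samples = conn.execute(
            "SELECT DISTINCT trace_id FROM spans LIMIT ?", (sample_size,)
        ).fetchall()
    except Exception:
        return False  # non-existent table etc.: don't crash the caller
    for (trace_id,) in samples:
        eq = conn.execute(
            "SELECT COUNT(*) FROM spans WHERE trace_id = ?", (trace_id,)
        ).fetchone()
        like = conn.execute(
            "SELECT COUNT(*) FROM spans WHERE trace_id LIKE ? || '%'",
            (trace_id,),
        ).fetchone()
        # LIKE forces a full scan; equality trusts the (possibly stale) stats.
        if like and like[0] > 0 and eq and eq[0] == 0:
            return True
    return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (trace_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO spans VALUES (?, ?)",
    [("t1", "root"), ("t1", "child"), ("t2", "root")],
)
print(check_spans_stats_corruption(conn))  # False on a healthy table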
Fix
Two new helpers in `tokenjam/core/db.py`:

- `check_spans_stats_corruption(conn)`: samples up to 3 distinct trace_ids and compares `=` against `LIKE col || '%'`; returns True iff any sampled row is visible via LIKE but invisible via `=`.
- `repair_spans_stats(conn)`: idempotent table rebuild (CREATE AS SELECT → DROP → RENAME → CHECKPOINT) that refreshes column statistics while preserving every row.
Wired into `tj doctor`:

- Check #9, "Spans column statistics". When stats are corrupt it emits a warning pointing at `tj doctor --repair`; when the DB is non-DuckDB, locked, or the canary errors, it yields an "info" line and exits cleanly rather than crashing doctor.
- A `--repair` flag. After running all checks, it executes the repair action attached to any check that flagged one (currently only the spans rebuild) and reports row counts before and after.
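The rebuild behind `--repair` can be sketched as follows. This is an assumption about the shape of `repair_spans_stats`, again demonstrated against `sqlite3`; `CHECKPOINT` is DuckDB-specific, so the sketch attempts it and skips engines that lack it:

```python
import sqlite3

def repair_spans_stats(conn) -> tuple[int, int]:
    """Rebuild the spans table via a full copy (CREATE AS SELECT -> DROP
    -> RENAME), which forces DuckDB to recompute column statistics.
    Returns (rows_before, rows_after) so the caller can verify no loss."""
    before = conn.execute("SELECT COUNT(*) FROM spans").fetchone()[0]
    conn.execute("DROP TABLE IF EXISTS spans_rebuilt")
    conn.execute("CREATE TABLE spans_rebuilt AS SELECT * FROM spans")
    conn.execute("DROP TABLE spans")
    conn.execute("ALTER TABLE spans_rebuilt RENAME TO spans")
    try:
        conn.execute("CHECKPOINT")  # DuckDB: flush and persist fresh stats
    except Exception:
        pass  # engines without CHECKPOINT (e.g. sqlite in this sketch)
    after = conn.execute("SELECT COUNT(*) FROM spans").fetchone()[0]
    return before, after
```

Because the rebuild is a plain copy-and-swap, running it on an already healthy table is a no-op apart from the copy cost, which is what makes it safe to expose behind a generic `--repair` flag.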
Both the check and the repair share the existing writable connection on `ctx.obj["db"]` rather than opening a second one — DuckDB rejects mixing read-only and read-write connections to the same file from the same process. (This was a real bug I hit while iterating live; first revision opened a second `read_only=True` connection and got `Connection Error: Can't open a connection to same database file with a different configuration`.)
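One plausible way to attach repair actions to checks so that a single `--repair` pass can pick them up is a result object carrying an optional callable. This is a hedged sketch, not the repo's actual doctor structure, and every name in it (`CheckResult`, `run_doctor`) is hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class CheckResult:
    name: str
    ok: bool
    message: str
    # Optional repair action; run only under --repair when the check failed.
    repair: Optional[Callable[[], str]] = None

def run_doctor(checks: List[Callable[[], CheckResult]],
               repair_mode: bool = False) -> List[str]:
    """Run every check, render a status line per check, and in repair
    mode invoke the repair action of any check that flagged a problem."""
    lines = []
    for check in checks:
        result = check()
        mark = "\u2713" if result.ok else "\u26a0"
        lines.append(f"{mark} {result.name}: {result.message}")
        if repair_mode and not result.ok and result.repair is not None:
            lines.append(f"  repaired: {result.repair()}")
    return lines
```

With this shape, the spans check simply returns a failing `CheckResult` whose `repair` closes over the shared writable connection, so the repair pass never needs to open a second handle.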
Manual verification
```
$ tj doctor
✓ Config file: Found and valid: .tj/config.toml
✓ DuckDB writable: ...
✓ Ingest secret: ...
✓ Prometheus: ...
✓ Schema vs capture: ...
✓ Drift detection: ...
✓ Webhook security: ...
✓ Spans column statistics: Column statistics are consistent.
$ tj doctor --help
...
--repair Attempt to fix issues that have a known repair path (e.g. rebuild
the spans table when DuckDB column statistics are corrupt — see
issue #56).
$ tj doctor --json | python3 -m json.tool
emits structured records including the new "Spans column statistics" entry
```
When the issue first appeared in my local DB, the workaround I ran manually was the same SQL that `repair_spans_stats` now wraps — the `=` query went from returning 0 rows to returning all 14 immediately.
Test plan

7 new unit tests cover the contract: healthy table returns False, empty table returns False, missing table returns False, repair preserves row count, repair preserves data, repair is idempotent.
What's intentionally NOT in this PR
🤖 Filed via Claude Code