State Store: Compact pruned key range after each prune#3675
Conversation
The State Store (pebbledb) prune scans the entire DB on every run. Deleted
keys linger as un-compacted tombstones, so each scan reads through more dead
data the longer a node stays up — prune latency creeps up and head-lag grows
(a restart temporarily relieves it because reopening triggers compaction).
Two changes address this:
- Raise Pebble's compaction concurrency from the default {1,1} to {1,4} so a
single compactor can keep up with the tombstone churn pruning generates.
- After each prune, compact only the key span that was actually deleted (and
skip compaction entirely when the prune deleted nothing), reclaiming the
tombstoned space immediately instead of letting it accumulate.
Applied to both the descending (default for new DBs) and ascending (legacy)
prune paths. Adds unit tests covering the range compaction, the skip guard,
and the single-key inclusive-bound edge case.
STO-602
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR SummaryMedium Risk Overview Pebble open options now allow 1–4 parallel compactions ( After descending and ascending prune paths finish deletes and update earliest version, they call new
Reviewed by Cursor Bugbot for commit 6feb36d. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3675 +/- ##
==========================================
- Coverage 58.97% 58.07% -0.91%
==========================================
Files 2263 2179 -84
Lines 187223 177592 -9631
==========================================
- Hits 110421 103129 -7292
+ Misses 66858 65304 -1554
+ Partials 9944 9159 -785
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Adds a synchronous compaction of the pruned key range after each prune pass (both descending and ascending paths) and raises Pebble's compaction concurrency to {1,4}. The logic is correct and well-tested; the only concerns are non-blocking performance trade-offs around running a synchronous compaction while the prune iterator is still open and over a range that can span the whole keyspace.
Findings: 0 blocking | 5 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Both Codex and Cursor produced no material findings (Codex additionally noted it could not run the focused tests because Go 1.25.6 was unavailable in its sandbox).
- compactPrunedRange runs a synchronous, blocking Pebble Compact on every prune that deletes anything. Since the compacted span runs from the smallest to the largest deleted key across ALL stores, a prune that touches a low-keyspace store and a high-keyspace store will compact essentially the entire DB range (including live data between them) each cycle. This is functionally correct but may make prune latency worse rather than better in some workloads — worth confirming with the on-node validation mentioned in the PR description, and consider whether per-store ranges would better match the 'compact only the pruned range' intent.
- Consider validating that the new CompactionConcurrencyRange (returning {1,4}) interacts well with existing L0 compaction settings under sustained write load, since bursting to 4 parallel compactions competes for I/O with foreground writes.
- 2 suggestion(s)/nit(s) flagged inline on specific lines.
| if err := db.SetEarliestVersion(earliestVersion, false); err != nil { | ||
| return err | ||
| } | ||
| return db.compactPrunedRange(firstDeletedKey, lastDeletedKey) |
There was a problem hiding this comment.
[suggestion] compactPrunedRange runs synchronously here, but the prune iterator (itr) and batch are only closed by the defers at the top of the function — so they are still open during the compaction. An open Pebble iterator pins the current LSM version, which prevents the obsolete sstables produced by this compaction from being physically deleted until the iterator is closed, partially defeating the immediate space-reclamation goal. Consider explicitly closing itr (and the already-committed batch) before calling compactPrunedRange.
There was a problem hiding this comment.
@Kbhat1 this is worth addressing in a follow up PR I think.
| // start < end. Appending a zero byte extends the user-key portion of last, | ||
| // yielding a key strictly greater than it under both the MVCC and default | ||
| // comparers, so the entire deleted span is covered. | ||
| end := append(slices.Clone(last), 0) |
There was a problem hiding this comment.
[nit] Minor: append(slices.Clone(last), 0) is correct and safe (clone prevents mutating last). For readability you could note that this relies on 0x00 being the minimum byte so the extended key sorts immediately after last for any suffix — which the test asserts. No change required.
|
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts. git fetch origin backport-3675-to-release/v6.6
git worktree add --checkout .worktree/backport-3675-to-release/v6.6 backport-3675-to-release/v6.6
cd .worktree/backport-3675-to-release/v6.6
git reset --hard HEAD^
git cherry-pick -x 42d7e20afddf039139502765bdb2ed50e4b45a8e
git push --force-with-lease |
- Pebble prune leaves tombstones uncompacted, so prune slows over uptime
- Bump compaction concurrency to {1,4} and compact each pruned range
right after
- Verifying in unit tests + on node
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 42d7e20)
Describe your changes and provide context
Testing performed to validate your change