Skip synchronous snapshot during leader election to unblock quorum formation #140
…rmation

On large ensembles (15M+ znodes), the synchronous snapshot in
ZooKeeperServer.loadData() takes 34-43s on production hardware, blocking
quorum formation and causing repeated election failures when it exceeds
initLimit (incident-10033: 4m47s outage).

Implementation:
- ZooKeeperServer.loadData(boolean skipSnapshot): new overload that
  conditionally skips the startup snapshot. The no-arg loadData() delegates
  to loadData(false) for backward compatibility.
- Leader.lead(): passes QuorumPeer.isSkipLeaderStartupSnapshot() to
  loadData(), controlled by system property.
- QuorumPeer: new config property
  zookeeper.leaderElection.skipStartupSnapshot (default: false).

Safety: skipping is safe because killSession() is idempotent on recovery,
follower sync uses the in-memory DataTree (not disk), and
SyncRequestProcessor takes periodic snapshots after quorum. This matches
the approach in ZOOKEEPER-1558 (branch-3.4, 2013).

Tests:
- ZooKeeperServerTest: loadData(true) skips snapshot, loadData(false) takes
  snapshot, no-arg loadData() delegates to loadData(false).
- QuorumPeerTest: skipLeaderStartupSnapshot defaults to false, getter/setter
  works correctly.
- Zab1_0Test: leader with skipLeaderStartupSnapshot=true does not rewrite
  snapshot during lead(); leader with it disabled does.

All 82 existing + new tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
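The overload-plus-delegation shape described in this commit can be sketched as follows. This is a minimal stand-in, not the actual ZooKeeperServer code: the class name SnapshotSkippingServer, the restoreState() placeholder, and the snapshot counter are assumptions made for illustration only.

```java
// Hypothetical stand-in for the server class; NOT the real ZooKeeperServer.
final class SnapshotSkippingServer {
    // Counts snapshots so the behavior is observable in this sketch.
    private int snapshotsTaken = 0;

    // The no-arg form keeps the pre-existing behavior by delegating with
    // skipSnapshot=false, so existing callers are unaffected.
    void loadData() {
        loadData(false);
    }

    // New overload: state is restored either way; only the synchronous
    // snapshot at the end is conditional.
    void loadData(boolean skipSnapshot) {
        restoreState();
        if (skipSnapshot) {
            System.out.println("Skipping startup snapshot");
            return;
        }
        takeSnapshot();
    }

    // Placeholder (assumption) for database restore and session cleanup.
    private void restoreState() {
    }

    // Placeholder (assumption) for the expensive synchronous snapshot write.
    private void takeSnapshot() {
        snapshotsTaken++;
    }

    int snapshotsTaken() {
        return snapshotsTaken;
    }
}
```

Keeping the no-arg method as a thin delegate means the default path is unchanged and only the caller in the leader path has to opt in.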
Sanju98
left a comment
LGTM
Discussed the other comments internally.
Production changes:
- Leader.lead() (Leader.java:590): honor
  shouldForceWriteInitialSnapshotAfterLeaderElection() by AND'ing the skip
  flag with the negation of the safety predicate. Mirrors the follower
  DIFF-sync gate added by ZOOKEEPER-2678 (Learner.java:553-556). Without
  this, a fresh ensemble post-upgrade (where trustEmptySnapshot=true and
  getLastSnapshotInfo()==null) would skip the very snapshot the existing
  safety mechanism was designed to write.
- QuorumPeer.skipLeaderStartupSnapshot: add 'volatile'. The setter is
  JMX-reachable; without volatile, a flip from a non-leader thread is not
  guaranteed visible to the leader thread reading via
  isSkipLeaderStartupSnapshot() at the start of lead(). Matches the
  convention of adjacent volatile fields initLimit, syncLimit, and
  connectToLearnerMasterLimit.
- ZooKeeperServer.loadData(true) log line: include lastProcessedZxid (hex).
  The "Skipping startup snapshot" INFO line is the rollout-monitoring
  signal; the zxid enables correlation to the leader election epoch in
  operational triage.

Test changes:
- Zab1_0Test: remove unused getSocketPair() allocation in
  testLeaderSkipsSnapshotWhenConfigured / testLeaderTakesSnapshotByDefault
  (each opened a ServerSocket + 2 Sockets that were never used or closed,
  leaking FDs in CI). Extract waitForCnxAcceptor() helper with a 10s
  deadline so a leader-thread death in loadData() fails fast with a clear
  message instead of hanging until the JUnit suite timeout.
- All tests using lastModified() mtime comparison: bump Thread.sleep(50) to
  Thread.sleep(1100) to cover HFS+ 1s mtime granularity on macOS dev hosts
  (APFS/ext4 have finer granularity and are unaffected).
- ZooKeeperServerTest.testLoadDataTakesSnapshotWhenForceWriteRequested: new
  test that documents the Leader.java:590 composition collapsing to
  loadData(false) when the upgrade-safety gate fires, ensuring a snapshot
  is taken regardless of the skip flag.
- Zab1_0Test.testFollowerPathUnaffectedBySkipFlag: new regression guard
  re-running testNormalFollowerRunWithDiff with the skip flag enabled
  globally. The follower path must not branch on this flag; if a future
  refactor leaks it into Learner / Follower, this test diverges and flags
  the regression.
- QuorumPeerTest.testSkipLeaderStartupSnapshotDefaultsToFalse: clear the
  system property in a try/finally so the default-value assertion is not
  affected by Surefire forked-JVM property pollution or test ordering.

All targeted tests (ZooKeeperServerTest, QuorumPeerTest, Zab1_0Test) pass.
Regression set (DIFFSyncConsistencyTest, LearnerHandlerTest, LearnerTest,
SnapshotDigestTest, InvalidSnapshotTest, QuorumPeerMainTest,
ObserverMasterTest) also passes: 93 tests total, 0 failures. Checkstyle and
SpotBugs clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
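The fail-fast wait described for waitForCnxAcceptor() follows a common polling-with-deadline pattern. The sketch below shows that general pattern only; the class DeadlineWait, the method waitUntil, and its parameters are hypothetical names, and the real helper in Zab1_0Test may look different.

```java
// Hypothetical helper illustrating a bounded wait; not the code from Zab1_0Test.
final class DeadlineWait {
    private DeadlineWait() {
    }

    // Polls the condition until it holds or timeoutMs elapses. On timeout it
    // throws with a descriptive message instead of blocking indefinitely.
    static void waitUntil(java.util.function.BooleanSupplier condition,
                          long timeoutMs,
                          String what) {
        long deadline = System.nanoTime()
                + java.util.concurrent.TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!condition.getAsBoolean()) {
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError(
                        "Timed out after " + timeoutMs + " ms waiting for " + what);
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and fail the wait.
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting for " + what);
            }
        }
    }
}
```

A test would call something like DeadlineWait.waitUntil(() -> acceptorStarted, 10_000, "cnxAcceptor"), where acceptorStarted stands for whatever readiness signal the test can observe.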
Earlier commit added a guard to Leader.lead() that AND'd the skip flag
with the negation of shouldForceWriteInitialSnapshotAfterLeaderElection().
The intent was to mirror the symmetric Learner DIFF-sync gate
(ZOOKEEPER-2678) so a fresh-ensemble post-upgrade leader would not skip
the safety-mandated initial snapshot.
The guard reproduces the original incident-10033 pattern in the exact
scenario it was meant to handle:
1. Operator brings up a huge ensemble (e.g. 15M znodes) for disaster
recovery or fresh bootstrap, with -Dzookeeper.snapshot.trust.empty=true
and no snapshot on disk (only txn log).
2. QuorumPeer.start() -> loadDataBase() -> restore() replays the txn log
into the in-memory DataTree. zkDb is initialized.
3. Election runs; some node becomes leader.
4. Leader.lead() -> shouldForceWriteInitialSnapshotAfterLeaderElection()
returns true (trustEmptySnapshot && getLastSnapshotInfo() == null).
5. Guard composition collapses to loadData(false) -> takeSnapshot() ->
34-43s pause on 15M znodes.
6. initLimit (40s) exceeded -> followers time out -> election fails ->
incident-10033 pattern reproduced.
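The boolean composition behind steps 4 and 5 can be written out as a small sketch. The class and method names here are hypothetical and the inputs are reduced to plain booleans; only the two expressions (trustEmptySnapshot with no snapshot on disk, and the skip flag AND'd with the negated predicate) come from the description above.

```java
// Hypothetical illustration of the guard from the earlier commit.
final class LeaderSnapshotGuard {
    private LeaderSnapshotGuard() {
    }

    // Step 4: the force-write predicate fires when empty snapshots are
    // trusted and no snapshot exists on disk.
    static boolean shouldForceWrite(boolean trustEmptySnapshot, boolean hasSnapshotOnDisk) {
        return trustEmptySnapshot && !hasSnapshotOnDisk;
    }

    // Step 5: the earlier guard AND'd the skip flag with the negated
    // predicate, so a firing predicate always overrides the flag.
    static boolean effectiveSkip(boolean skipFlag, boolean forceWrite) {
        return skipFlag && !forceWrite;
    }
}
```

With skipFlag=true, trustEmptySnapshot=true, and no snapshot on disk, effectiveSkip evaluates to false, which is the collapse to loadData(false) that the list describes.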
The Learner DIFF-sync gate is a genuine correctness requirement because
follower DIFF sync does not otherwise write a snapshot, and ZOOKEEPER-3781
documented an upgrade case where divergence resulted. The leader's
startup snapshot, by contrast, has never been a correctness requirement:
fpj called it "convenient" in ZOOKEEPER-1558 and removed it from 3.4.
Recovery via txn log replay plus the next periodic snapshot preserves
correctness. The skip flag is opt-in via cfg2; operators bringing up a
huge ensemble with trust.empty=true who want bounded recovery rather
than fast election can disable the flag.
Drop the corresponding test
ZooKeeperServerTest.testLoadDataTakesSnapshotWhenForceWriteRequested - it
was documenting the composition we are no longer doing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updated based on review comments — please re-review. @Sanju98 Three inline comments addressed. One important amendment after deeper analysis: I added the Production changes (now in place)
Summary
On large ensembles (15M+ znodes), the synchronous snapshot in ZooKeeperServer.loadData() takes 34–43s on production hardware, blocking quorum formation and causing repeated election failures when it exceeds initLimit (incident-10033: 4m 47s outage).
This PR introduces a feature flag (zookeeper.leaderElection.skipStartupSnapshot, default: false) that conditionally skips the startup snapshot during leader election. With the flag enabled, projected leader-election time drops from minutes to ~1s, independent of znode count.
Motivation
Incident-10033 was four consecutive identical snapshot writes — ~166s wasted before the 4th finally completed near the timeout, after which followers DIFF-synced in ~55ms. The disk snapshot file itself was never sent to followers (DIFF sync uses txn log entries, not the snapshot file). The 4 snapshot writes were redundant work that scaled linearly with znode count.
This is the same fix Flavio Junqueira committed to branch-3.4 in 2013 (ZOOKEEPER-1558), and the same fix proposed upstream in ZOOKEEPER-4766 (open).
Changes
Implementation (3 files, +50/-3)
- ZooKeeperServer.java: new loadData(boolean skipSnapshot) overload; no-arg loadData() delegates to loadData(false) for backward compatibility.
- Leader.java:590: passes self.isSkipLeaderStartupSnapshot() to loadData().
- QuorumPeer.java: system property zookeeper.leaderElection.skipStartupSnapshot (default false), getter/setter.
Tests (3 files, +247)
- ZooKeeperServerTest (3 tests): loadData(true) skips snapshot, loadData(false) takes snapshot, no-arg loadData() delegates to loadData(false).
- QuorumPeerTest (2 tests): default false, getter/setter correctness.
- Zab1_0Test (2 tests): leader with skip=true does not rewrite snapshot during lead(); leader with skip=false (default) does. Tests pre-initialize the DB via loadDataBase() to match the real election path where zkDb.isInitialized() == true.
All 82 tests pass (7 new + existing).
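As a rough illustration of the flag plumbing listed above, the sketch below reads the system property into a volatile field with a getter and setter. The holder class SkipSnapshotConfig is hypothetical and is not the QuorumPeer code; only the property name, the default of false, and the accessor names come from this PR.

```java
// Hypothetical holder for the flag; the real field lives in QuorumPeer.
final class SkipSnapshotConfig {
    static final String PROPERTY = "zookeeper.leaderElection.skipStartupSnapshot";

    // Boolean.getBoolean returns false when the property is unset, which
    // gives the documented default. volatile makes a later change through
    // the setter visible to a thread that reads the flag afterwards.
    private volatile boolean skipLeaderStartupSnapshot = Boolean.getBoolean(PROPERTY);

    boolean isSkipLeaderStartupSnapshot() {
        return skipLeaderStartupSnapshot;
    }

    void setSkipLeaderStartupSnapshot(boolean skip) {
        this.skipLeaderStartupSnapshot = skip;
    }
}
```

An operator would opt in by starting the JVM with -Dzookeeper.leaderElection.skipStartupSnapshot=true. In a test, clearing or restoring the property in a finally block keeps the default-value assertion independent of other tests in the same JVM.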
Safety
Skipping is safe because:
- killSession() is idempotent. Same as branch-3.4 behavior, which ran without this snapshot for its full lifecycle.
- killSession() is idempotent; SyncRequestProcessor triggers a periodic snapshot within seconds on a high-write ensemble.
Rollout plan
QuorumDigest mismatches, session count.
References
- 01-proposal-snapshot-optimization.md: exec proposal
- 02-incident-10033-log-correlation.md: production log evidence (all 5 cycles)
- 03-approach-skip-snapshot.md: detailed design, safety analysis, test strategy
- 04-zk-snapshot-mechanics.md: reference for ZK snapshot read/write paths
loadData()
Test plan