Skip to content

Skip synchronous snapshot during leader election to unblock quorum formation#140

Merged
laxman-ch merged 3 commits into
branch-3.6from
skip-leader-startup-snapshot
May 28, 2026
Merged

Skip synchronous snapshot during leader election to unblock quorum formation#140
laxman-ch merged 3 commits into
branch-3.6from
skip-leader-startup-snapshot

Conversation

@laxman-ch
Copy link
Copy Markdown
Collaborator

Summary

On large ensembles (15M+ znodes), the synchronous snapshot in ZooKeeperServer.loadData() takes 34–43s on production hardware, blocking quorum formation and causing repeated election failures when it exceeds initLimit (incident-10033: 4m 47s outage).

This PR introduces a feature flag (zookeeper.leaderElection.skipStartupSnapshot, default: false) that conditionally skips the startup snapshot during leader election. With the flag enabled, projected leader-election time drops from minutes to ~1s, independent of znode count.

Status: this is at the proposal stage. PR opened as draft for early visibility. Production rollout is gated on (1) thorough review with in-house ZK SMEs, (2) engagement with Apache ZK OSS committers on the design, and (3) staging (EI) validation.

Motivation

Incident-10033 was four consecutive identical snapshot writes — ~166s wasted before the 4th finally completed near the timeout, after which followers DIFF-synced in ~55ms. The disk snapshot file itself was never sent to followers (DIFF sync uses txn log entries, not the snapshot file). The 4 snapshot writes were redundant work that scaled linearly with znode count.

This is the same fix Flavio Junqueira committed to branch-3.4 in 2013 (ZOOKEEPER-1558), and the same fix proposed upstream in ZOOKEEPER-4766 (open).

Changes

Implementation (3 files, +50/-3)

  • ZooKeeperServer.java — new loadData(boolean skipSnapshot) overload; no-arg loadData() delegates to loadData(false) for backward compatibility.
  • Leader.java:590 — passes self.isSkipLeaderStartupSnapshot() to loadData().
  • QuorumPeer.java — system property zookeeper.leaderElection.skipStartupSnapshot (default false), getter/setter.

Tests (3 files, +247)

  • ZooKeeperServerTest — 3 tests: loadData(true) skips snapshot, loadData(false) takes snapshot, no-arg loadData() delegates to loadData(false).
  • QuorumPeerTest — 2 tests: default false, getter/setter correctness.
  • Zab1_0Test — 2 tests: leader with skip=true does not rewrite snapshot during lead(); leader with skip=false (default) does. Tests pre-initialize the DB via loadDataBase() to match the real election path where zkDb.isInitialized() == true.

All 82 tests pass (7 new + existing).

Safety

Skipping is safe because:

Scenario Why it's safe
Leader crashes before periodic snapshot Recovery replays existing snapshot + txn log → identical state. killSession() is idempotent. Same as branch-3.4 behavior, which ran without this snapshot for its full lifecycle.
Followers sync after election Followers use DIFF sync (txn log entries), not the disk snapshot file. Confirmed in incident-10033 logs.
Session data accuracy on restart killSession() is idempotent; SyncRequestProcessor triggers a periodic snapshot within seconds on a high-write ensemble.
Dirty/uncommitted state The pre-quorum snapshot was actually causing a known dirty-snapshot issue (per ZOOKEEPER-1558). Removing it eliminates that latent bug.

Rollout plan

  • Default OFF — flag must be explicitly enabled per host.
  • Reversible — disable flag, restart hosts, behavior reverts to current.
  • Production order: non-critical → critical non-Kafka → Kafka EI → Kafka prod non-tier-0 → Kafka prod tier-0.
  • Monitoring: election time, periodic snapshot time, QuorumDigest mismatches, session count.

References

  • Design doc and supporting evidence: https://drive.google.com/drive/folders/1jIOO_tJ1oEgCoJ75a4Q2wfr0YvqtHdbU
    • 01-proposal-snapshot-optimization.md — exec proposal
    • 02-incident-10033-log-correlation.md — production log evidence (all 5 cycles)
    • 03-approach-skip-snapshot.md — detailed design, safety analysis, test strategy
    • 04-zk-snapshot-mechanics.md — reference for ZK snapshot read/write paths
  • Apache JIRAs:
    • ZOOKEEPER-1558 (3.4-only fix, 2013) — fpj removed the same snapshot from loadData()
    • ZOOKEEPER-2678 (already in 3.6.3) — same skip-snapshot pattern applied to follower DIFF sync path
    • ZOOKEEPER-4766 (open) — upstream version of the leader-side fix

Test plan

  • Unit tests pass locally (82 total, 7 new)
  • In-house ZK SME review (Clark Haskins, Jon Bringhurst, Kiran Kambhampati)
  • Engagement with Apache ZK OSS committers on the design
  • EI staging validation (small ensemble, then growing)
  • Production rollout per the staged plan above

…rmation

On large ensembles (15M+ znodes), the synchronous snapshot in
ZooKeeperServer.loadData() takes 34-43s on production hardware,
blocking quorum formation and causing repeated election failures
when it exceeds initLimit (incident-10033: 4m47s outage).

Implementation:
- ZooKeeperServer.loadData(boolean skipSnapshot): new overload that
  conditionally skips the startup snapshot. The no-arg loadData()
  delegates to loadData(false) for backward compatibility.
- Leader.lead(): passes QuorumPeer.isSkipLeaderStartupSnapshot()
  to loadData(), controlled by system property.
- QuorumPeer: new config property
  zookeeper.leaderElection.skipStartupSnapshot (default: false).

Safety: skipping is safe because killSession() is idempotent on
recovery, follower sync uses the in-memory DataTree (not disk),
and SyncRequestProcessor takes periodic snapshots after quorum.
This matches the approach in ZOOKEEPER-1558 (branch-3.4, 2013).

Tests:
- ZooKeeperServerTest: loadData(true) skips snapshot, loadData(false)
  takes snapshot, no-arg loadData() delegates to loadData(false).
- QuorumPeerTest: skipLeaderStartupSnapshot defaults to false,
  getter/setter works correctly.
- Zab1_0Test: leader with skipLeaderStartupSnapshot=true does not
  rewrite snapshot during lead(); leader with it disabled does.

All 82 existing + new tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Sanju98 Sanju98 marked this pull request as ready for review May 19, 2026 09:51
@Sanju98 Sanju98 self-requested a review May 25, 2026 06:20
Sanju98
Sanju98 previously approved these changes May 25, 2026
Copy link
Copy Markdown

@Sanju98 Sanju98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Discussed the other comments internally.

Production changes:
- Leader.lead() (Leader.java:590): honor shouldForceWriteInitialSnapshotAfterLeaderElection()
  by AND'ing the skip flag with the negation of the safety predicate. Mirrors the
  follower DIFF-sync gate added by ZOOKEEPER-2678 (Learner.java:553-556). Without
  this, a fresh ensemble post-upgrade — where trustEmptySnapshot=true and
  getLastSnapshotInfo()==null — would skip the very snapshot the existing safety
  mechanism was designed to write.
- QuorumPeer.skipLeaderStartupSnapshot: add 'volatile'. The setter is JMX-reachable;
  without volatile, a flip from a non-leader thread is not guaranteed visible to
  the leader thread reading via isSkipLeaderStartupSnapshot() at the start of
  lead(). Matches the convention of adjacent volatile fields initLimit, syncLimit,
  and connectToLearnerMasterLimit.
- ZooKeeperServer.loadData(true) log line: include lastProcessedZxid (hex). The
  Skipping startup snapshot INFO line is the rollout-monitoring signal; the zxid
  enables correlation to the leader election epoch in operational triage.

Test changes:
- Zab1_0Test: remove unused getSocketPair() allocation in
  testLeaderSkipsSnapshotWhenConfigured / testLeaderTakesSnapshotByDefault
  (each opened a ServerSocket + 2 Sockets that were never used or closed,
  leaking FDs in CI). Extract waitForCnxAcceptor() helper with a 10s
  deadline so a leader-thread death in loadData() fails fast with a clear
  message instead of hanging until the JUnit suite timeout.
- All tests using lastModified() mtime comparison: bump Thread.sleep(50) to
  Thread.sleep(1100) to cover HFS+ 1s mtime granularity on macOS dev hosts
  (APFS/ext4 have finer granularity and are unaffected).
- ZooKeeperServerTest.testLoadDataTakesSnapshotWhenForceWriteRequested:
  new test that documents the Leader.java:590 composition collapsing to
  loadData(false) when the upgrade-safety gate fires, ensuring a snapshot
  is taken regardless of the skip flag.
- Zab1_0Test.testFollowerPathUnaffectedBySkipFlag: new regression guard
  re-running testNormalFollowerRunWithDiff with the skip flag enabled
  globally. The follower path must not branch on this flag; if a future
  refactor leaks it into Learner / Follower, this test diverges and flags
  the regression.
- QuorumPeerTest.testSkipLeaderStartupSnapshotDefaultsToFalse: clear the
  system property in a try/finally so the default-value assertion is not
  affected by Surefire forked-JVM property pollution or test ordering.

All targeted tests (ZooKeeperServerTest, QuorumPeerTest, Zab1_0Test) pass.
Regression set (DIFFSyncConsistencyTest, LearnerHandlerTest, LearnerTest,
SnapshotDigestTest, InvalidSnapshotTest, QuorumPeerMainTest, ObserverMasterTest)
also passes — 93 tests total, 0 failures. Checkstyle and SpotBugs clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sanju98
Sanju98 previously approved these changes May 28, 2026
Earlier commit added a guard to Leader.lead() that AND'd the skip flag
with the negation of shouldForceWriteInitialSnapshotAfterLeaderElection().
The intent was to mirror the symmetric Learner DIFF-sync gate
(ZOOKEEPER-2678) so a fresh-ensemble post-upgrade leader would not skip
the safety-mandated initial snapshot.

The guard reproduces the original incident-10033 pattern in the exact
scenario it was meant to handle:

  1. Operator brings up a huge ensemble (e.g. 15M znodes) for disaster
     recovery or fresh bootstrap, with -Dzookeeper.snapshot.trust.empty=true
     and no snapshot on disk (only txn log).
  2. QuorumPeer.start() -> loadDataBase() -> restore() replays the txn log
     into the in-memory DataTree. zkDb is initialized.
  3. Election runs; some node becomes leader.
  4. Leader.lead() -> shouldForceWriteInitialSnapshotAfterLeaderElection()
     returns true (trustEmptySnapshot && getLastSnapshotInfo() == null).
  5. Guard composition collapses to loadData(false) -> takeSnapshot() ->
     34-43s pause on 15M znodes.
  6. initLimit (40s) exceeded -> followers timeout -> election fails ->
     incident-10033 pattern reproduced.

The Learner DIFF-sync gate is a genuine correctness requirement because
follower DIFF sync does not otherwise write a snapshot, and ZOOKEEPER-3781
documented an upgrade case where divergence resulted. The leader's
startup snapshot, by contrast, has never been a correctness requirement:
fpj called it "convenient" in ZOOKEEPER-1558 and removed it from 3.4.
Recovery via txn log replay plus the next periodic snapshot preserves
correctness. The skip flag is opt-in via cfg2; operators bringing up a
huge ensemble with trust.empty=true who want bounded recovery rather
than fast election can disable the flag.

Drop the corresponding test ZooKeeperServerTest.testLoadDataTakesSnapshot
WhenForceWriteRequested - it was documenting the composition we are no
longer doing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@laxman-ch
Copy link
Copy Markdown
Collaborator Author

Updated based on review comments — please re-review. @Sanju98

Three inline comments addressed. One important amendment after deeper analysis: I added the shouldForceWriteInitialSnapshotAfterLeaderElection() guard you suggested (8aa68d9ad), then reverted it (c7569c237) — explanation below.

Production changes (now in place)

  • volatile on skipLeaderStartupSnapshot (QuorumPeer.java:732) — matches initLimit / syncLimit convention; closes the JMX-setter visibility gap.
  • lastProcessedZxid in the skip-path log line (ZooKeeperServer.java:555) — needed for rollout-monitoring correlation in Observe Logs.
  • On the metric ask: opted for log-only (the INFO line is searchable in Observe Logs and isSkipLeaderStartupSnapshot() on JMX covers static state). Happy to add a counter later if you'd prefer fleet-level dashboarding.

shouldForceWrite guard — added then reverted

The guard reproduces the original incident-10033 pattern in the exact scenario it was meant to handle. On a huge ensemble (e.g. 15M znodes) being bootstrapped or recovered with -Dzookeeper.snapshot.trust.empty=true and no on-disk snapshot:

  1. QuorumPeer.start() → loadDataBase() → restore() replays txn log into in-memory DataTree
  2. Election runs; new leader enters Leader.lead()
  3. shouldForceWriteInitialSnapshotAfterLeaderElection() returns true (trustEmptySnapshot && getLastSnapshotInfo()==null)
  4. Guard composition collapses to loadData(false) → takeSnapshot() → 34–43s pause on 15M znodes
  5. initLimit (40s) exceeded → followers timeout → election cycles → incident-10033 redux

The Learner DIFF-sync gate (Learner.java:553) IS a genuine correctness requirement (follower DIFF sync doesn't otherwise write a snapshot, per ZOOKEEPER-3781). The leader's startup snapshot, by contrast, has never been a correctness requirement — fpj called it "convenient" in ZOOKEEPER-1558 and removed it from 3.4. Recovery via txn log replay + the next periodic snapshot preserves correctness. The skip flag is opt-in via cfg2; operators in a recovery scenario who want bounded-recovery rather than fast election can disable the flag.

Happy to discuss further if you'd prefer a different shape (e.g. a separate forceSnapshotOnUpgrade flag that's mutually exclusive with skip), but I think the cleanest answer is to let operators choose between fast election (current behavior) and bounded recovery (disable skip) per ensemble.

Test changes (in 8aa68d9ad)

  • Zab1_0Test: removed an unused getSocketPair() allocation (FD leak), bounded the cnxAcceptor.isAlive() poll with a 10s deadline (no-timeout hang), bumped Thread.sleep(50)1100 (HFS+ 1s mtime granularity on macOS dev hosts).
  • Zab1_0Test.testFollowerPathUnaffectedBySkipFlag — new regression guard re-running testNormalFollowerRunWithDiff with the flag globally enabled.
  • QuorumPeerTest.testSkipLeaderStartupSnapshotDefaultsToFalse now pins/restores the system property to avoid Surefire-forked-JVM pollution.
  • (Dropped testLoadDataTakesSnapshotWhenForceWriteRequested along with the guard revert.)

All targeted tests + leader/learner/snapshot/observer regression set pass locally (92 tests, 0 failures). Checkstyle + SpotBugs clean.

@laxman-ch laxman-ch requested a review from abhilash1in May 28, 2026 10:47
@laxman-ch laxman-ch requested a review from adityaagg09 May 28, 2026 14:28
Copy link
Copy Markdown
Collaborator

@adityaagg09 adityaagg09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@laxman-ch laxman-ch removed the request for review from abhilash1in May 28, 2026 14:41
@laxman-ch laxman-ch merged commit 556551a into branch-3.6 May 28, 2026
24 of 32 checks passed
@laxman-ch laxman-ch deleted the skip-leader-startup-snapshot branch May 28, 2026 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants