Make sync completion slot-precise and remove the initial-sync to regular-sync handoff gap#16607
Open
Make sync completion slot-precise and remove the initial-sync to regular-sync handoff gap#16607
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TLDR: Go over
TestBlocksFetcher_bestNonFinalizedSlot_PreservesPeerHeadWithinEpochtest to quickly know what bug is being discussed. This test fails on develop.What type of PR is this?
Bug fix
What does this PR do? Why is it needed?
Problem
In e2e presubmits, checkpoint-sync nodes intermittently fail with:
received conflicting head epochs on node 2, expected 12, received 11This is a real sync issue exposed by e2e, not just an evaluator artifact.
How initial sync is supposed to work:
Start() → roundRobinSync() → markSynced()
Start() (beacon-chain/sync/initial-sync/service.go):
roundRobinSync() (beacon-chain/sync/initial-sync/round_robin.go)
In its non-finalized phase, roundRobinSync() uses a block queue that calls:
"What is the highest slot enough peers agree on?"
"Should the queue keep going or declare sync done?"
markSynced() (beacon-chain/sync/initial-sync/service.go)
In a perfect world:
Three things break this flow.
1. Slot precision loss in sync target
bestNonFinalizedSlot() asks peers for their head slots, but converts them through an epoch round-trip that truncates intra-epoch progress:
waitHighestExpectedSlot() re-checks the target when the node catches up. With the truncated value, it sees no further target and cancels the sync queue.
2. Same-epoch fast path skips sync entirely
Start() had a shortcut: if the node's head epoch matches the current epoch, skip initial sync. But "same epoch" can mean up to 7 slots behind (minimal config) or 31 slots behind (mainnet):
3. Before this change, the node could report sync complete before it was ready to follow new gossip blocks
markSynced() immediately sets Syncing()=false and closes InitialSyncComplete. Before this change, the regular-sync subscribe paths in subscriber.go were still waiting on InitialSyncComplete, so topic validators and subscriptions were only registered after the node already reported sync complete.
Why it's a flake
The handoff gap duration varies with goroutine scheduling and CI machine load. With minimal config (4-second slots, 32-second epochs), the window is narrow but non-trivial relative to slot time. An epoch boundary that lands in the gap can cause the failure; if it does not, the test usually passes. On mainnet, the much longer slot and epoch durations plus gossip redundancy from thousands of peers make this much less likely to surface.
Fix
1. Slot-precise sync target (blocks_fetcher_utils.go)
After BestNonFinalized() returns the quorum-backed target epoch + supporting peers, scan those peers for the highest actual HeadSlot within that epoch:
2. Slot-precise startup check (beacon-chain/sync/initial-sync/service.go)
3. Subscriptions active before markSynced() (subscriber.go)
Remove waitForInitialSync() from subscribe() and subscribeWithParameters(). Gossip validators already ignore messages while initialSync.Syncing() is true (e.g., validate_beacon_blocks.go), so subscriptions can safely be live before sync completes:
Which issues(s) does this PR fix?
Fixes #
Other notes for review
Acknowledgements