Make sync completion slot-precise and remove the initial-sync to regular-sync handoff gap #16607

Open
satushh wants to merge 1 commit into develop from init-sync-bug-fix

satushh (Collaborator) commented Mar 30, 2026

TL;DR: Read the TestBlocksFetcher_bestNonFinalizedSlot_PreservesPeerHeadWithinEpoch test for a quick view of the bug under discussion. This test fails on develop.

What type of PR is this?

Bug fix

What does this PR do? Why is it needed?

Problem

In e2e presubmits, checkpoint-sync nodes intermittently fail with:

received conflicting head epochs on node 2, expected 12, received 11

This is a real sync issue exposed by e2e, not just an evaluator artifact.

How initial sync is supposed to work:

Start() → roundRobinSync() → markSynced()

Start() (beacon-chain/sync/initial-sync/service.go):

  • Compares head vs current slot
  • Skips sync if already caught up

roundRobinSync() (beacon-chain/sync/initial-sync/round_robin.go)

In its non-finalized phase, roundRobinSync() uses a block queue that calls:

  • bestNonFinalizedSlot() (blocks_fetcher_utils.go)
    "What is the highest slot enough peers agree on?"
  • waitHighestExpectedSlot() (blocks_queue.go)
    "Should the queue keep going or declare sync done?"

markSynced() (beacon-chain/sync/initial-sync/service.go)

  • Sets Syncing()=false
  • Closes InitialSyncComplete channel

In a perfect world:

Peers at slot 92
       │
       ▼
bestNonFinalizedSlot() returns 92
       │
       ▼
waitHighestExpectedSlot() sees target=92, keeps queue running
       │
       ▼
Node reaches slot 92, sync queue ends
       │
       ▼
Gossip subscriptions already active by the time markSynced() runs
       │
       ▼
Node can immediately follow new head blocks, with no handoff gap

Three things break this flow.

1. Slot precision loss in sync target

bestNonFinalizedSlot() asks peers for their head slots, but on develop it converts them through an epoch round-trip that truncates intra-epoch progress:

What peers report:    HeadSlot = 92

What bestNonFinalizedSlot() did:

  slot 92 → ToEpoch → epoch 11 → epoch 11 × 8 → slot 88
                                                  ▲
                                                  lost 4 slots

┌─── Epoch 11 (slots 88─95) ─────────────────────────┐
│                                                    │
│  88    89    90    91    92    93    94    95      │
│  ▲                       ▲                         │
│  │                       └── peers are here        │
│  └── sync target (truncated to epoch start)        │
│                                                    │
└────────────────────────────────────────────────────┘
                                                                                                                                                                                                             
blocks_fetcher_utils.go ── bestNonFinalizedSlot()
blocks_queue.go         ── waitHighestExpectedSlot() uses this to decide if sync is done  

waitHighestExpectedSlot() re-checks the target when the node catches up. With the truncated value, it sees no further target and cancels the sync queue.
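The truncation can be reproduced with plain integer arithmetic. The following is a minimal sketch, not the code in blocks_fetcher_utils.go: `slotsPerEpoch`, `toEpoch`, and `epochStart` are hypothetical stand-ins, and 8 slots per epoch is assumed to match the numbers in the diagram above.

```go
package main

import "fmt"

// slotsPerEpoch is assumed to be 8, matching the diagram above.
const slotsPerEpoch = 8

// toEpoch and epochStart are hypothetical stand-ins for the slot/epoch helpers.
func toEpoch(slot uint64) uint64     { return slot / slotsPerEpoch }
func epochStart(epoch uint64) uint64 { return epoch * slotsPerEpoch }

func main() {
	peerHead := uint64(92)
	// Round-tripping through the epoch drops the intra-epoch offset.
	target := epochStart(toEpoch(peerHead))
	fmt.Println(target, peerHead-target) // 88 4
}
```

Integer division discards the remainder (92 mod 8 = 4), which is exactly the number of slots lost from the target.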

2. Same-epoch fast path skips sync entirely

Start() had a shortcut: if the node's head epoch matches the current epoch, skip initial sync. But "same epoch" can mean up to 7 slots behind (minimal config) or 31 slots behind (mainnet):

beacon-chain/sync/initial-sync/service.go (old):
  if ToEpoch(HeadSlot) == ToEpoch(currentSlot) → markSynced()

┌─── Epoch 11 ───────────────────────────────────────┐
│                                                    │
│  88    89    90    91    92    93    94    95      │
│  ▲                                         ▲       │
│  HeadSlot                            currentSlot   │
│                                                    │
│  ToEpoch(88) == ToEpoch(95) == 11                  │
│  → "already synced!" (7 slots behind)              │
└────────────────────────────────────────────────────┘

3. Sync reported complete before the node can follow new gossip blocks

markSynced() immediately sets Syncing()=false and closes InitialSyncComplete. Before this change, the regular-sync subscribe paths in subscriber.go waited on InitialSyncComplete, so topic validators and subscriptions were registered only after the node had already reported sync complete.

BEFORE (handoff gap):

markSynced()
  ├─ synced.Set()                    ← Syncing()=false instantly
  └─ close(InitialSyncComplete)      ← subscriber goroutines start waking
                                        RegisterTopicValidator()...
                                        SubscribeToTopic()...
       ◄──── GAP ────►                  ← node may not yet be subscribed
                                          to gossip; new head blocks can
                                          be delayed during the handoff

If an epoch boundary block lands in the gap:
  The node can remain at epoch 11 while peers advance to epoch 12.
  resyncIfBehind() requires >1 epoch behind to trigger (rpc_status.go),
  so being exactly 1 epoch behind does not trigger automatic resync,
  which lets the lag persist long enough for the evaluator to observe it.
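To see why a one-epoch lag is not self-healing, the threshold can be modelled as below. This is a simplified sketch of the ">1 epoch behind" condition as described here, not the actual logic in rpc_status.go; `shouldResync`, `toEpoch`, and the 8-slot epoch are assumptions for illustration.

```go
package main

import "fmt"

// slotsPerEpoch is assumed to be 8, matching the diagrams above.
const slotsPerEpoch = 8

func toEpoch(slot uint64) uint64 { return slot / slotsPerEpoch }

// shouldResync is a hypothetical simplification: resync only when the
// peers' best epoch is more than one epoch ahead of our head epoch.
func shouldResync(headSlot, peerBestSlot uint64) bool {
	return toEpoch(peerBestSlot) > toEpoch(headSlot)+1
}

func main() {
	fmt.Println(shouldResync(95, 96))  // false: epoch 11 vs epoch 12
	fmt.Println(shouldResync(95, 104)) // true: epoch 11 vs epoch 13
}
```

Under this model a node stuck at epoch 11 while peers are at epoch 12 never crosses the threshold, so only gossip can bring it forward.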

Why it's a flake

The handoff gap duration varies with goroutine scheduling and CI machine load. With minimal config (4-second slots, 32-second epochs), the window is narrow but non-trivial relative to slot time. An epoch boundary that lands in the gap can cause the failure; if it does not, the test usually passes. On mainnet, the much longer slot and epoch durations plus gossip redundancy from thousands of peers make this much less likely to surface.

Fix

1. Slot-precise sync target (blocks_fetcher_utils.go)

After BestNonFinalized() returns the quorum-backed target epoch + supporting peers, scan those peers for the highest actual HeadSlot within that epoch:

AFTER:
  slot 92 → BestNonFinalized → epoch 11 + peers [A, B]
  scan peers: A.HeadSlot=92, B.HeadSlot=92                                                                                                                                                                   
  return max(92, 92) = 92        ← slot-precise
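The peer scan could look roughly like the sketch below. It is illustrative only: `highestHeadSlotInEpoch` and the `map[string]uint64` of peer heads are hypothetical, and falling back to the epoch start slot when no peer head lies in the target epoch is an assumption of this sketch rather than something stated in the PR.

```go
package main

import "fmt"

// slotsPerEpoch is assumed to be 8, matching the diagrams above.
const slotsPerEpoch = 8

func toEpoch(slot uint64) uint64 { return slot / slotsPerEpoch }

// highestHeadSlotInEpoch (hypothetical) returns the highest head slot that
// the supporting peers report within targetEpoch.
func highestHeadSlotInEpoch(targetEpoch uint64, peerHeads map[string]uint64) uint64 {
	// Assumption: fall back to the epoch start slot if no peer qualifies.
	best := targetEpoch * slotsPerEpoch
	for _, head := range peerHeads {
		if toEpoch(head) == targetEpoch && head > best {
			best = head
		}
	}
	return best
}

func main() {
	heads := map[string]uint64{"A": 92, "B": 92}
	fmt.Println(highestHeadSlotInEpoch(11, heads)) // 92
}
```

Restricting the scan to heads inside the target epoch keeps the result bounded by the epoch that the quorum agreed on, so a single peer reporting a later epoch cannot raise the target.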

2. Slot-precise startup check (beacon-chain/sync/initial-sync/service.go)

OLD:  if ToEpoch(HeadSlot) == ToEpoch(currentSlot) → markSynced()                                                                                                                                            
NEW:  if HeadSlot >= currentSlot                    → markSynced()
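The difference between the two checks shows up with the slots from the Epoch 11 diagram. This is a minimal sketch with hypothetical function names (`oldAlreadySynced`, `newAlreadySynced`) and an assumed 8-slot epoch; it mirrors the OLD/NEW conditions above rather than the surrounding code in service.go.

```go
package main

import "fmt"

// slotsPerEpoch is assumed to be 8, matching the diagrams above.
const slotsPerEpoch = 8

func toEpoch(slot uint64) uint64 { return slot / slotsPerEpoch }

// oldAlreadySynced mirrors the OLD epoch-level comparison.
func oldAlreadySynced(headSlot, currentSlot uint64) bool {
	return toEpoch(headSlot) == toEpoch(currentSlot)
}

// newAlreadySynced mirrors the NEW slot-level comparison.
func newAlreadySynced(headSlot, currentSlot uint64) bool {
	return headSlot >= currentSlot
}

func main() {
	// Head at slot 88, wall clock at slot 95: 7 slots behind.
	fmt.Println(oldAlreadySynced(88, 95), newAlreadySynced(88, 95)) // true false
	// Head at the current slot: both checks agree.
	fmt.Println(oldAlreadySynced(95, 95), newAlreadySynced(95, 95)) // true true
}
```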

3. Subscriptions active before markSynced() (subscriber.go)

Remove waitForInitialSync() from subscribe() and subscribeWithParameters(). Gossip validators already ignore messages while initialSync.Syncing() is true (e.g., validate_beacon_blocks.go), so subscriptions can safely be live before sync completes:

AFTER (no gap):                                         

Subscriptions registered at startup
  ├─ topics active, validators reject while syncing
  │
... initial sync runs ...
  │                                                                                                                                                                                                          
markSynced()
  ├─ synced.Set()                    ← Syncing()=false                                                                                                                                                       
  └─ validators start accepting      ← gossip already subscribed
                                     ← NO GAP
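The reason early subscription is safe can be shown with a validator gated on the sync flag. The sketch below is a simplified model, not the code in validate_beacon_blocks.go or service.go: `syncState`, `validateBlock`, and the `ignore`/`accept` results are hypothetical names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// result is a hypothetical stand-in for a gossip validation outcome.
type result int

const (
	ignore result = iota
	accept
)

// syncState is a hypothetical model of the initial-sync service's flag.
type syncState struct{ synced atomic.Bool }

func (s *syncState) Syncing() bool { return !s.synced.Load() }
func (s *syncState) markSynced()   { s.synced.Store(true) }

// validateBlock ignores gossip while initial sync is still running.
func validateBlock(s *syncState) result {
	if s.Syncing() {
		return ignore
	}
	return accept
}

func main() {
	s := &syncState{}
	// The validator is registered from startup but ignores messages.
	fmt.Println(validateBlock(s) == ignore) // true
	s.markSynced()
	// The same validator accepts as soon as the flag flips.
	fmt.Println(validateBlock(s) == accept) // true
}
```

Because the validator is already registered, flipping the flag is the only step between "syncing" and "following gossip", which is what removes the handoff window in this model.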

Which issue(s) does this PR fix?

Fixes #

Other notes for review

Acknowledgements

  • I have read CONTRIBUTING.md.
  • I have included a uniquely named changelog fragment file.
  • I have added a description with sufficient context for reviewers to understand this PR.
  • I have tested that my changes work as expected and I added a testing plan to the PR description (if applicable).

@satushh satushh marked this pull request as ready for review March 30, 2026 18:14