Skip to content

fix: Publisher-mode synchronization option for failover scenario#3222

Draft
alpe wants to merge 2 commits intomainfrom
alex/sync_race
Draft

fix: Publisher-mode synchronization option for failover scenario#3222
alpe wants to merge 2 commits intomainfrom
alex/sync_race

Conversation

@alpe
Copy link
Copy Markdown
Contributor

@alpe alpe commented Mar 31, 2026

Overview

E2E HA tests fail sometimes on a race when the leader is waiting for p2p sync complete on a fresh start.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 31, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2348f637-fa17-42a5-a4d2-c55c2ca7e66e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch alex/sync_race

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedMar 31, 2026, 4:07 PM

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 31, 2026

Claude finished @alpe's task in 3m 5s —— View job


Code Review

  • Gather context and read changed files
  • Review pkg/sync/syncer_status.go and pkg/sync/syncer_status_test.go
  • Review pkg/sync/sync_service.go and pkg/sync/sync_service_test.go
  • Review node/failover.go
  • Review test/e2e/ changes
  • Summarize findings

Summary

The PR correctly identifies and fixes a race/deadlock in the HA failover path: when a fresh raft leader starts with an empty store, initFromP2PWithRetry would block waiting for a peer to serve height 1 — but no peer can do that until the leader produces its first block. The StartForPublishing mode skips this wait, letting the syncer start lazily on the first WriteToStoreAndBroadcast. Accompanying fixes (ticker leaks, proper peer multiaddrs in E2E, defer in loop) are all solid. The overall approach is clean and well-motivated.

A few items worth addressing before merge:


Issues

1. Mutex held during startFn can block shutdown (pkg/sync/syncer_status.go:27)

startOnce holds mu for the entire duration of startFn(). Because stopIfStarted also acquires mu, a goroutine racing to shut down while the syncer is slow-starting will block for the full startup duration. In normal operation this is probably fine, but under context cancellation it could delay shutdown noticeably. A common pattern is to release the lock, run the fn, then re-acquire to commit the flag — with a "in-progress" sentinel to prevent concurrent starts.

That said, the current approach is correct and simple. If startup is expected to be fast in practice, this is acceptable as-is, but worth a comment acknowledging the design choice.

2. stopIfStarted does not reset started on error (pkg/sync/syncer_status.go:35)

func (syncerStatus *SyncerStatus) stopIfStarted(stopFn func() error) error {
    ...
    if err := stopFn(); err != nil {
        return err   // ← started is NOT cleared
    }
    syncerStatus.started = false
    return nil
}

If stopFn fails, started remains true, making subsequent startOnce calls no-ops. This could leave the component in a permanently broken state if Stop returns a transient error. Either unconditionally clear started after calling stopFn, or document that a stop failure is considered fatal. Fix this →

3. require inside goroutine (pkg/sync/syncer_status_test.go:437)

go func() {
    defer wg.Done()
    _, err := status.startOnce(func() error { ... })
    require.NoError(t, err)  // ← calls t.FailNow() from non-test goroutine
}()

require.NoError calls t.FailNow() which is documented as unsafe when called from a goroutine that isn't the test's main goroutine (it panics instead of failing cleanly). Use assert.NoError(t, err) or collect errors into a channel and check after wg.Wait(). Fix this →

4. Missing error case in TestSyncerStatusStopIfStarted (pkg/sync/syncer_status_test.go:484)

The test has a wantErr field but both cases set it to false. The behavior when stopFn returns an error is not exercised at all, so the fix noted in item 2 above would have zero test coverage. A "stop_error_does_not_clear_started" case would complete the contract. Fix this →

5. Subscriber started before syncer in StartForPublishing

Start has an explicit comment: "we must start the subscriber after the syncer is initialized … to ensure p2p syncing works correctly." StartForPublishing intentionally inverts this ordering (subscriber up, syncer deferred to WriteToStoreAndBroadcast). This is correct for the publisher-mode use case, but a comment explaining why the ordering is safe here (no peers to sync from yet; gossip is outbound-only until first block is produced) would prevent future readers from silently re-introducing the "wrong" ordering. Fix this →


Observations (non-blocking)

  • Good catch on time.After leak (sync_service.go): Replacing time.After(backoff) with a persistent retryTimer eliminates a timer goroutine leak per retry iteration.

  • Good catch on time.Tick leak (failover_e2e_test.go): time.Tick leaks a ticker; time.NewTicker + defer Stop() is the correct pattern.

  • defer resp.Body.Close() in loop fixed: Moving this to an immediate _ = resp.Body.Close() prevents the deferred closes from piling up until the outer function returns.

  • Peer multiaddr construction (test/e2e/failover_e2e_test.go): Using mustNodeP2PMultiAddr to build /ip4/.../tcp/PORT/p2p/PEERID addresses is a meaningful correctness fix — without the peer ID component, libp2p connection attempts may fail or be non-deterministic.

  • shouldStartSyncInPublisherMode is a point-in-time snapshot: Raft leadership can be lost between this check and actual block production. The code degrades gracefully in that case (the node would stall waiting for the new leader to produce) but a comment noting this is a best-effort check might help future readers.

  • EV_E2E_LOG_DIR env var: Nice addition for CI debugging — log output can now be redirected without changing test binaries.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 54.65116% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.41%. Comparing base (4a70e0b) to head (a2b0ff7).

Files with missing lines Patch % Lines
pkg/sync/sync_service.go 47.36% 13 Missing and 7 partials ⚠️
node/failover.go 37.03% 11 Missing and 6 partials ⚠️
pkg/sync/syncer_status.go 90.47% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3222      +/-   ##
==========================================
- Coverage   61.43%   61.41%   -0.02%     
==========================================
  Files         120      120              
  Lines       12504    12563      +59     
==========================================
+ Hits         7682     7716      +34     
- Misses       3960     3977      +17     
- Partials      862      870       +8     
Flag Coverage Δ
combined 61.41% <54.65%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant