
feat(pkg/p2p): reconnect on disconnected peers#3212

Open
julienrbrt wants to merge 8 commits into main from julien/p2p

Conversation


@julienrbrt (Member) commented Mar 30, 2026

Attempt to reconnect to peers when there are connectivity issues.
PR made with AI assistance. Still testing.

Summary by CodeRabbit

  • Bug Fixes
    • Improved P2P resilience: automatic reconnection for transient network failures, periodic peer discovery refreshes, and more robust connection maintenance for stable networking.
  • Documentation
    • Added an unreleased changelog entry noting the P2P connection improvements.
  • Chores
    • Re-enabled local module mappings to support building and testing against local source trees.


coderabbitai bot commented Mar 30, 2026

📝 Walkthrough


Adds seed-peer reconnect and periodic discovery to the P2P client, re-enables local Go module replace directives in two apps, and changes retry-loop syntax in AWS/GCP KMS signers.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Changelog: `CHANGELOG.md` | Added an unreleased entry: "Improve P2P connection in case of transient network failure" referencing PR #3212. |
| Go module replaces: `apps/evm/go.mod`, `apps/testapp/go.mod` | Uncommented local replace directives so `github.com/evstack/ev-node` (and subpath) resolve to local `../../` paths for development builds. |
| P2P connection resilience: `pkg/p2p/client.go` | Added seed peer tracking, disconnect notifee, reconnect channel, connection semaphore, maintenance goroutine with per-peer exponential backoff reconnects, and periodic peer-discovery/refresh logic. |
| KMS signer loops: `pkg/signer/aws/signer.go`, `pkg/signer/gcp/signer.go` | Switched retry loops from indexed `for attempt := 0; ...` to range-based iteration over `maxAttempts` without changing retry/backoff behavior. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as P2P Client
    participant Host as libp2p Host
    participant DHT as DHT/Discovery
    participant Notifee as Disconnect Notifier
    participant Maintenance as Maintenance Loop

    Client->>Host: startWithHost()
    Client->>Notifee: register disconnect handler
    Client->>Maintenance: start maintenance goroutine

    Note over Host,Notifee: network event
    Host-->>Notifee: peer disconnected
    Notifee->>Maintenance: send peerID on reconnectCh

    Maintenance->>Maintenance: apply per-peer backoff
    Maintenance->>Host: tryConnect(peerID) (limited by connectSem)
    Host-->>Maintenance: connection result

    Maintenance->>DHT: periodic refreshPeerDiscovery()
    DHT->>Host: re-advertise & FindPeers()
    DHT-->>Maintenance: peer list
    Maintenance->>Host: tryConnect(discovered peers)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • tac0turtle
  • tzdybal
  • tuxcanfly

Poem

🐰 In tunnels deep where packets roam,
I hop and stitch each falling home.
With gentle hops and patient tries,
I bind the peers beneath the skies.
Reconnected — rabbit's tiny prize.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description is vague and incomplete, lacking detailed context, rationale, and testing status. It does not follow the required template structure with an Overview section. | Expand the description with a detailed overview of changes, rationale for the implementation, testing performed, and any relevant issue links following the template format. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: adding reconnection logic for disconnected peers in the P2P package, which aligns with the substantial modifications in pkg/p2p/client.go. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




github-actions bot commented Mar 30, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Apr 1, 2026, 6:11 AM |


claude bot commented Mar 30, 2026

Claude finished @julienrbrt's task in 1m 49s


Code Review

  • Read AGENTS.md / CLAUDE.md for repo guidelines
  • Review pkg/p2p/client.go (main change)
  • Review apps/evm/go.mod and apps/testapp/go.mod
  • Review signer changes
  • Summarize findings

Summary

This PR adds reconnection logic and periodic peer discovery refresh to the P2P client. The overall design is sound and the context-decoupling fix (using context.Background() in startConnectionMaintenance) was correctly applied in commit 2ff6cee. Two bugs remain that should be addressed before merging.


Critical Issues

1. Goroutine leak in refreshPeerDiscovery (pkg/p2p/client.go:399)

discutil.Advertise (libp2p v0.48) starts a long-running background goroutine that re-advertises in a loop until ctx.Done(). Calling it every peerDiscoveryInterval (5 min) creates a new goroutine each tick without cancelling the previous one, leaking N goroutines over time.

// refreshPeerDiscovery — called on every ticker tick:
_ = c.advertise(ctx)   // ← spawns new background goroutine every 5 minutes

Fix: Call c.advertise(ctx) once during startConnectionMaintenance setup. The ticker loop should only call FindPeers.

func (c *Client) startConnectionMaintenance() {
    ctx, cancel := context.WithCancel(context.Background())
    c.maintenanceCancel = cancel

    c.maintenanceWg.Go(func() {
        _ = c.advertise(ctx)  // ← start once here; discutil.Advertise loops internally
        // ...ticker loop only calls refreshPeerDiscovery without re-advertising
    })
}



2. Already-disconnected seed peers are never retried (pkg/p2p/client.go:317-321)

startConnectionMaintenance only handles future disconnect events from the notifee. Seed peers that failed to connect during initial startup (before the notifee was registered) are never queued into reconnectCh and will remain unreachable until the node restarts.

Fix: Before the select loop, iterate c.seedPeers and enqueue any that are not currently connected:

// Seed initial reconnects for peers already disconnected at startup
for _, sp := range c.seedPeers {
    if !c.isConnected(sp.ID) {
        select {
        case c.reconnectCh <- sp.ID:
        default:
        }
    }
}



Minor Issues

3. peerCh not drained on context cancellation (pkg/p2p/client.go:407-420)

When ctx.Done() fires during the connectSem select in refreshPeerDiscovery, the function returns early but FindPeers may still have a goroutine writing to peerCh. This goroutine will block until the channel is fully consumed or garbage collected (libp2p should eventually clean it up, but it's not guaranteed promptly).
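One common mitigation (a sketch under the assumption that the producer closes the channel when it finishes, as libp2p's `FindPeers` result channel does) is to hand the remainder of the channel to a drainer goroutine on early exit, so the producer's sends can always complete:

```go
package main

import "fmt"

// consumeUntil reads up to stopAfter peers from peerCh. On early exit it
// hands the rest of the channel to a drainer goroutine, so a producer
// blocked on an unbuffered send can always finish and close the channel.
func consumeUntil(peerCh <-chan string, stopAfter int) []string {
	var got []string
	for p := range peerCh {
		got = append(got, p)
		if len(got) == stopAfter {
			go func() {
				for range peerCh { // drain until the producer closes it
				}
			}()
			return got
		}
	}
	return got
}

func main() {
	peerCh := make(chan string) // unbuffered, like a streaming result channel
	go func() {
		defer close(peerCh)
		for _, p := range []string{"p1", "p2", "p3", "p4"} {
			peerCh <- p // would block forever without a consumer or drainer
		}
	}()
	fmt.Println(consumeUntil(peerCh, 2)) // [p1 p2]
}
```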

4. advertise() always returns nil (pkg/p2p/client.go:512-515)

discutil.Advertise returns nothing (it's fire-and-forget). The advertise() wrapper pretends to return an error but unconditionally returns nil. The _ = c.advertise(ctx) call site in refreshPeerDiscovery silently ignores this. Consider removing the return value or adding a comment clarifying this.

5. No test coverage for new reconnect logic

The codecov report shows 33.7% patch coverage with 57 lines in client.go uncovered. The following behaviors have no test coverage:

  • disconnectNotifee.Disconnected triggering reconnects
  • Exponential backoff in startConnectionMaintenance
  • refreshPeerDiscovery flow

At minimum, a test for the disconnect→reconnect happy path and the backoff rescheduling would catch the goroutine leak issue above.


go.mod Replace Directives

apps/evm/go.mod and apps/testapp/go.mod both commit relative replace directives (e.g. replace github.com/evstack/ev-node => ../../.). These only work inside the monorepo checkout and will break any consumer trying to use these modules as Go dependencies. If these modules are strictly internal/monorepo-only, this is acceptable but worth a comment. If they're intended to be versioned independently, the replace directives should be removed before tagging.
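One way to keep the same developer experience without committing relative replace directives is a workspace file at the repository root. This is a sketch: the module list and Go version are assumptions based on the layout described above.

```
go 1.23

use (
	.
	./apps/evm
	./apps/testapp
)
```

A `go.work` is ignored by consumers of the published modules, so local overrides stay local; it can also be left untracked per developer if the repo prefers not to commit it.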


Signer Changes (AWS / GCP)

The `for attempt := 0; attempt < maxAttempts; attempt++` to `for attempt := range maxAttempts` refactors are correct idiomatic Go 1.22+ style and do not change behavior.


Overall Assessment

The reconnect design is reasonable — exponential backoff with a cap, bounded concurrency via connectSem, and context-owned maintenance lifetime. The goroutine leak (#1) and missing initial seed retry (#2) should be fixed before this goes to production. Test coverage for the new paths would also increase confidence given this is networking resilience code.


codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 30.92784% with 67 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.21%. Comparing base (4a70e0b) to head (e276ea2).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/p2p/client.go | 29.47% | 65 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3212      +/-   ##
==========================================
- Coverage   61.43%   61.21%   -0.23%     
==========================================
  Files         120      120              
  Lines       12504    12599      +95     
==========================================
+ Hits         7682     7712      +30     
- Misses       3960     4024      +64     
- Partials      862      863       +1     
| Flag | Coverage Δ |
|---|---|
| combined | 61.21% <30.92%> (-0.23%) ⬇️ |

Flags with carried forward coverage won't be shown.


@julienrbrt julienrbrt marked this pull request as ready for review March 31, 2026 16:08

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
apps/testapp/go.mod (1)

5-5: Keep this replace local to development.

This ties the module to ../../. being present at build time. That works inside this checkout, but not when the submodule is built outside the monorepo layout. Prefer a local go.work or an uncommitted dev override instead. Please verify this module is not expected to build outside the repository layout; if apps/testapp is intentionally monorepo-only, feel free to ignore this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/testapp/go.mod` at line 5, The committed go.mod contains a local replace
directive ("replace github.com/evstack/ev-node => ../../.") that forces a
monorepo layout; remove this replace from the tracked go.mod and instead keep it
as an uncommitted developer-only override (or move the replacement into a
go.work used only in development), or if this module is intended to be
monorepo-only, document that restriction; update the repository by deleting the
replace line from the committed go.mod and adding a dev-only go.work or
.gitignore'd override as appropriate so builds outside the repo layout are not
broken.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/evm/go.mod`:
- Around line 5-8: The go.mod contains committed relative replace directives for
modules "github.com/evstack/ev-node" and
"github.com/evstack/ev-node/execution/evm" (the two replace (...) entries) which
will break consumers outside the monorepo; remove these relative replace lines
from apps/evm/go.mod before committing and, for local development only, move
them into a go.work file or use developer instructions to run `go work use` so
local overrides remain available but are not shipped in the module's go.mod.

In `@pkg/p2p/client.go`:
- Around line 343-355: The reconnect loop currently only receives signals from
reconnectCh (fed by Disconnected) so a failed host.Connect leaves the peer
without future triggers, and the current logic drops events that arrive during
the backoff window; update the reconnect handling (references: reconnectCh,
Disconnected, host.Connect, st.attempts, st.lastAttempt, reconnectCooldown,
maxReconnectCooldown) so that on a connect failure you re-schedule another retry
(e.g., push a retry event into reconnectCh or start a time.AfterFunc/goroutine
to re-enqueue after computed backoff) and when an event arrives inside the
backoff window do not drop it but schedule it to be processed after the
remaining backoff interval (use a timer to requeue into reconnectCh); apply the
same scheduling change to the other similar block (lines around the second
instance handling backoff) so transient outages cause repeated retries rather
than a single attempt.
- Line 199: The maintenance goroutine is started with the caller-supplied ctx so
timeouts/cancels on Start() prematurely stop reconnection/discovery; change this
so maintenance uses its own background context owned by the client and only
stops when Close() is called. Create a maintenance context via
context.WithCancel(context.Background()) (store the cancel func on the client
struct), pass that maintenance context into startConnectionMaintenance (and the
other maintenance/startDiscovery loops currently using the Start() ctx around
lines 317-320), and invoke the stored cancel in Close(); do not use the Start()
ctx for these long-lived loops.

---

Nitpick comments:
In `@apps/testapp/go.mod`:
- Line 5: The committed go.mod contains a local replace directive ("replace
github.com/evstack/ev-node => ../../.") that forces a monorepo layout; remove
this replace from the tracked go.mod and instead keep it as an uncommitted
developer-only override (or move the replacement into a go.work used only in
development), or if this module is intended to be monorepo-only, document that
restriction; update the repository by deleting the replace line from the
committed go.mod and adding a dev-only go.work or .gitignore'd override as
appropriate so builds outside the repo layout are not broken.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1c692055-7514-4e34-ad88-0d6d76c89f7d

📥 Commits

Reviewing files that changed from the base of the PR and between 4a70e0b and 32dcf35.

⛔ Files ignored due to path filters (2)
  • apps/evm/go.sum is excluded by !**/*.sum
  • apps/testapp/go.sum is excluded by !**/*.sum
📒 Files selected for processing (6)
  • CHANGELOG.md
  • apps/evm/go.mod
  • apps/testapp/go.mod
  • pkg/p2p/client.go
  • pkg/signer/aws/signer.go
  • pkg/signer/gcp/signer.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
pkg/p2p/client.go (1)

351-352: ⚠️ Potential issue | 🟠 Major

Backoff still turns a transient seed outage into a single retry.

At Line 351 the event is dropped if the peer is still inside its backoff window, and at Line 367 a failed host.Connect only logs. After that first failed dial there is no future trigger for that seed peer, so it can remain disconnected indefinitely.

Also applies to: 367-369

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 351 - 352, The code currently skips
processing a seed peer if time.Now().Before(st.lastAttempt.Add(backoff)) and
only logs on host.Connect failure, which means a transient outage may never be
retried; change this so that when a peer is within backoff you requeue/schedule
it for a future attempt instead of dropping the event (use a timer or push it
back onto the attempt queue and keep state in st), and when host.Connect(peer)
fails update st.lastAttempt and also schedule a follow-up retry (rather than
only logging) so the peer will be retried after backoff; make these changes
around the references to st.lastAttempt.Add(backoff) and the host.Connect call
to ensure failed dials trigger future retries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/p2p/client.go`:
- Around line 317-321: startConnectionMaintenance currently only handles future
Disconnected events so seed peers that are already disconnected never get
enqueued; before the maintenance select loop begins (inside the maintenanceWg.Go
started by startConnectionMaintenance), iterate c.seedPeers and for each peer
that is not currently connected (use existing connection state check used
elsewhere in the file), send that peer into c.reconnectCh once so they will be
retried; ensure sends won't block the worker (use a non-blocking select with
default or spawn a short goroutine per send) and rely on
refreshPeerDiscovery/c.getNamespace as before for discovery updates.
- Around line 388-389: The current loop repeatedly calls c.advertise(ctx) which
spawns a background re-advertisement goroutine each time (discutil.Advertise
runs until ctx.Done()), causing a goroutine leak; fix by invoking
c.advertise(ctx) exactly once from the maintenance context (e.g., start it when
maintenanceCtx is created) and change the ticker loop to only call peer
discovery (e.g., c.FindPeers or the method that refreshes discovery) on each
tick, or alternatively modify c.advertise to provide a single-shot advertise API
and call that from the ticker; locate and update the code paths referencing
c.advertise(ctx) and the ticker loop so no repeated long-lived advertise
goroutines are started.

---

Duplicate comments:
In `@pkg/p2p/client.go`:
- Around line 351-352: The code currently skips processing a seed peer if
time.Now().Before(st.lastAttempt.Add(backoff)) and only logs on host.Connect
failure, which means a transient outage may never be retried; change this so
that when a peer is within backoff you requeue/schedule it for a future attempt
instead of dropping the event (use a timer or push it back onto the attempt
queue and keep state in st), and when host.Connect(peer) fails update
st.lastAttempt and also schedule a follow-up retry (rather than only logging) so
the peer will be retried after backoff; make these changes around the references
to st.lastAttempt.Add(backoff) and the host.Connect call to ensure failed dials
trigger future retries.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d4f0858-895a-4ea8-aa2e-46955a48c244

📥 Commits

Reviewing files that changed from the base of the PR and between 32dcf35 and 2ff6cee.

📒 Files selected for processing (1)
  • pkg/p2p/client.go

Comment on lines +317 to +321
func (c *Client) startConnectionMaintenance() {
	ctx, cancel := context.WithCancel(context.Background())
	c.maintenanceCancel = cancel

	c.maintenanceWg.Go(func() {

⚠️ Potential issue | 🟠 Major

Queue currently disconnected seed peers when maintenance starts.

This worker only reacts to future Disconnected events. A seed peer that never connected during startup, or dropped before the notifee was registered, never enters reconnectCh, and refreshPeerDiscovery only searches c.getNamespace(). Seed any currently disconnected c.seedPeers once before the select loop begins.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 317 - 321, startConnectionMaintenance
currently only handles future Disconnected events so seed peers that are already
disconnected never get enqueued; before the maintenance select loop begins
(inside the maintenanceWg.Go started by startConnectionMaintenance), iterate
c.seedPeers and for each peer that is not currently connected (use existing
connection state check used elsewhere in the file), send that peer into
c.reconnectCh once so they will be retried; ensure sends won't block the worker
(use a non-blocking select with default or spawn a short goroutine per send) and
rely on refreshPeerDiscovery/c.getNamespace as before for discovery updates.

Comment on lines +388 to +389
_ = c.advertise(ctx)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For github.com/libp2p/go-libp2p v0.48.0, what does p2p/discovery/util.Advertise do? Does it start a background re-advertisement loop tied to ctx.Done(), or is it a single synchronous advertise call?

💡 Result:

In github.com/libp2p/go-libp2p/p2p/discovery/util v0.48.0, util.Advertise(ctx, a, ns, ...) starts a background goroutine that re-advertises in a loop until ctx.Done() is closed—it is not a single synchronous advertise call. (raw.githubusercontent.com)



Fix goroutine leak in advertise refresh loop.

Calling c.advertise(ctx) every 5 minutes at line 388 spawns a new background goroutine with each call. In libp2p v0.48.0, discutil.Advertise starts a re-advertisement loop that runs until ctx.Done() — it is not a single synchronous call. Stacking these goroutines will leak resources for the client's lifetime.

Start c.advertise(ctx) once from the maintenance context and let the ticker only refresh peer discovery via FindPeers, or switch to a single-shot advertise call that does not spawn a background loop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 388 - 389, The current loop repeatedly calls
c.advertise(ctx) which spawns a background re-advertisement goroutine each time
(discutil.Advertise runs until ctx.Done()), causing a goroutine leak; fix by
invoking c.advertise(ctx) exactly once from the maintenance context (e.g., start
it when maintenanceCtx is created) and change the ticker loop to only call peer
discovery (e.g., c.FindPeers or the method that refreshes discovery) on each
tick, or alternatively modify c.advertise to provide a single-shot advertise API
and call that from the ticker; locate and update the code paths referencing
c.advertise(ctx) and the ticker loop so no repeated long-lived advertise
goroutines are started.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
pkg/p2p/client.go (2)

317-320: ⚠️ Potential issue | 🟠 Major

Still unresolved: keep only one advertise worker alive.

refreshPeerDiscovery() runs on a ticker, but advertise() already starts a long-lived re-advertisement worker. Calling it again here stacks extra advertisers for the lifetime of the client. Start one advertiser from the maintenance context and let the ticker only refresh discovery. If maintenance owns the long-lived advertiser, remove the startup peerDiscovery() call too so there is still only one loop total.

For the go-libp2p version used by this repository, does `p2p/discovery/util.Advertise` start a background re-advertisement loop until `ctx.Done()`, or is it a one-shot advertise call?

Based on learnings: Applies to **/*.go : Be mindful of goroutine leaks in Go code.

Also applies to: 399-400

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 317 - 320, startConnectionMaintenance
currently starts a maintenance context but also leads to multiple concurrent
advertisers because refreshPeerDiscovery calls advertise repeatedly while
advertise itself spawns a long-lived re-advertisement worker; to fix, ensure
only one long-lived advertise worker is created from startConnectionMaintenance
using the maintenance ctx (call advertise(ctx) once there), remove any other
startup call to peerDiscovery() so you don't spawn a separate loop, and change
refreshPeerDiscovery to only refresh discovery state (not call advertise) so it
no longer creates additional advertisers; reference functions:
startConnectionMaintenance, refreshPeerDiscovery, advertise, peerDiscovery and
the maintenanceCancel/maintenance context.

321-324: ⚠️ Potential issue | 🟠 Major

Still unresolved: seed the reconnect queue with already-disconnected seed peers.

This worker still only reacts to future Disconnected callbacks. A configured seed peer that never connected during startup, or dropped before Notify() was registered, never reaches reconnectCh, so the reconnect path never runs for it.

🔧 Minimal fix
 	c.maintenanceWg.Go(func() {
+		for _, sp := range c.seedPeers {
+			if c.isConnected(sp.ID) {
+				continue
+			}
+			select {
+			case c.reconnectCh <- sp.ID:
+			default:
+			}
+		}
+
 		discoveryTicker := time.NewTicker(peerDiscoveryInterval)
 		defer discoveryTicker.Stop()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 321 - 324, The worker registered in
c.maintenanceWg.Go only enqueues peers on future Disconnected callbacks, so
configured seed peers that were never connected or disconnected before Notify
was registered are never added to c.reconnectCh; fix by seeding the reconnect
queue at startup: after creating discoveryTicker in the maintenance goroutine
(inside the same block started by maintenanceWg.Go), iterate configured seed
peers (the seeds list used by the peer discovery logic), check each peer's
current connectedness via the host/peerstore/Network API (e.g.,
Host.Network().Connectedness / Peerstore.Connectedness), and send any peers that
are not currently connected into c.reconnectCh (respecting non-blocking
semantics or select with ctx.Done() to avoid blocking). Ensure this happens once
before the ticker loop so the existing reconnect worker will process
already-disconnected seed peers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/p2p/client.go`:
- Around line 206-209: Close() currently cancels maintenanceCancel and waits
only for maintenanceWg but discovery and dial goroutines (spawned by the
maintenance worker) are not time-boxed or tracked, so add a child context with a
short timeout for the periodic discovery/refresh and for each dial/Connect call
(derive via maintenanceCtx := context.WithTimeout(maintenanceCancelCtx,
shortDuration)), ensure every spawned dial worker is registered with the same
maintenanceWg (or use an errgroup bound to maintenanceCtx) so they are waited
on, pass the child ctx into the dial/Connect/DHT calls so they can be cancelled,
and guarantee connectSem is released in a defer inside each worker even on
timeout/cancellation; also ensure maintenanceCancel cancels the parent ctx so
all child timeouts abort during Close().

---

Duplicate comments:
In `@pkg/p2p/client.go`:
- Around line 317-320: startConnectionMaintenance currently starts a maintenance
context but also leads to multiple concurrent advertisers because
refreshPeerDiscovery calls advertise repeatedly while advertise itself spawns a
long-lived re-advertisement worker; to fix, ensure only one long-lived advertise
worker is created from startConnectionMaintenance using the maintenance ctx
(call advertise(ctx) once there), remove any other startup call to
peerDiscovery() so you don't spawn a separate loop, and change
refreshPeerDiscovery to only refresh discovery state (not call advertise) so it
no longer creates additional advertisers; reference functions:
startConnectionMaintenance, refreshPeerDiscovery, advertise, peerDiscovery and
the maintenanceCancel/maintenance context.
- Around line 321-324: The worker registered in c.maintenanceWg.Go only enqueues
peers on future Disconnected callbacks, so configured seed peers that were never
connected or disconnected before Notify was registered are never added to
c.reconnectCh; fix by seeding the reconnect queue at startup: after creating
discoveryTicker in the maintenance goroutine (inside the same block started by
maintenanceWg.Go), iterate configured seed peers (the seeds list used by the
peer discovery logic), check each peer's current connectedness via the
host/peerstore/Network API (e.g., Host.Network().Connectedness /
Peerstore.Connectedness), and send any peers that are not currently connected
into c.reconnectCh (respecting non-blocking semantics or select with ctx.Done()
to avoid blocking). Ensure this happens once before the ticker loop so the
existing reconnect worker will process already-disconnected seed peers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ea60430-76a6-4442-9628-2a12b7b139d4

📥 Commits

Reviewing files that changed from the base of the PR and between 2ff6cee and e276ea2.

📒 Files selected for processing (1)
  • pkg/p2p/client.go

Comment on lines +206 to +209
if c.maintenanceCancel != nil {
c.maintenanceCancel()
}
c.maintenanceWg.Wait()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Track and time-box the maintenance dials.

Close() only waits for the dispatcher goroutine. Lines 373-381 and 413-416 spawn dial workers outside maintenanceWg, while Line 385 runs discovery inline on the only maintenance worker. A slow discovery/dial can block reconnect handling, pin connectSem, or still be running when teardown closes the host and DHT. Give the refresh and each dial a short child timeout, and track the spawned workers in the same wait group/errgroup.

As per coding guidelines "Use context.Context for cancellation in Go" and "Be mindful of goroutine leaks in Go code".

Also applies to: 373-381, 384-385, 401-419

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 206 - 209, Close() currently cancels
maintenanceCancel and waits only for maintenanceWg but discovery and dial
goroutines (spawned by the maintenance worker) are not time-boxed or tracked, so
add a child context with a short timeout for the periodic discovery/refresh and
for each dial/Connect call (derive via maintenanceCtx :=
context.WithTimeout(maintenanceCancelCtx, shortDuration)), ensure every spawned
dial worker is registered with the same maintenanceWg (or use an errgroup bound
to maintenanceCtx) so they are waited on, pass the child ctx into the
dial/Connect/DHT calls so they can be cancelled, and guarantee connectSem is
released in a defer inside each worker even on timeout/cancellation; also ensure
maintenanceCancel cancels the parent ctx so all child timeouts abort during
Close().

