fix(metrics): connection counter double-decrement on monoio path #72
Conversation
handle_connection_sharded_monoio called record_connection_closed() at its exit, while conn_accept.rs (the caller) ALSO calls it in the non-migration branch (line 627). The AtomicU64 counter wrapped from 0 to u64::MAX on the second fetch_sub, causing every subsequent try_accept_connection to reject against maxclients.

Symptom: the first connection succeeds, all subsequent connections are rejected with "maxclients reached", even though CONFIG GET maxclients returns 10000 and that limit was never actually reached.

Repro (before fix):

./target/release/moon --port 6400 --shards 1 --appendonly no &
redis-cli -p 6400 SET foo bar   # ✓ OK
redis-cli -p 6400 SET foo bar   # ✗ Connection reset by peer

Fix: remove the handler-level decrement. The comment at line 84 already documents that the caller owns the increment via try_accept_connection; by symmetry the caller owns the decrement (conn_accept.rs:547 for TLS, conn_accept.rs:627 for the plain TCP non-migrated path). Migration-path counter accounting is a separate concern (already imbalanced) and is not addressed here.

Verified on aarch64 Linux (OrbStack moon-dev):
- 10 sequential SETs all succeed
- INFO clients reports connected_clients:1 (just the probe)
- redis-benchmark SET p=16 c=50 n=10000 → 1.25M req/s (real number)

Blocks: PR #71 perf recovery — cannot measure real throughput without this fix. Once merged, PR #71 can be validated with bench-compare.sh.
Review Summary by Qodo
Fix connection counter double-decrement on monoio path
Walkthrough

Description
• Remove duplicate record_connection_closed() call in monoio handler
• Fixes counter wraparound causing all connections rejected after first
• Restores symmetry: caller owns both increment and decrement
• Verified: sequential SETs succeed, real throughput restored (1.25M req/s)

Diagram
flowchart LR
A["try_accept_connection<br/>increments counter"] -->|caller owns| B["conn_accept.rs<br/>decrements on close"]
C["handle_connection_sharded_monoio<br/>REMOVED duplicate decrement"] -.->|was breaking| D["AtomicU64 wraps<br/>to u64::MAX"]
D -->|causes| E["maxclients check<br/>always rejects"]
B -->|fix restores| F["Counter stays valid<br/>connections accepted"]
File Changes
1. src/server/conn/handler_monoio.rs
📝 Walkthrough

A bug fix addressing connection counter underflow: removed the double-decrement of the shared connection counter.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Accept as conn_accept.rs (caller)
participant Handler as handler_monoio.rs
participant Target as spawn_migrated_monoio_connection
participant Metrics as record_connection_closed()
Client->>Accept: connect / try_accept_connection
Accept->>Handler: spawn handler (monoio)
alt non-migration path
Handler-->>Accept: handler completes
Accept->>Metrics: record_connection_closed() %% caller decrements on normal close
else migration path
Handler-->>Target: migrate connection (spawn on target shard)
Target->>Handler: run migrated handler future
Target-->>Metrics: record_connection_closed() %% target decrements after migrated handler finishes
end
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Code Review by Qodo
// NOTE: connection close is recorded by the caller (conn_accept.rs) to
// preserve symmetry with `try_accept_connection`, which owns the
// increment. Decrementing here too produces a double-decrement on the
// AtomicU64 counter — it wraps to u64::MAX on the second subtraction
// and all subsequent `try_accept_connection` comparisons against
// `maxclients` reject new connections.
(MonoioHandlerResult::Done, None)
1. Migrated clients never decrement 🐞 Bug ≡ Correctness
After this PR, monoio connections that migrate to another shard will no longer call record_connection_closed() when they finally disconnect on the target shard, so CONNECTED_CLIENTS leaks upward. Over time this will incorrectly trip try_accept_connection(maxclients) and reject new connections even when no clients are actually connected.
Agent Prompt
## Issue description
`handle_connection_sharded_monoio()` no longer calls `record_connection_closed()`. For migrated monoio connections, the source shard correctly skips decrementing (connection stays alive), but the target shard’s migrated handler spawn also does not decrement after the handler finishes. This leaks `CONNECTED_CLIENTS`, eventually causing erroneous `maxclients` rejections.
## Issue Context
- Source shard: increments via `try_accept_connection(maxclients)`.
- Source shard: skips decrement when `_migrated == true`.
- Target shard: `spawn_migrated_monoio_connection()` spawns `handle_connection_sharded_monoio(..., Some(&state))` but does not call `record_connection_closed()` afterward.
- After this PR, there is no remaining place that decrements for migrated connections on final disconnect.
## Fix Focus Areas
- src/shard/conn_accept.rs[640-778]
- src/server/conn/handler_monoio.rs[2289-2337]
### Suggested implementation direction
Add `crate::admin::metrics_setup::record_connection_closed();` after awaiting `handle_connection_sharded_monoio()` in the `monoio::spawn` block inside `spawn_migrated_monoio_connection()`.
Optionally, update the comment in `handler_monoio.rs` to clarify that the caller owns close accounting for fresh accepts, but migrated-connection spawns must still decrement on final close.
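A hedged sketch of that direction (the surrounding function shape, the `ShardState` type, and the exact handler signature are assumptions; only the placement of `record_connection_closed()` after the awaited handler reflects the suggestion):

```rust
// Sketch only: the real spawn_migrated_monoio_connection() does more setup
// (fd hand-off, error handling) than shown here.
fn spawn_migrated_monoio_connection(stream: monoio::net::TcpStream, state: ShardState) {
    monoio::spawn(async move {
        // Run the migrated connection to completion on the target shard.
        let _ = handle_connection_sharded_monoio(stream, Some(&state)).await;
        // Balance the increment taken by the source shard's
        // try_accept_connection(): decrement exactly once, after the
        // migrated connection finally closes.
        crate::admin::metrics_setup::record_connection_closed();
    });
}
```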
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/server/conn/handler_monoio.rs`:
- Around line 2330-2335: In the migration/resumption branch where
handle_connection_sharded_monoio(...).await is invoked with can_migrate: false,
add a call to record_connection_closed() immediately after the await completes
so the connection-count decrement mirrors the plain-TCP path (see the existing
pattern around the plain TCP handler). This prevents leaked connected_clients
for resumed/migrated connections by ensuring the counter is decremented when the
handler returns Done.
📒 Files selected for processing (2): CHANGELOG.md, src/server/conn/handler_monoio.rs
The initial inline SET used Bytes::copy_from_slice for key and value,
which triggers malloc+memcpy twice per SET. The Frame-based path
achieves zero-copy via Bytes::slice on a frozen BytesMut, which is
just an Arc refcount bump. This disparity caused a ~4-7% SET p=1
regression because the inline-path savings were outweighed by the
new allocations at low pipeline depth.
Fix: call read_buf.split_to(consumed).freeze() once, then slice() the
frozen Bytes for key, value, and AOF. All three are now Arc refcount
bumps over the same underlying allocation — zero malloc, zero memcpy.
Measured impact on aarch64 Linux (OrbStack, 1 shard, 50 clients):
before nocopy after nocopy delta
SET p=16: 2.94M rps 3.11M rps +5.4% (peak 3.60M)
SET p=1: 237K rps 235K rps -1% (within noise)
GET p=16: 3.32M rps 3.33M rps +0.3%
PR #71 totals vs origin/main (both with PR #72 maxclients fix):
SET p=16: 2.43M → 3.11M = +28% (peak +48%)
SET p=1: 241K → 235K = -2.5% (was -4%, noise-level now)
GET p=16: 3.36M → 3.33M = ±0%
All 11 inline-dispatch tests pass unchanged.
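As a side note on the pattern above, here is a minimal, self-contained sketch of freeze-then-slice with the `bytes` crate; the function name and the offsets are illustrative, not taken from the codebase:

```rust
use bytes::{Bytes, BytesMut};
use std::ops::Range;

/// Split one consumed inline command out of the read buffer and return
/// zero-copy key/value slices over the same allocation.
fn split_zero_copy(read_buf: &mut BytesMut, consumed: usize,
                   key: Range<usize>, value: Range<usize>) -> (Bytes, Bytes) {
    // One split + freeze for the whole consumed command: turns the buffer
    // prefix into an immutable, reference-counted Bytes without copying.
    let frame: Bytes = read_buf.split_to(consumed).freeze();
    // Each slice() is an Arc refcount bump over that allocation:
    // no malloc, no memcpy. The same frame can also back the AOF record.
    (frame.slice(key), frame.slice(value))
}

fn main() {
    let mut read_buf = BytesMut::from(&b"SET foo bar\r\nGET foo\r\n"[..]);
    let (key, value) = split_zero_copy(&mut read_buf, 13, 4..7, 8..11);
    assert_eq!((&key[..], &value[..]), (&b"foo"[..], &b"bar"[..]));
    // read_buf now starts at the next command ("GET foo\r\n").
}
```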
When a connection migrates to another shard, the source wrapper skips the decrement (connection stays alive) and the target shard's spawn_migrated_monoio_connection() now owns the balancing decrement on final close. Without this, every migration leaked CONNECTED_CLIENTS upward and eventually tripped maxclients on long-running clusters. Addresses review feedback from qodo-code-review and coderabbit on PR #72 — both bots flagged this exact leak.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/shard/conn_accept.rs`:
- Around line 768-773: The migrated-connection decrement is only done after
handle_connection_sharded_monoio() returns, so if target-side setup fails
earlier the source-side skipped decrement (because _migrated == true) and
CONNECTED_CLIENTS leaks; fix by ensuring record_connection_closed() (or the
CONNECTED_CLIENTS decrement) is invoked on all early error/return paths when
_migrated == true — e.g., add a short-lived guard in the conn_accept setup code
(or explicit calls) so that any early return before
handle_connection_sharded_monoio() will call
crate::admin::metrics_setup::record_connection_closed(); keep transfer semantics
so that when ownership is successfully transferred the guard is disarmed and no
double-decrement occurs.
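One possible shape for that guard, sketched with invented names (`ConnCloseGuard` is not from the codebase; the only project call used is `record_connection_closed()`):

```rust
/// RAII guard: decrements the connection counter on drop unless close
/// accounting has been handed off to the migrated handler's happy path.
struct ConnCloseGuard {
    armed: bool,
}

impl ConnCloseGuard {
    fn new() -> Self {
        Self { armed: true }
    }

    /// Call once the handler future owns the post-await decrement.
    fn disarm(mut self) {
        self.armed = false; // dropping a disarmed guard is a no-op
    }
}

impl Drop for ConnCloseGuard {
    fn drop(&mut self) {
        if self.armed {
            crate::admin::metrics_setup::record_connection_closed();
        }
    }
}
```

With this shape, the guard would be created before the fallible setup steps in `spawn_migrated_monoio_connection()` and disarmed just before the handler is awaited, so every early return decrements exactly once and the happy path never double-decrements.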
coderabbit flagged two leak paths in spawn_migrated_monoio_connection: 1. set_nonblocking(true) failure on the reconstructed std stream 2. monoio::net::TcpStream::from_std(...) Err branch Both return without running the handler. The source shard already skipped its wrapper decrement on successful hand-off (_migrated == true), so the counter was +1 with no balancing decrement. Add record_connection_closed() on each early-error return. The happy path's decrement after handler.await still works unchanged.
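A hedged sketch of those two early-error returns (stream names, the `eprintln!` logging, and the surrounding control flow are illustrative; the `set_nonblocking` and `from_std` calls are the ones named above):

```rust
fn spawn_migrated_monoio_connection(std_stream: std::net::TcpStream /*, ... */) {
    // The source shard skipped its decrement because the hand-off succeeded,
    // so every exit below must balance the increment exactly once.
    if let Err(e) = std_stream.set_nonblocking(true) {
        crate::admin::metrics_setup::record_connection_closed();
        eprintln!("migrated connection setup failed (set_nonblocking): {e}");
        return;
    }
    let stream = match monoio::net::TcpStream::from_std(std_stream) {
        Ok(s) => s,
        Err(e) => {
            crate::admin::metrics_setup::record_connection_closed();
            eprintln!("migrated connection setup failed (from_std): {e}");
            return;
        }
    };
    // Happy path: spawn the handler; the decrement after its .await is unchanged.
    let _ = stream;
}
```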
Review feedback addressed

qodo-code-review (1 bug): ✅ Fixed in 4b5e1e9
coderabbitai (2 leak paths on migration errors): ✅ Fixed in 320b780

All three counter-leak exit paths are now balanced. The source shard owns the increment via `try_accept_connection`.
CI flake — re-running
Evidence this is not our change: this PR only touches
Summary
`handle_connection_sharded_monoio` and `conn_accept.rs` both call `record_connection_closed()` on disconnect, producing a double `fetch_sub` on the `AtomicU64` counter. On the first disconnect the counter wraps 0 → u64::MAX, which exceeds `maxclients` on every subsequent `try_accept_connection` — all new connections are rejected until server restart.

Reproduction (before fix)
Server log floods with:
Even though `CONFIG GET maxclients` returns `10000`.

Root cause
`src/server/conn/handler_monoio.rs:2330` calls `record_connection_closed()` at the handler's exit, while `src/shard/conn_accept.rs:624-628` also decrements (correctly guarded for the non-migration case). The comment at `handler_monoio.rs:84` already documents the intended ownership: the caller owns the increment via `try_accept_connection`. By symmetry, the caller should also own the decrement. Removing the handler-level call restores that symmetry.
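To make the wraparound concrete, here is a standalone model of the accounting described above (the static name and the simplified admission check are modeled on this description, not copied from the codebase):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static CONNECTED_CLIENTS: AtomicU64 = AtomicU64::new(0);

fn record_connection_closed() {
    // Atomic fetch_sub wraps: 0 - 1 == u64::MAX in modular arithmetic.
    CONNECTED_CLIENTS.fetch_sub(1, Ordering::Relaxed);
}

fn try_accept_connection(maxclients: u64) -> bool {
    // After the wraparound, the count exceeds any sane maxclients,
    // so every new connection is rejected.
    CONNECTED_CLIENTS.load(Ordering::Relaxed) < maxclients
}

fn main() {
    CONNECTED_CLIENTS.fetch_add(1, Ordering::Relaxed); // accept: counter = 1
    record_connection_closed(); // caller's decrement: counter = 0
    record_connection_closed(); // handler's duplicate decrement: wraps to u64::MAX
    assert!(!try_accept_connection(10_000)); // all further accepts rejected
}
```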
Fix
Single-line removal: the `record_connection_closed()` call at `handler_monoio.rs:2330`.

Verification (aarch64 Linux / OrbStack moon-dev)
Before the fix, the bench produced `SET: rps=0.0` + `Error: Server closed the connection`.

Scope
Fixes only the monoio path. The Tokio TLS path (`conn_accept.rs:240` + `handler_sharded.rs:176`) has an analogous double-decrement — follow-up PR. The migration path's counter accounting is imbalanced in the opposite direction and is also a separate fix.

Blocks
PR #71 (perf recovery) — any throughput claim on monoio aarch64 needs this fix to produce real benchmark numbers.
Test plan
- `cargo clippy -- -D warnings` clean
- `scripts/audit-unsafe.sh` passes (178/178 SAFETY comments)
- `redis-benchmark SET p=16 c=50 n=10000` returns real throughput (1.25M rps on OrbStack aarch64)
- `INFO clients` shows correct `connected_clients` (no wraparound)

Summary by CodeRabbit
Bug Fixes: connection counting no longer underflows after the first client disconnect, so new connections are no longer rejected against `--maxclients`.