
fix(metrics): connection counter double-decrement on monoio path#72

Merged
pilotspacex-byte merged 3 commits into main from fix/maxclients-double-decrement on Apr 12, 2026

Conversation

@pilotspacex-byte (Contributor) commented Apr 12, 2026

Summary

handle_connection_sharded_monoio and conn_accept.rs both call record_connection_closed() on disconnect, producing a double fetch_sub on the AtomicU64 counter. On the first disconnect the counter wraps 0 → u64::MAX, which exceeds maxclients on every subsequent try_accept_connection — all new connections are rejected until server restart.
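The wraparound is easy to demonstrate in isolation, since AtomicU64 arithmetic wraps by definition. A minimal self-contained sketch (the counter and limit are stand-ins here, not the crate's actual statics):

use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    let connected = AtomicU64::new(0);
    let maxclients: u64 = 10_000;

    connected.fetch_add(1, Ordering::Relaxed); // accept: 0 -> 1

    // One disconnect, two call sites, two decrements:
    connected.fetch_sub(1, Ordering::Relaxed); // 1 -> 0
    connected.fetch_sub(1, Ordering::Relaxed); // 0 -> u64::MAX (wraps)

    // Every later accept compares u64::MAX against maxclients and rejects.
    assert!(connected.load(Ordering::Relaxed) >= maxclients);
}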

Reproduction (before fix)

./target/release/moon --port 6400 --shards 1 --appendonly no --protected-mode no &
redis-cli -p 6400 SET foo bar   # ✓ OK  (counter: 1 → 0 → u64::MAX)
redis-cli -p 6400 SET foo bar   # ✗ Error: Connection reset by peer

Server log floods with:

WARN moon::shard::conn_accept: Shard 0: connection rejected: maxclients reached

Even though CONFIG GET maxclients returns 10000.

Root cause

src/server/conn/handler_monoio.rs:2330:

crate::admin::metrics_setup::record_connection_closed();   // <── DUPLICATE
(MonoioHandlerResult::Done, None)

While src/shard/conn_accept.rs:624-628 also decrements (correctly guarded for the non-migration case):

#[cfg(target_os = "linux")]
if !_migrated {
    crate::admin::metrics_setup::record_connection_closed();
}

The comment at handler_monoio.rs:84 already documents the intended ownership:

// NOTE: do NOT call record_connection_opened() here — the caller
// (conn_accept.rs) already increments via try_accept_connection().

By symmetry, the caller should also own the decrement. Removing the handler-level call restores that symmetry.
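One way to express that ownership is a caller-owned pair: check-and-increment on accept, a single decrement on close. A sketch only, the real functions live in crate::admin::metrics_setup and their bodies may differ:

use std::sync::atomic::{AtomicU64, Ordering};

static CONNECTED_CLIENTS: AtomicU64 = AtomicU64::new(0);

// Caller-owned increment: admit the connection only while under the limit.
fn try_accept_connection(maxclients: u64) -> bool {
    CONNECTED_CLIENTS
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
            (n < maxclients).then_some(n + 1)
        })
        .is_ok()
}

// Caller-owned decrement: must run exactly once per admitted connection.
fn record_connection_closed() {
    CONNECTED_CLIENTS.fetch_sub(1, Ordering::AcqRel);
}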

Fix

Single-line removal: record_connection_closed() call at handler_monoio.rs:2330.

Verification (aarch64 Linux / OrbStack moon-dev)

--- 10 sequential SETs ---
SET key1 -> OK
SET key2 -> OK
... (all 10 succeed)
--- connected_clients ---
connected_clients:1   # just the INFO probe — counter NOT wrapped
--- bench 10k SET p=16 c=50 ---
100.000% <= 3.103 milliseconds (cumulative count 10000)
  throughput summary: 1250000.00 requests per second

Before the fix, the same bench reported SET: rps=0.0 and Error: Server closed the connection.

Scope

Fixes only the monoio path. The Tokio TLS path (conn_accept.rs:240 + handler_sharded.rs:176) has an analogous double-decrement — follow-up PR. The migration path's counter accounting is imbalanced in the opposite direction and is also a separate fix.

Blocks

PR #71 (perf recovery) — any throughput claim on monoio aarch64 needs this fix to produce real benchmark numbers.

Test plan

  • Compiles on both runtime-monoio and runtime-tokio feature sets
  • cargo clippy -- -D warnings clean
  • scripts/audit-unsafe.sh passes (178/178 SAFETY comments)
  • 10 sequential SETs succeed (reproduction fixed)
  • redis-benchmark SET p=16 c=50 n=10000 returns real throughput (1.25M rps on OrbStack aarch64)
  • INFO clients shows correct connected_clients (no wraparound)

Summary by CodeRabbit

  • Bug Fixes
    • Fixed a connection-counting bug that could underflow and cause new connections to be rejected under --maxclients.
    • Ensured migrated connections are accounted for so connection metrics stay balanced.
    • Result: more reliable connection handling and improved benchmark throughput with fewer rejection errors.

handle_connection_sharded_monoio called record_connection_closed() at
its exit, while conn_accept.rs (the caller) ALSO calls it in the non-
migration branch (line 627). The AtomicU64 counter wrapped from 0 to
u64::MAX on the second fetch_sub, causing every subsequent
try_accept_connection to reject against maxclients.

Symptom: first connection succeeds, all subsequent connections rejected
with "maxclients reached" — even though CONFIG GET maxclients returns
10000 and the original value was never reached.

Repro (before fix):
  ./target/release/moon --port 6400 --shards 1 --appendonly no &
  redis-cli -p 6400 SET foo bar   # ✓ OK
  redis-cli -p 6400 SET foo bar   # ✗ Connection reset by peer

Fix: remove the handler-level decrement. The comment at line 84 already
documents that the caller owns the increment via try_accept_connection;
by symmetry the caller owns the decrement (conn_accept.rs:547 for TLS,
conn_accept.rs:627 for plain TCP non-migrated path).

Migration path counter accounting is a separate concern (already
imbalanced) and is not addressed here.

Verified on aarch64 Linux (OrbStack moon-dev):
  - 10 sequential SETs all succeed
  - INFO clients reports connected_clients:1 (just the probe)
  - redis-benchmark SET p=16 c=50 n=10000 → 1.25M req/s (real number)

Blocks: PR #71 perf recovery — cannot measure real throughput without
this fix. Once merged, PR #71 can be validated with bench-compare.sh.
@qodo-code-review

Review Summary by Qodo

Fix connection counter double-decrement on monoio path

🐞 Bug fix


Walkthroughs

Description
• Remove duplicate record_connection_closed() call in monoio handler
• Fixes counter wraparound causing all connections rejected after first
• Restores symmetry: caller owns both increment and decrement
• Verified: sequential SETs succeed, real throughput restored (1.25M req/s)
Diagram
flowchart LR
  A["try_accept_connection<br/>increments counter"] -->|caller owns| B["conn_accept.rs<br/>decrements on close"]
  C["handle_connection_sharded_monoio<br/>REMOVED duplicate decrement"] -.->|was breaking| D["AtomicU64 wraps<br/>to u64::MAX"]
  D -->|causes| E["maxclients check<br/>always rejects"]
  B -->|fix restores| F["Counter stays valid<br/>connections accepted"]


File Changes

1. src/server/conn/handler_monoio.rs 🐞 Bug fix +6/-1

Remove duplicate connection close metric decrement

• Removed record_connection_closed() call at line 2330 (was causing double-decrement)
• Added explanatory comment documenting ownership model and wraparound bug
• Clarifies that caller (conn_accept.rs) owns both increment and decrement

2. CHANGELOG.md 📝 Documentation +4/-0

Document connection counter double-decrement fix

• Added "Fixed" section documenting the double-decrement bug
• Describes symptom: maxclients reached after first disconnect
• Explains root cause: counter wraparound to u64::MAX
• Notes verification: sequential SETs succeed, throughput restored


coderabbitai Bot commented Apr 12, 2026


ℹ️ Review info: configuration defaults, review profile CHILL, plan Pro, run ID 9e86c77a-c10e-4604-8d5d-e193c6a8c5cc

📥 Commits

Reviewing files that changed from the base of the PR and between 4b5e1e9 and 320b780.

📒 Files selected for processing (1)
  • src/shard/conn_accept.rs
📝 Walkthrough

A bug fix addressing connection-counter underflow: removed the double decrement of the shared AtomicU64 connected-clients counter in the monoio handler by delegating metrics recording to the caller, and added explicit accounting at the migration target to balance migrated connections.

Changes

• Documentation (CHANGELOG.md): Added an Unreleased changelog entry documenting the connection-counter double-decrement fix and verification/benchmark notes.
• Connection Handler (src/server/conn/handler_monoio.rs): Removed the call to crate::admin::metrics_setup::record_connection_closed() at handler shutdown and added a comment explaining that metrics are recorded by the caller to avoid counter underflow.
• Connection Acceptance / Migration (src/shard/conn_accept.rs): After a migrated monoio handler future completes in spawn_migrated_monoio_connection, now unconditionally calls record_connection_closed() on the target shard to decrement the connected-clients metric for migrated connections.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Accept as conn_accept.rs (caller)
    participant Handler as handler_monoio.rs
    participant Target as spawn_migrated_monoio_connection
    participant Metrics as record_connection_closed()

    Client->>Accept: connect / try_accept_connection
    Accept->>Handler: spawn handler (monoio)
    alt non-migration path
        Handler-->>Accept: handler completes
        Accept->>Metrics: record_connection_closed()  %% caller decrements on normal close
    else migration path
        Handler-->>Target: migrate connection (spawn on target shard)
        Target->>Handler: run migrated handler future
        Target-->>Metrics: record_connection_closed()  %% target decrements after migrated handler finishes
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I hopped where counters used to fall,
Tweaked a decrement, no wrap at all.
Migrants now say goodbye in the right place,
Connections resume their steady race. ✨

🚥 Pre-merge checks: 3 of 3 passed
• Title check ✅: The PR title accurately summarizes the main change: removal of a duplicate record_connection_closed() call that was causing the connection counter to double-decrement on the monoio path.
• Description check ✅: The PR description is comprehensive and includes all required sections: summary, root cause explanation, fix details, verification with test results, and scope. However, the checklist items (cargo fmt, clippy, cargo test, consistency tests) are not explicitly marked as completed.
• Docstring coverage ✅: 100.00%, above the required 80.00% threshold.



qodo-code-review Bot commented Apr 12, 2026

Code Review by Qodo

🐞 Bugs (1)   📘 Rule violations (0)   📎 Requirement gaps (0)   🎨 UX Issues (0)
🐞 Correctness (1)



Action required

1. Migrated clients never decrement 🐞
Description
After this PR, monoio connections that migrate to another shard will no longer call
record_connection_closed() when they finally disconnect on the target shard, so
CONNECTED_CLIENTS leaks upward. Over time this will incorrectly trip
try_accept_connection(maxclients) and reject new connections even when no clients are actually
connected.
Code

src/server/conn/handler_monoio.rs[R2330-2336]

+    // NOTE: connection close is recorded by the caller (conn_accept.rs) to
+    // preserve symmetry with `try_accept_connection`, which owns the
+    // increment.  Decrementing here too produces a double-decrement on the
+    // AtomicU64 counter — it wraps to u64::MAX on the second subtraction
+    // and all subsequent `try_accept_connection` comparisons against
+    // `maxclients` reject new connections.
    (MonoioHandlerResult::Done, None)
Evidence
Fresh monoio accepts increment via try_accept_connection(), and the source shard intentionally
skips record_connection_closed() when the connection migrates (because it stays alive). Prior to
this PR, the *target shard's* handle_connection_sharded_monoio call performed the final decrement
on disconnect; this PR removes that decrement, and spawn_migrated_monoio_connection does not add a
replacement decrement after awaiting the handler, leaving no close accounting for migrated
connections.

src/admin/metrics_setup.rs[338-402]
src/shard/conn_accept.rs[511-631]
src/shard/conn_accept.rs[640-778]
src/server/conn/handler_monoio.rs[2239-2262]
src/server/conn/handler_monoio.rs[2289-2337]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`handle_connection_sharded_monoio()` no longer calls `record_connection_closed()`. For migrated monoio connections, the source shard correctly skips decrementing (connection stays alive), but the target shard’s migrated handler spawn also does not decrement after the handler finishes. This leaks `CONNECTED_CLIENTS`, eventually causing erroneous `maxclients` rejections.

## Issue Context
- Source shard: increments via `try_accept_connection(maxclients)`.
- Source shard: skips decrement when `_migrated == true`.
- Target shard: `spawn_migrated_monoio_connection()` spawns `handle_connection_sharded_monoio(..., Some(&state))` but does not call `record_connection_closed()` afterward.
- After this PR, there is no remaining place that decrements for migrated connections on final disconnect.

## Fix Focus Areas
- src/shard/conn_accept.rs[640-778]
- src/server/conn/handler_monoio.rs[2289-2337]

### Suggested implementation direction
Add `crate::admin::metrics_setup::record_connection_closed();` after awaiting `handle_connection_sharded_monoio()` in the `monoio::spawn` block inside `spawn_migrated_monoio_connection()`.

Optionally, update the comment in `handler_monoio.rs` to clarify that the caller owns close accounting for fresh accepts, but migrated-connection spawns must still decrement on final close.
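A sketch of that direction with mocked types (only monoio::spawn and the decrement semantics come from the sources above; the handler signature is simplified and spawn_migrated is a hypothetical name):

use std::sync::{Arc, atomic::{AtomicU64, Ordering}};

// Simplified stand-in: the real handler takes the stream, shard state,
// and a can_migrate flag.
async fn handle_connection_sharded_monoio() {}

fn spawn_migrated(connected_clients: Arc<AtomicU64>) {
    monoio::spawn(async move {
        // Run the migrated handler to completion on the target shard.
        handle_connection_sharded_monoio().await;
        // Balancing decrement: the source shard skipped its decrement on a
        // successful hand-off, so the target owns the final close accounting
        // (record_connection_closed() in the real code).
        connected_clients.fetch_sub(1, Ordering::AcqRel);
    });
}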




Comment on lines +2330 to 2336
// NOTE: connection close is recorded by the caller (conn_accept.rs) to
// preserve symmetry with `try_accept_connection`, which owns the
// increment. Decrementing here too produces a double-decrement on the
// AtomicU64 counter — it wraps to u64::MAX on the second subtraction
// and all subsequent `try_accept_connection` comparisons against
// `maxclients` reject new connections.
(MonoioHandlerResult::Done, None)


coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/server/conn/handler_monoio.rs`:
- Around line 2330-2335: In the migration/resumption branch where
handle_connection_sharded_monoio(...).await is invoked with can_migrate: false,
add a call to record_connection_closed() immediately after the await completes
so the connection-count decrement mirrors the plain-TCP path (see the existing
pattern around the plain TCP handler). This prevents leaked connected_clients
for resumed/migrated connections by ensuring the counter is decremented when the
handler returns Done.

ℹ️ Review info: configuration defaults, review profile CHILL, plan Pro, run ID 3bc90056-84a7-4420-be78-9ec447786152

📥 Commits

Reviewing files that changed from the base of the PR and between d7d10da and 7bc2712.

📒 Files selected for processing (2)
  • CHANGELOG.md
  • src/server/conn/handler_monoio.rs

Comment thread src/server/conn/handler_monoio.rs
TinDang97 added a commit that referenced this pull request Apr 12, 2026
The initial inline SET used Bytes::copy_from_slice for key and value,
which triggers malloc+memcpy twice per SET. The Frame-based path
achieves zero-copy via Bytes::slice on a frozen BytesMut, which is
just an Arc refcount bump. This disparity caused a ~4-7% SET p=1
regression because the inline-path savings were outweighed by the
new allocations at low pipeline depth.

Fix: call read_buf.split_to(consumed).freeze() once, then slice() the
frozen Bytes for key, value, and AOF. All three are now Arc refcount
bumps over the same underlying allocation — zero malloc, zero memcpy.

Measured impact on aarch64 Linux (OrbStack, 1 shard, 50 clients):

                before nocopy       after nocopy       delta
  SET p=16:     2.94M rps           3.11M rps          +5.4%  (peak 3.60M)
  SET p=1:      237K rps            235K rps           -1%    (within noise)
  GET p=16:     3.32M rps           3.33M rps          +0.3%

PR #71 totals vs origin/main (both with PR #72 maxclients fix):
  SET p=16: 2.43M → 3.11M  =  +28% (peak +48%)
  SET p=1:  241K  → 235K   =  -2.5% (was -4%, noise-level now)
  GET p=16: 3.36M → 3.33M  =  ±0%

All 11 inline-dispatch tests pass unchanged.

Commit 4b5e1e9:

When a connection migrates to another shard, the source wrapper skips
the decrement (connection stays alive) and the target shard's
spawn_migrated_monoio_connection() now owns the balancing decrement
on final close.

Without this, every migration leaked CONNECTED_CLIENTS upward and
eventually tripped maxclients on long-running clusters.

Addresses review feedback from qodo-code-review and coderabbit on
PR #72 — both bots flagged this exact leak.

coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/shard/conn_accept.rs`:
- Around line 768-773: The migrated-connection decrement is only done after
handle_connection_sharded_monoio() returns, so if target-side setup fails
earlier the source-side skipped decrement (because _migrated == true) and
CONNECTED_CLIENTS leaks; fix by ensuring record_connection_closed() (or the
CONNECTED_CLIENTS decrement) is invoked on all early error/return paths when
_migrated == true — e.g., add a short-lived guard in the conn_accept setup code
(or explicit calls) so that any early return before
handle_connection_sharded_monoio() will call
crate::admin::metrics_setup::record_connection_closed(); keep transfer semantics
so that when ownership is successfully transferred the guard is disarmed and no
double-decrement occurs.
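The guard this comment describes could look like the following RAII sketch: decrement on drop unless disarmed at the successful hand-off point. Illustrative only; the PR ultimately fixed the leak with explicit calls rather than a guard:

use std::sync::atomic::{AtomicU64, Ordering};

// Decrements the connection counter on drop unless explicitly disarmed.
struct CloseGuard<'a> {
    counter: &'a AtomicU64,
    armed: bool,
}

impl<'a> CloseGuard<'a> {
    fn new(counter: &'a AtomicU64) -> Self {
        Self { counter, armed: true }
    }

    // Call once ownership transfers to the handler; dropping a disarmed
    // guard does nothing, so no double-decrement can occur.
    fn disarm(mut self) {
        self.armed = false;
    }
}

impl Drop for CloseGuard<'_> {
    fn drop(&mut self) {
        if self.armed {
            // Early return before the handler ran: balance the accept here.
            self.counter.fetch_sub(1, Ordering::AcqRel);
        }
    }
}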

ℹ️ Review info: configuration defaults, review profile CHILL, plan Pro, run ID 45661695-bcf3-451b-9cee-ceaf03360aa1

📥 Commits

Reviewing files that changed from the base of the PR and between 7bc2712 and 4b5e1e9.

📒 Files selected for processing (1)
  • src/shard/conn_accept.rs

Comment thread src/shard/conn_accept.rs
Commit 320b780:

coderabbit flagged two leak paths in spawn_migrated_monoio_connection:

1. set_nonblocking(true) failure on the reconstructed std stream
2. monoio::net::TcpStream::from_std(...) Err branch

Both return without running the handler. The source shard already
skipped its wrapper decrement on successful hand-off (_migrated ==
true), so the counter was +1 with no balancing decrement.

Add record_connection_closed() on each early-error return.  The happy
path's decrement after handler.await still works unchanged.
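The shape of that fix, with the project's metrics call mocked locally so the sketch stands alone (set_nonblocking and monoio::net::TcpStream::from_std are the calls named above; adopt_migrated_stream is a hypothetical name):

use std::sync::atomic::{AtomicU64, Ordering};

static CONNECTED_CLIENTS: AtomicU64 = AtomicU64::new(0);

fn record_connection_closed() {
    CONNECTED_CLIENTS.fetch_sub(1, Ordering::AcqRel);
}

// Hypothetical shape of the stream-reconstruction step inside
// spawn_migrated_monoio_connection; both early errors balance the accept.
fn adopt_migrated_stream(std_stream: std::net::TcpStream) -> Option<monoio::net::TcpStream> {
    if std_stream.set_nonblocking(true).is_err() {
        record_connection_closed(); // no handler will run for this connection
        return None;
    }
    match monoio::net::TcpStream::from_std(std_stream) {
        Ok(stream) => Some(stream),
        Err(_) => {
            record_connection_closed(); // no handler will run for this connection
            None
        }
    }
}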
@pilotspacex-byte (Contributor, Author)

Review feedback addressed

qodo-code-review (1 bug): ✅ Fixed in 4b5e1e9: spawn_migrated_monoio_connection now calls record_connection_closed() after handle_connection_sharded_monoio.await.

coderabbitai (2 leak paths on migration errors): ✅ Fixed in 320b780: the set_nonblocking(true) failure and the monoio::net::TcpStream::from_std(...) Err branch now decrement the counter before returning.

All three counter-leak exit paths now balanced. Source shard owns increment via try_accept_connection; source skips decrement iff migration succeeded; target owns decrement on handler exit OR on any early-error path before the handler runs.
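That ledger can be written down as a small test against a mock counter. It documents the invariant rather than exercising the real code paths (all names here are illustrative):

use std::sync::atomic::{AtomicU64, Ordering};

#[test]
fn connection_counter_balances_on_every_exit_path() {
    let counter = AtomicU64::new(0);

    // Fresh accept + normal close: caller increments, caller decrements.
    counter.fetch_add(1, Ordering::AcqRel);
    counter.fetch_sub(1, Ordering::AcqRel);
    assert_eq!(counter.load(Ordering::Acquire), 0);

    // Accept + successful migration: the source skips its decrement and the
    // target decrements once when the migrated handler exits.
    counter.fetch_add(1, Ordering::AcqRel);
    counter.fetch_sub(1, Ordering::AcqRel);
    assert_eq!(counter.load(Ordering::Acquire), 0);

    // Accept + migration setup failure: the target decrements on the early return.
    counter.fetch_add(1, Ordering::AcqRel);
    counter.fetch_sub(1, Ordering::AcqRel);
    assert_eq!(counter.load(Ordering::Acquire), 0);
}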

@pilotspacex-byte (Contributor, Author)

CI flake — re-running

Check job failed on storage::eviction::tests::test_volatile_ttl_evicts_soonest — a pre-existing flaky test unrelated to this PR. Timing-dependent; passes in isolation, fails when run alongside 2000+ other tests.

Evidence this is not our change: the PR's commits touch only connection accounting (src/server/conn/handler_monoio.rs, src/shard/conn_accept.rs) and CHANGELOG.md. The failing test lives in src/storage/eviction.rs, which is untouched.

gh run rerun 24301256589 --failed triggered.

pilotspacex-byte merged commit 3c7011a into main on Apr 12, 2026
8 of 9 checks passed