rpc: de-flake Server.OverloadTest{,NoCreateSpecialMessage} #209

Merged
chen3feng merged 1 commit into master from deflake-server-overload-test on Jun 27, 2026
@chen3feng

Collaborator

Problem

Server.OverloadTestNoCreateSpecialMessage (and OverloadTest) flake on busy CI runners: the Bazel lane failed consistently with server_test.cc:261: ASSERT_EQ(100, succeeded) (Which is: 300 / 400), followed by a SIGSEGV in the next test.

The handler slept a fixed 2s, and the test fired N (1000/10000) calls at a 100-slot cap expecting exactly 100 to succeed. On a slow runner, submitting all N takes longer than 2s, so the first sleepy handlers finish and free slots mid-submit → more than the cap succeed. And because it's a fatal ASSERT_, the failure skipped gate->Stop()/server.Stop(), leaving in-flight calls → the SIGSEGV in the following test.
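The race can be illustrated with a small discrete-time model. This is only a sketch under simplifying assumptions: `CountAdmitted` and its parameters are hypothetical names, submission cost is modelled as a fixed per-call delay, and rejected calls are assumed to be dropped immediately. It is not the real server or test code.

```cpp
#include <cassert>
#include <deque>

// Hypothetical model, not the real server: call i is submitted at
// i * submit_cost_ms, and each admitted call holds one of `cap` slots for
// sleep_ms before freeing it. Returns how many calls were admitted.
int CountAdmitted(int n, int cap, long long submit_cost_ms, long long sleep_ms) {
  assert(n >= 0 && cap > 0 && submit_cost_ms >= 0 && sleep_ms >= 0);
  std::deque<long long> finish_times;  // Ascending; one entry per held slot.
  int admitted = 0;
  for (int i = 0; i < n; ++i) {
    const long long now = i * submit_cost_ms;
    // Free every slot whose handler has finished sleeping by `now`.
    while (!finish_times.empty() && finish_times.front() <= now) {
      finish_times.pop_front();
    }
    // Admit the call only if a slot is free; otherwise it is dropped.
    if (static_cast<int>(finish_times.size()) < cap) {
      finish_times.push_back(now + sleep_ms);
      ++admitted;
    }
  }
  return admitted;
}
```

With these made-up numbers, `CountAdmitted(1000, 100, 1, 2000)` returns 100 because all calls are submitted before any handler wakes, while `CountAdmitted(1000, 100, 5, 2000)` returns 300 because slots free up at 2000 ms and 4000 ms while submission is still in progress.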

Fix — make it deterministic

SleepyEchoService now holds its concurrency slot until released (bumping an admitted counter on entry). Each test:

  1. fires exactly the cap (100) with a long deadline (the held calls),
  2. waits until the cap is full (admitted == 100),
  3. floods the remaining N−100 with a short deadline — all dropped, since the cap is held,
  4. waits for those to drain, then releases the held calls.
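The four steps above can be sketched with standard-library primitives. This is a minimal illustration, not the code in this PR: `HeldSlotGate` and `RunOverloadSketch` are hypothetical names, the concurrency cap and the handler are folded into one class, and a call that finds the cap full is rejected immediately rather than timing out on a short deadline.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the concurrency cap plus the slot-holding handler.
class HeldSlotGate {
 public:
  explicit HeldSlotGate(int cap) : cap_(cap) { assert(cap > 0); }

  // Models one call: false if the cap is full, true once released.
  bool Call() {
    std::unique_lock<std::mutex> lock(mu_);
    if (in_flight_ >= cap_) return false;  // Cap is held: drop the call.
    ++in_flight_;
    ++admitted_;  // Bumped on entry, as the description says.
    cv_.notify_all();
    cv_.wait(lock, [this] { return released_; });  // Hold the slot.
    --in_flight_;
    return true;
  }

  void WaitUntilAdmitted(int n) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this, n] { return admitted_ >= n; });
  }

  void Release() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      released_ = true;
    }
    cv_.notify_all();
  }

 private:
  const int cap_;
  std::mutex mu_;
  std::condition_variable cv_;
  int in_flight_ = 0;
  int admitted_ = 0;
  bool released_ = false;
};

// Returns the number of calls that succeeded out of `total`.
int RunOverloadSketch(int cap, int total) {
  HeldSlotGate gate(cap);
  std::atomic<int> succeeded{0};
  std::vector<std::thread> held;
  for (int i = 0; i < cap; ++i) {  // Step 1: fire exactly the cap.
    held.emplace_back([&gate, &succeeded] {
      if (gate.Call()) ++succeeded;
    });
  }
  gate.WaitUntilAdmitted(cap);  // Step 2: wait until the cap is full.
  for (int i = cap; i < total; ++i) {  // Step 3: flood; every call is dropped.
    if (gate.Call()) ++succeeded;
  }
  gate.Release();  // Step 4: the flood has drained, so release the held calls.
  for (auto& t : held) t.join();
  return succeeded.load();
}
```

Because the flood starts only after `admitted_` reaches the cap and nothing is released until the flood ends, the result does not depend on how fast the calls are submitted.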

So precisely the cap's worth succeed, no extra slips into a freed slot, and nothing times out, independent of runner speed. Also switched ASSERT_EQ to EXPECT_EQ so a failure can't skip teardown.
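Why the assertion flavour matters for teardown can be shown without GoogleTest. In the sketch below, the `SKETCH_*` macros, `FakeServer`, and the helper functions are all hypothetical; they only mimic the documented behaviour that a fatal `ASSERT_*` returns from the current function while `EXPECT_*` records the failure and continues.

```cpp
#include <cassert>

// Hypothetical stand-in for the server whose Stop() must run at teardown.
struct FakeServer {
  bool stopped = false;
  void Stop() { stopped = true; }
};

// Mimics a fatal assertion: record the failure and leave the function.
#define SKETCH_ASSERT_EQ(expected, actual, failures) \
  do {                                               \
    if ((expected) != (actual)) {                    \
      ++(failures);                                  \
      return;                                        \
    }                                                \
  } while (0)

// Mimics a non-fatal expectation: record the failure and keep going.
#define SKETCH_EXPECT_EQ(expected, actual, failures) \
  do {                                               \
    if ((expected) != (actual)) ++(failures);        \
  } while (0)

void BodyWithFatalCheck(FakeServer& server, int succeeded, int& failures) {
  SKETCH_ASSERT_EQ(100, succeeded, failures);
  server.Stop();  // Skipped when the check above fails.
}

void BodyWithNonFatalCheck(FakeServer& server, int succeeded, int& failures) {
  SKETCH_EXPECT_EQ(100, succeeded, failures);
  server.Stop();  // Always reached.
}

// Runs a failing body (succeeded == 300) and reports whether teardown ran.
bool TeardownRanAfterFailure(bool fatal) {
  FakeServer server;
  int failures = 0;
  if (fatal) {
    BodyWithFatalCheck(server, 300, failures);
  } else {
    BodyWithNonFatalCheck(server, 300, failures);
  }
  assert(failures == 1);  // Both flavours report the mismatch.
  return server.stopped;
}
```

Both variants report the mismatch, but only the non-fatal one reaches `Stop()`, so a failing run no longer leaves in-flight calls behind for the next test.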

Verification

  • 10/10 passes under heavy CPU load (half the cores saturated) — was consistently failing on the busy CI runner
  • ~1.1s per test (down from ~3.1s)
  • all 10 server_test cases pass; the downstream SIGSEGV is gone

`ASSERT_EQ(100, succeeded)` over N racing calls against a 100-slot cap is
timing-dependent: the handler slept a fixed 2s, so on a busy/slow runner the
first sleepy handlers finish and free slots before all N calls are submitted,
letting more than the cap succeed (CI saw 300/400). Being a *fatal* assert, the
failure also skipped `gate->Stop()`/`server.Stop()`, leaving in-flight calls and
crashing the next test (SIGSEGV).

Make it deterministic: SleepyEchoService now *holds* its concurrency slot until
released (bumping an `admitted` counter on entry). Each test fires exactly the
cap (100) with a long deadline, waits until the cap is full, then floods the
rest with a short deadline -- all dropped, since the cap is held. Once those
drain, it releases the held calls: precisely the cap's worth succeed, no extra
slips into a freed slot, and nothing times out -- independent of runner speed.
Also switch to EXPECT_EQ so a failure can't skip teardown.

Verified: passes 10/10 under heavy CPU load (was consistently failing on the
busy CI runner); ~1.1s each (down from ~3.1s).
@chen3feng force-pushed the deflake-server-overload-test branch from 46d24e4 to fd8ec71 on June 27, 2026 18:00
@chen3feng merged commit be265c3 into master on Jun 27, 2026
9 checks passed
@chen3feng deleted the deflake-server-overload-test branch on June 27, 2026 22:41