[WIP]: fix(cosmos): HTTP/2 PING handler must observe child-stream read activity, not just parent-pipeline frames#49398
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes Cosmos DB’s manual HTTP/2 keepalive so the parent-channel Http2PingHandler correctly treats child-stream HTTP/2 traffic (HEADERS/DATA) as read activity when determining idleness, avoiding unnecessary PINGs (and potential spurious failure accumulation) under sustained load with Http2MultiplexHandler.
Changes:
- Track child-stream inbound read activity via a parent-channel
AttributeKey<AtomicLong>updated from each child stream’schannelReadComplete. - Update the PING state machine to use an “effective last read activity” (max of parent-pipeline reads and child-stream reads), including clearing an outstanding PING when post-PING child-stream traffic proves the connection is healthy.
- Add unit tests validating child-stream activity suppresses idle PINGs and resets outstanding-PING failure state.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2PingHandler.java | Adds child-stream read activity tracking and updates idle/outstanding-PING logic to use effective read activity. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2PingCloseRewrapHandler.java | Stamps parent-channel child-read activity once per child read cycle via channelReadComplete. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the behavioral fix and its impact under load. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/http/Http2PingHandlerTest.java | Adds unit tests covering the child-stream activity behavior and state transitions. |
|
@sdkReviewAgent |
…ity, not just parent-pipeline frames Follow-up to Azure#49095. Root cause: Http2PingHandler is installed on the parent (TCP) channel and only updated lastReadActivityNanos from parent-pipeline channelRead -- which sees PING ACK / SETTINGS / GOAWAY but NOT HEADERS / DATA frames. With reactor-netty's Http2MultiplexHandler in the parent pipeline, all application H2 traffic is demuxed onto per-stream child channels, so a connection serving thousands of QPS still looked idle to the PING handler, causing unnecessary PINGs and measurable P95 overhead under load. Fix: - New package-private AttributeKey<AtomicLong> LAST_CHILD_STREAM_READ_ACTIVITY_NANOS seeded on the parent channel in handlerAdded. - Http2PingCloseRewrapHandler (already installed in every child stream's pipeline) overrides channelReadComplete to stamp the parent attribute via accumulateAndGet(now, Math::max) -- once per read cycle, monotonic. - Http2PingHandler.maybeSendPing now uses effectiveLastReadActivityNanos = max(field, child attr) for both the idle-threshold check AND the outstanding-PING early-return: if child-stream reads arrived after an outstanding PING, clear pingOutstandingSinceNanos and reset consecutiveFailures (defense against middleboxes that drop PING-ACK but pass application H2 frames). - Names use lastReadActivity / read activity -- both signals are inbound reads, never writes. Tests: two new unit tests in Http2PingHandlerTest covering the suppression path and the outstanding-PING reset path; both bypass real time via reflection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
c85d48c to
55ed09e
Compare
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@sdkReviewAgent |
…equest The previous .doOnRequest() chained onto the HTTP/2 HttpClient builder wrapped every outgoing request's Mono in a Reactor operator. That paid per-request costs even when the rewrap handler was already installed: * MonoPeekTerminal wrap + per-request subscriber allocation (InternalMonoOperator.subscribe +2.60% in alloc profile) * per-request BiConsumer lambda invocation (Http2Pool\$\...run +1.16% in alloc profile) * per-request channel pipeline String-key walk via pipeline.get(...) Long-run async-profiler diff vs v4.80.0 on D2 attributed the PR's -7.6% QPS / +14.5% mean / +166% tail-max regression primarily to a +10.2% allocation rate -- with the operators above dominating the delta. Move the install into the existing .doOnConnected() block, which already runs once per HTTP/2 child stream (the customHeaderCleaner install right above it relies on the same contract). The rewrap handler is @sharable, so installing once per stream is correct and amortizes the install + pipeline-walk away from the hot path. The ch.parent() != null guard is kept so the install only targets child streams, never the parent TCP channel (which uses Http2ParentChannelExceptionHandler instead). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| // stack maps that to GATEWAY_HTTP2_PING_TIMEOUT_CHANNEL_CLOSED so | ||
| // ClientRetryPolicy can suppress region mark-down. | ||
| // | ||
| // doOnConnected fires once per child stream channel (the existing |
There was a problem hiding this comment.
🔴 Blocking — Lifecycle Mismatch: doOnConnected does not fire for H2 child streams — handler is never installed
This comment claims doOnConnected fires once per child stream channel, but it contradicts the comment 87 lines earlier in the same file (lines 152–154):
"For H2, doOnConnected fires on the parent TCP channel when the connection is first established (State.CONFIGURED). Child stream channels emit STREAM_CONFIGURED, which does not trigger this callback."
In reactor-netty, ClientTransportDoOn.onStateChange matches only State.CONFIGURED. H2 child streams emit HttpClientState.STREAM_CONFIGURED — a completely different enum value that never reaches this callback. When doOnConnected fires on the parent TCP channel, connection.channel() is the parent, and ch.parent() at line 251 is null. The guard ch.parent() != null evaluates to false, and Http2PingCloseRewrapHandler is never installed on any child stream.
Two failures result:
-
The core fix is nullified.
Http2PingCloseRewrapHandler.channelReadCompletenever runs on child streams →LAST_CHILD_STREAM_READ_ACTIVITY_NANOSis never updated beyond its seed value →effectiveLastReadActivityNanos()returns only the stale parent-pipeline timestamp → connections serving thousands of QPS still appear "idle" → spurious PINGs continue on every interval. This is exactly the bug the PR is supposed to fix. -
Existing rewrap feature regresses.
Http2PingCloseRewrapHandler.channelInactivenever fires on child streams → PING-timeout closes produce barePrematureCloseExceptioninstead ofHttp2PingTimeoutChannelClosedException→ClientRetryPolicycannot suppress region mark-down (regression from PR Add HTTP/2 PING for broken connection detection. #49095).
The claim that "the existing customHeaderCleaner install above relies on the same contract" is also incorrect — customHeaderCleaner installs on the parent pipeline (after reactor.left.httpCodec, which is the H2 frame codec on the parent) and works correctly precisely because it runs on the parent.
Suggested fix: Revert commit 26f1615f914 and keep the handler installation in doOnRequest. The per-request overhead is a single HashMap.get() + addFirst() call for a @Sharable singleton — negligible compared to the HTTP request itself. If the goal is to avoid per-request overhead, use reactor.netty.http.client.HttpClient.observe(ConnectionObserver) with a custom observer that matches both CONFIGURED and STREAM_CONFIGURED.
| } | ||
|
|
||
| @Test(groups = "unit") | ||
| public void childStreamReadActivity_suppressesIdlePing() throws Exception { |
There was a problem hiding this comment.
🟡 Recommendation — Test Gap: New tests bypass Http2PingCloseRewrapHandler.channelReadComplete entirely
Both new tests (childStreamReadActivity_suppressesIdlePing and childStreamReadActivityAfterOutstandingPing_resetsFailureState) set the LAST_CHILD_STREAM_READ_ACTIVITY_NANOS attribute directly via AtomicLong.set(). They never exercise the actual production code path in Http2PingCloseRewrapHandler.channelReadComplete() — the channelReadComplete → parent.attr(...).get().accumulateAndGet(System.nanoTime(), Math::max) path.
This means:
- If
channelReadCompleteused the wrong attribute key,set(0)instead ofaccumulateAndGet(now, max), or accidentally omitted thesuper.channelReadComplete(ctx)call, all tests would still pass. - The monotonic guarantee (
Math::max) for concurrent child streams is never verified. - Combined with the
doOnConnectedwiring issue above, the entire child-stream-to-parent signaling path has zero automated verification.
Consider adding a test that constructs a parent EmbeddedChannel (with Http2PingHandler) and a child EmbeddedChannel (with Http2PingCloseRewrapHandler.INSTANCE), fires channelReadComplete on the child, and asserts the parent's LAST_CHILD_STREAM_READ_ACTIVITY_NANOS attribute advanced.
| */ | ||
| private long effectiveLastReadActivityNanos(ChannelHandlerContext ctx) { | ||
| AtomicLong childReadActivity = ctx.channel().attr(LAST_CHILD_STREAM_READ_ACTIVITY_NANOS).get(); | ||
| long childNanos = childReadActivity != null ? childReadActivity.get() : 0L; |
There was a problem hiding this comment.
🟢 Suggestion — Defensive fallback 0L is not a valid nanoTime sentinel
When childReadActivity is null (defensive path — should never happen since handlerAdded seeds it), the fallback 0L is not a valid System.nanoTime() value. The Java spec allows nanoTime() to return negative values (arbitrary origin). If the JVM's nanoTime origin is negative:
lastReadActivityNanoscould be e.g.-500_000_000Math.max(-500_000_000, 0L) = 0L— a timestamp in the "future" relative to the current clockidleNanos = now - 0Lwould be negative → the connection would never appear idle → PINGs would never fire on that connection
The null path is essentially unreachable in practice, and the fix is trivial:
long childNanos = childReadActivity != null ? childReadActivity.get() : lastReadActivityNanos;This makes the null case a pure no-op (falls back to parent-only behavior) rather than introducing a foreign sentinel.
| // activity fix to prevent rapid-fire PING storms after write failure) does not | ||
| // block the second send within the same pingInterval. | ||
| setField(handler, "lastReadActivityNanos", longAgo); | ||
| setField(handler, "lastPingSendNanos", longAgo); |
There was a problem hiding this comment.
🟡 Recommendation — Test Gap: lastPingSendNanos send-throttle is tested by bypassing it, never by triggering it
The lastPingSendNanos field exists to prevent rapid-fire PING storms after write failure. This test proves the multi-tick failure path, but explicitly rewinds lastPingSendNanos to longAgo to bypass the throttle on the second tick (the comment at line 231-233 even acknowledges this).
No test ever asserts the converse: that a second call to maybeSendPing within the same pingIntervalNanos window (without rewinding lastPingSendNanos) does NOT send a second PING.
If lastPingSendNanos were accidentally removed from the idleAnchor computation at line 264 (long idleAnchor = Math.max(effectiveLastReadActivityNanos(ctx), lastPingSendNanos)), the PING-storm defense would disappear and all tests would still pass.
Consider adding a writeFailure_throttlesRapidResend test:
- First tick: write fails,
consecutiveFailures = 1,lastPingSendNanos = now - Rewind
lastReadActivityNanosand child attr to stale, but do not rewindlastPingSendNanos - Second tick: assert
pingsSentandconsecutiveFailuresdid not advance (throttle held)
|
✅ Review complete (50:38) Posted 4 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
Description
Follow-up to #49095.
Root cause
Http2PingHandleris installed on the parent (TCP) channel. ItslastReadActivityNanosfield was only updated fromchannelReadon the parent pipeline — which observes PING ACK / SETTINGS / GOAWAY but not HEADERS / DATA frames.With reactor-netty's
Http2MultiplexHandlerin the parent pipeline, all application H2 traffic (request/response HEADERS + DATA) is demuxed onto per-stream child channels, so it never reaches the parent'schannelRead. Result: a connection actively serving thousands of QPS still appeared "idle" to the PING handler, which then sent unnecessary PINGs every interval. Under load this added measurable latency overhead (visible as a P95 regression in benchmarking against v4.80.0 with thin-client enabled).Fix
AttributeKey<AtomicLong> LAST_CHILD_STREAM_READ_ACTIVITY_NANOSseeded on the parent channel inHttp2PingHandler.handlerAdded.Http2PingCloseRewrapHandler(already installed in every child stream's pipeline) overrideschannelReadCompleteto stamp the parent attribute viaaccumulateAndGet(now, Math::max)— once per read cycle, monotonic, not per-frame.Http2PingHandler.maybeSendPingnow readseffectiveLastReadActivityNanos = max(lastReadActivityNanos, child attr)for both the idle-threshold check and the "outstanding PING + activity resumed" early-return: if child-stream reads arrive after an outstanding PING, clearpingOutstandingSinceNanosand resetconsecutiveFailures(defense against middleboxes that drop PING-ACK but pass application H2 frames).lastReadActivity/ "read activity" throughout — both signals are inbound reads, never writes; the naming documents the semantics precisely.Tests
Two new unit tests in
Http2PingHandlerTest:childStreamReadActivity_suppressesIdlePing— stale parent timestamp + fresh child-stream attribute → no PING is sent.childStreamReadActivityAfterOutstandingPing_resetsFailureState— outstanding PING + stale parent + fresh child-stream attribute →pingOutstandingSinceNanoscleared,consecutiveFailuresreset, no new PING.Both bypass real time via reflection on package-private fields and verify state transitions only.
Additionally validated end-to-end with
Http2PingKeepaliveTest(Linuxtc netem/iptablesnetwork-fault test) against a live thin-client Cosmos account in Docker: passed (1 test, 0 failures, 57.98s).Why this isn't covered by the merged PR
PR #49095 installs the PING handler correctly and detects truly idle connections. It only mis-attributes activity when
Http2MultiplexHandleris present in the pipeline — which is always the case for reactor-netty's HTTP/2 client. The defect manifests only under sustained load (where the connection is busy enough that the PING is pure overhead, not a probe).