Skip to content

[WIP]: fix(cosmos): HTTP/2 PING handler must observe child-stream read activity, not just parent-pipeline frames#49398

Open
jeet1995 wants to merge 2 commits into
Azure:mainfrom
jeet1995:users/jeet1995/http2-ping-child-stream-activity
Open

[WIP]: fix(cosmos): HTTP/2 PING handler must observe child-stream read activity, not just parent-pipeline frames#49398
jeet1995 wants to merge 2 commits into
Azure:mainfrom
jeet1995:users/jeet1995/http2-ping-child-stream-activity

Conversation

@jeet1995

@jeet1995 jeet1995 commented Jun 7, 2026

Copy link
Copy Markdown
Member

Description

Follow-up to #49095.

Root cause

Http2PingHandler is installed on the parent (TCP) channel. Its lastReadActivityNanos field was only updated from channelRead on the parent pipeline — which observes PING ACK / SETTINGS / GOAWAY but not HEADERS / DATA frames.

With reactor-netty's Http2MultiplexHandler in the parent pipeline, all application H2 traffic (request/response HEADERS + DATA) is demuxed onto per-stream child channels, so it never reaches the parent's channelRead. Result: a connection actively serving thousands of QPS still appeared "idle" to the PING handler, which then sent unnecessary PINGs every interval. Under load this added measurable latency overhead (visible as a P95 regression in benchmarking against v4.80.0 with thin-client enabled).

Fix

  1. New package-private AttributeKey<AtomicLong> LAST_CHILD_STREAM_READ_ACTIVITY_NANOS seeded on the parent channel in Http2PingHandler.handlerAdded.
  2. Http2PingCloseRewrapHandler (already installed in every child stream's pipeline) overrides channelReadComplete to stamp the parent attribute via accumulateAndGet(now, Math::max) — once per read cycle, monotonic, not per-frame.
  3. Http2PingHandler.maybeSendPing now reads effectiveLastReadActivityNanos = max(lastReadActivityNanos, child attr) for both the idle-threshold check and the "outstanding PING + activity resumed" early-return: if child-stream reads arrive after an outstanding PING, clear pingOutstandingSinceNanos and reset consecutiveFailures (defense against middleboxes that drop PING-ACK but pass application H2 frames).
  4. Names use lastReadActivity / "read activity" throughout — both signals are inbound reads, never writes; the naming documents the semantics precisely.

Tests

Two new unit tests in Http2PingHandlerTest:

  • childStreamReadActivity_suppressesIdlePing — stale parent timestamp + fresh child-stream attribute → no PING is sent.
  • childStreamReadActivityAfterOutstandingPing_resetsFailureState — outstanding PING + stale parent + fresh child-stream attribute → pingOutstandingSinceNanos cleared, consecutiveFailures reset, no new PING.

Both bypass real time via reflection on package-private fields and verify state transitions only.

Additionally validated end-to-end with Http2PingKeepaliveTest (Linux tc netem / iptables network-fault test) against a live thin-client Cosmos account in Docker: passed (1 test, 0 failures, 57.98s).

Why this isn't covered by the merged PR

PR #49095 installs the PING handler correctly and detects truly idle connections. It only mis-attributes activity when Http2MultiplexHandler is present in the pipeline — which is always the case for reactor-netty's HTTP/2 client. The defect manifests only under sustained load (where the connection is busy enough that the PING is pure overhead, not a probe).

Copilot AI review requested due to automatic review settings June 7, 2026 01:55
@jeet1995 jeet1995 requested review from a team and kirankumarkolli as code owners June 7, 2026 01:55
@github-actions github-actions Bot added the Cosmos label Jun 7, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes Cosmos DB’s manual HTTP/2 keepalive so the parent-channel Http2PingHandler correctly treats child-stream HTTP/2 traffic (HEADERS/DATA) as read activity when determining idleness, avoiding unnecessary PINGs (and potential spurious failure accumulation) under sustained load with Http2MultiplexHandler.

Changes:

  • Track child-stream inbound read activity via a parent-channel AttributeKey<AtomicLong> updated from each child stream’s channelReadComplete.
  • Update the PING state machine to use an “effective last read activity” (max of parent-pipeline reads and child-stream reads), including clearing an outstanding PING when post-PING child-stream traffic proves the connection is healthy.
  • Add unit tests validating child-stream activity suppresses idle PINGs and resets outstanding-PING failure state.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2PingHandler.java Adds child-stream read activity tracking and updates idle/outstanding-PING logic to use effective read activity.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2PingCloseRewrapHandler.java Stamps parent-channel child-read activity once per child read cycle via channelReadComplete.
sdk/cosmos/azure-cosmos/CHANGELOG.md Documents the behavioral fix and its impact under load.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/http/Http2PingHandlerTest.java Adds unit tests covering the child-stream activity behavior and state transitions.

@jeet1995

jeet1995 commented Jun 7, 2026

Copy link
Copy Markdown
Member Author

@sdkReviewAgent

…ity, not just parent-pipeline frames

Follow-up to Azure#49095.

Root cause: Http2PingHandler is installed on the parent (TCP) channel and only updated lastReadActivityNanos from parent-pipeline channelRead -- which sees PING ACK / SETTINGS / GOAWAY but NOT HEADERS / DATA frames. With reactor-netty's Http2MultiplexHandler in the parent pipeline, all application H2 traffic is demuxed onto per-stream child channels, so a connection serving thousands of QPS still looked idle to the PING handler, causing unnecessary PINGs and measurable P95 overhead under load.

Fix:
- New package-private AttributeKey<AtomicLong> LAST_CHILD_STREAM_READ_ACTIVITY_NANOS seeded on the parent channel in handlerAdded.
- Http2PingCloseRewrapHandler (already installed in every child stream's pipeline) overrides channelReadComplete to stamp the parent attribute via accumulateAndGet(now, Math::max) -- once per read cycle, monotonic.
- Http2PingHandler.maybeSendPing now uses effectiveLastReadActivityNanos = max(field, child attr) for both the idle-threshold check AND the outstanding-PING early-return: if child-stream reads arrived after an outstanding PING, clear pingOutstandingSinceNanos and reset consecutiveFailures (defense against middleboxes that drop PING-ACK but pass application H2 frames).
- Names use lastReadActivity / read activity -- both signals are inbound reads, never writes.

Tests: two new unit tests in Http2PingHandlerTest covering the suppression path and the outstanding-PING reset path; both bypass real time via reflection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 jeet1995 force-pushed the users/jeet1995/http2-ping-child-stream-activity branch from c85d48c to 55ed09e Compare June 7, 2026 02:31
@jeet1995

jeet1995 commented Jun 7, 2026

Copy link
Copy Markdown
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995

jeet1995 commented Jun 7, 2026

Copy link
Copy Markdown
Member Author

@sdkReviewAgent

…equest

The previous .doOnRequest() chained onto the HTTP/2 HttpClient builder
wrapped every outgoing request's Mono in a Reactor operator. That paid
per-request costs even when the rewrap handler was already installed:

* MonoPeekTerminal wrap + per-request subscriber allocation
  (InternalMonoOperator.subscribe +2.60% in alloc profile)
* per-request BiConsumer lambda invocation
  (Http2Pool\$\...run +1.16% in alloc profile)
* per-request channel pipeline String-key walk via pipeline.get(...)

Long-run async-profiler diff vs v4.80.0 on D2 attributed the PR's
-7.6% QPS / +14.5% mean / +166% tail-max regression primarily to a
+10.2% allocation rate -- with the operators above dominating the delta.

Move the install into the existing .doOnConnected() block, which
already runs once per HTTP/2 child stream (the customHeaderCleaner
install right above it relies on the same contract). The rewrap
handler is @sharable, so installing once per stream is correct and
amortizes the install + pipeline-walk away from the hot path.

The ch.parent() != null guard is kept so the install only targets
child streams, never the parent TCP channel (which uses
Http2ParentChannelExceptionHandler instead).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 jeet1995 changed the title fix(cosmos): HTTP/2 PING handler must observe child-stream read activity, not just parent-pipeline frames [WIP]: fix(cosmos): HTTP/2 PING handler must observe child-stream read activity, not just parent-pipeline frames Jun 7, 2026
// stack maps that to GATEWAY_HTTP2_PING_TIMEOUT_CHANNEL_CLOSED so
// ClientRetryPolicy can suppress region mark-down.
//
// doOnConnected fires once per child stream channel (the existing

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Blocking — Lifecycle Mismatch: doOnConnected does not fire for H2 child streams — handler is never installed

This comment claims doOnConnected fires once per child stream channel, but it contradicts the comment 87 lines earlier in the same file (lines 152–154):

"For H2, doOnConnected fires on the parent TCP channel when the connection is first established (State.CONFIGURED). Child stream channels emit STREAM_CONFIGURED, which does not trigger this callback."

In reactor-netty, ClientTransportDoOn.onStateChange matches only State.CONFIGURED. H2 child streams emit HttpClientState.STREAM_CONFIGURED — a completely different enum value that never reaches this callback. When doOnConnected fires on the parent TCP channel, connection.channel() is the parent, and ch.parent() at line 251 is null. The guard ch.parent() != null evaluates to false, and Http2PingCloseRewrapHandler is never installed on any child stream.

Two failures result:

  1. The core fix is nullified. Http2PingCloseRewrapHandler.channelReadComplete never runs on child streams → LAST_CHILD_STREAM_READ_ACTIVITY_NANOS is never updated beyond its seed value → effectiveLastReadActivityNanos() returns only the stale parent-pipeline timestamp → connections serving thousands of QPS still appear "idle" → spurious PINGs continue on every interval. This is exactly the bug the PR is supposed to fix.

  2. Existing rewrap feature regresses. Http2PingCloseRewrapHandler.channelInactive never fires on child streams → PING-timeout closes produce bare PrematureCloseException instead of Http2PingTimeoutChannelClosedExceptionClientRetryPolicy cannot suppress region mark-down (regression from PR Add HTTP/2 PING for broken connection detection. #49095).

The claim that "the existing customHeaderCleaner install above relies on the same contract" is also incorrect — customHeaderCleaner installs on the parent pipeline (after reactor.left.httpCodec, which is the H2 frame codec on the parent) and works correctly precisely because it runs on the parent.

Suggested fix: Revert commit 26f1615f914 and keep the handler installation in doOnRequest. The per-request overhead is a single HashMap.get() + addFirst() call for a @Sharable singleton — negligible compared to the HTTP request itself. If the goal is to avoid per-request overhead, use reactor.netty.http.client.HttpClient.observe(ConnectionObserver) with a custom observer that matches both CONFIGURED and STREAM_CONFIGURED.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

}

@Test(groups = "unit")
public void childStreamReadActivity_suppressesIdlePing() throws Exception {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation — Test Gap: New tests bypass Http2PingCloseRewrapHandler.channelReadComplete entirely

Both new tests (childStreamReadActivity_suppressesIdlePing and childStreamReadActivityAfterOutstandingPing_resetsFailureState) set the LAST_CHILD_STREAM_READ_ACTIVITY_NANOS attribute directly via AtomicLong.set(). They never exercise the actual production code path in Http2PingCloseRewrapHandler.channelReadComplete() — the channelReadCompleteparent.attr(...).get().accumulateAndGet(System.nanoTime(), Math::max) path.

This means:

  • If channelReadComplete used the wrong attribute key, set(0) instead of accumulateAndGet(now, max), or accidentally omitted the super.channelReadComplete(ctx) call, all tests would still pass.
  • The monotonic guarantee (Math::max) for concurrent child streams is never verified.
  • Combined with the doOnConnected wiring issue above, the entire child-stream-to-parent signaling path has zero automated verification.

Consider adding a test that constructs a parent EmbeddedChannel (with Http2PingHandler) and a child EmbeddedChannel (with Http2PingCloseRewrapHandler.INSTANCE), fires channelReadComplete on the child, and asserts the parent's LAST_CHILD_STREAM_READ_ACTIVITY_NANOS attribute advanced.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

*/
private long effectiveLastReadActivityNanos(ChannelHandlerContext ctx) {
AtomicLong childReadActivity = ctx.channel().attr(LAST_CHILD_STREAM_READ_ACTIVITY_NANOS).get();
long childNanos = childReadActivity != null ? childReadActivity.get() : 0L;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion — Defensive fallback 0L is not a valid nanoTime sentinel

When childReadActivity is null (defensive path — should never happen since handlerAdded seeds it), the fallback 0L is not a valid System.nanoTime() value. The Java spec allows nanoTime() to return negative values (arbitrary origin). If the JVM's nanoTime origin is negative:

  • lastReadActivityNanos could be e.g. -500_000_000
  • Math.max(-500_000_000, 0L) = 0L — a timestamp in the "future" relative to the current clock
  • idleNanos = now - 0L would be negative → the connection would never appear idle → PINGs would never fire on that connection

The null path is essentially unreachable in practice, and the fix is trivial:

long childNanos = childReadActivity != null ? childReadActivity.get() : lastReadActivityNanos;

This makes the null case a pure no-op (falls back to parent-only behavior) rather than introducing a foreign sentinel.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

// activity fix to prevent rapid-fire PING storms after write failure) does not
// block the second send within the same pingInterval.
setField(handler, "lastReadActivityNanos", longAgo);
setField(handler, "lastPingSendNanos", longAgo);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation — Test Gap: lastPingSendNanos send-throttle is tested by bypassing it, never by triggering it

The lastPingSendNanos field exists to prevent rapid-fire PING storms after write failure. This test proves the multi-tick failure path, but explicitly rewinds lastPingSendNanos to longAgo to bypass the throttle on the second tick (the comment at line 231-233 even acknowledges this).

No test ever asserts the converse: that a second call to maybeSendPing within the same pingIntervalNanos window (without rewinding lastPingSendNanos) does NOT send a second PING.

If lastPingSendNanos were accidentally removed from the idleAnchor computation at line 264 (long idleAnchor = Math.max(effectiveLastReadActivityNanos(ctx), lastPingSendNanos)), the PING-storm defense would disappear and all tests would still pass.

Consider adding a writeFailure_throttlesRapidResend test:

  1. First tick: write fails, consecutiveFailures = 1, lastPingSendNanos = now
  2. Rewind lastReadActivityNanos and child attr to stale, but do not rewind lastPingSendNanos
  3. Second tick: assert pingsSent and consecutiveFailures did not advance (throttle held)

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12

Copy link
Copy Markdown
Member

Review complete (50:38)

Posted 4 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants