[WIP] Default thin-client to enabled and add proxy connectivity-probe gate #49437
jeet1995 wants to merge 6 commits into
…robe gate

Adds an EndpointOrchestrator that fans out POST /connectivity-probe to every thin-client regional endpoint after each topology refresh. SDK only routes data-plane traffic to thin-client (Gateway V2) when all regional probes succeed across N consecutive refresh cycles (configurable via COSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD, default 2); otherwise traffic falls back to Gateway V1 at the next refresh boundary. No mid-flight fallback.

Caveats:
- Probe wiring is skipped entirely for Direct mode and when HTTP/2 is not configured; controlled by RxDocumentClientImpl.useThinClient.
- QueryPlan, metadata reads, and AllVersionsAndDeletes change feed continue to route through Compute Gateway (Gateway V1).
- Probe failures are absorbed inside the orchestrator and the trigger is fire-and-forget on the GEM scheduler, so probe issues can never trip CosmosClient initialization or fail a topology refresh.
- EndpointOrchestrator implements Closeable and is closed from GlobalEndpointManager.close() so no further probes are issued after client shutdown.

THINCLIENT_ENABLED now defaults to true; opt out via COSMOS.THINCLIENT_ENABLED=false or COSMOS.THINCLIENT_PROBE_ENABLED=false.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR makes thin-client (Gateway V2) routing default-on and adds an HTTP/2 connectivity-probe gate that periodically validates thin-client regional endpoints after topology refreshes, falling back to Gateway V1 when the proxy fleet is deemed unhealthy.
Changes:
- Default `COSMOS.THINCLIENT_ENABLED` to `true` and introduce new probe-related configuration knobs in `Configs`.
- Add `EndpointOrchestrator` and wire it into `GlobalEndpointManager` refresh flows; gate thin-client routing via `isProxyProbeHealthy()`.
- Add unit/integration-style tests covering orchestrator behavior, config parsing, and GEM wiring.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java | Wires thin-client HttpClient into GEM during init; adds probe-health condition to thin-client routing predicate. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/LocationCache.java | Exposes thin-client regional endpoints from the latest topology snapshot for probe fan-out. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/GlobalEndpointManager.java | Hosts and triggers the probe orchestrator after refreshes; exposes probe health/diagnostics; closes orchestrator on shutdown. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/EndpointOrchestrator.java | New component that executes per-region POST /connectivity-probe and maintains hysteresis-based health state. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java | Flips thin-client default to enabled and adds probe enable/threshold/path configuration accessors. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the thin-client default flip and the new probe gating behavior. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/ThinClientProbeWiringTests.java | Verifies GEM wiring, probe triggering on refresh, and endpoint discovery via LocationCache. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/EndpointOrchestratorTests.java | Unit tests for orchestrator hysteresis, error handling, feature flag no-op, and request targeting. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/ConfigsTests.java | Adds coverage for new probe properties and updates thin-client default expectation. |
| * Fixed `UnsupportedOperationException` when using `readManyByPartitionKeys` for empty pages. - See [PR 49311](https://github.com/Azure/azure-sdk-for-java/pull/49311) | ||
|
|
||
| #### Other Changes | ||
| * Defaulted `COSMOS.THINCLIENT_ENABLED=true` and added an HTTP/2 connectivity-probe (`EndpointOrchestrator`) that gates thin-client (Gateway V2) data-plane routing on per-region probe health; thin-client only activates when probes are green for all regional endpoints across N consecutive topology refresh cycles, otherwise traffic falls back to Gateway V1. |
| private void triggerThinClientProbeCycle() { | ||
| try { | ||
| EndpointOrchestrator orchestrator = this.thinClientProbeOrchestrator.get(); | ||
| if (orchestrator == null) { | ||
| return; | ||
| } | ||
| if (!this.hasThinClientReadLocations.get()) { | ||
| return; | ||
| } | ||
| Set<URI> endpoints = this.locationCache.getThinClientRegionalEndpoints(); | ||
| if (endpoints.isEmpty()) { | ||
| return; | ||
| } | ||
| // Fire-and-forget: probe runs out-of-band on the global endpoint manager | ||
| // scheduler. Failures are absorbed inside runProbeCycle and reflected in the | ||
| // orchestrator's internal state, which is consulted at the next routing decision. | ||
| // We additionally guard against any synchronous throw here so a probe issue | ||
| // can never trip CosmosClient initialization or a topology refresh. | ||
| orchestrator | ||
| .runProbeCycle(endpoints) | ||
| .subscribeOn(CosmosSchedulers.GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC) | ||
| .subscribe( | ||
| healthy -> { | ||
| if (logger.isDebugEnabled()) { | ||
| logger.debug("Thin-client probe cycle completed; proxyHealthy={}", healthy); | ||
| } | ||
| }, | ||
| t -> logger.debug("Thin-client probe cycle subscription error", t)); |
| // Now swap to a green client and run another cycle on a fresh orchestrator that already saw a red. | ||
| Map<URI, Integer> greenByEndpoint = new HashMap<>(); | ||
| greenByEndpoint.put(REGION_EAST, 200); | ||
| EndpointOrchestrator greenOrchestrator = new EndpointOrchestrator(mockClient(greenByEndpoint, new AtomicInteger(), false)); | ||
|
|
||
| // Drive greenOrchestrator into the unhealthy state manually by replaying a red first. | ||
| Map<URI, Integer> redOnly = new HashMap<>(); | ||
| redOnly.put(REGION_EAST, 503); | ||
| EndpointOrchestrator combo = new EndpointOrchestrator(toggleClient(REGION_EAST, 503, 200)); |
# Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md
|
/azp run java - cosmos - tests |
|
/azp run java - cosmos - spark |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run java - cosmos - kafka |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@sdkReviewAgent |
jeet1995
left a comment
Code Review: Thin Client Probe Flow
Great work on wiring up the thin-client connectivity probe. I've reviewed the design and implementation, specifically looking for Reactor/Netty lifecycles, concurrency, and API contracts.
Here are a few important findings:
**1. [Resource Lifecycle / Memory Leak] Dangling Subscription in `EndpointOrchestrator`**
File: `sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/EndpointOrchestrator.java`, Line 204
By using `.map` and a fire-and-forget `.subscribe()` on `response.body()`, the buffer draining process escapes the parent `Mono` lifecycle. If the proxy trickles data or hangs on sending the body, this background task will consume resources indefinitely without backpressure or timeout because it's detached from the `send`'s `perProbeTimeout`.
Recommendation: Use `.flatMap` instead to properly chain the asynchronous draining, and apply a timeout to the body draining phase:
```java
.flatMap(response -> {
    int status = response.statusCode();
    boolean ok = status == 200;
    if (!ok) {
        logger.debug("Thin-client probe to {} returned status {}", regionalEndpoint, status);
    }
    // Drain body so reactor-netty releases the underlying buffer, tied to the Reactor lifecycle
    return response.body()
        .doOnNext(buf -> {
            if (buf != null) {
                buf.release();
            }
        })
        .ignoreElement()
        .timeout(this.perProbeTimeout) // Prevent slow-draining bodies from hanging the cycle indefinitely
        .doFinally(s -> safeClose(response))
        .thenReturn(new ProbeResult(regionalEndpoint, ok, "status:" + status))
        .onErrorResume(t -> Mono.just(new ProbeResult(regionalEndpoint, ok, "status:" + status)));
})
```
**2. [Resource Lifecycle] Uncancelled In-Flight Probe Cycles**
File: `sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/GlobalEndpointManager.java`, Line 466
The `orchestrator.runProbeCycle(endpoints).subscribe(...)` returns a `Disposable` which is not stored or tracked. If `GlobalEndpointManager.close()` is called while a probe cycle is in-flight, the cycle will not be cancelled.
Recommendation: Store the `Disposable` in an `AtomicReference` and explicitly `.dispose()` it during `close()`, and also before starting a new cycle to prevent overlapping probes if topology refreshes rapidly.
**3. [Documentation / Consistency] Optimistic vs Pessimistic Startup**
File: `sdk/cosmos/azure-cosmos/CHANGELOG.md`, Line 27
The changelog wording implies a pessimistic startup constraint: "only activates when probes are green... across N consecutive topology refresh cycles". However, `EndpointOrchestrator` implements an optimistic startup (`proxyHealthy` defaults to `true` and trips to `false` after N `RED` cycles), and only requires 1 `GREEN` cycle to restore health.
Recommendation: Update the changelog to accurately reflect the optimistic startup and fallback logic so customers understand that thin-client activates immediately on SDK init.
Nit: In `LocationCache.getThinClientRegionalEndpoints()`, reading `this.locationInfo` outside a synchronized block is technically a stale read if another thread updates it, though it's safe here because `GlobalEndpointManager` calls it from within the write lock.
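To make the optimistic-startup reading in finding 3 concrete, here is a minimal sketch of the state transitions as the comment describes them: healthy from the start, unhealthy after N consecutive RED cycles, healthy again after a single GREEN cycle. `OptimisticProbeGate` and its method names are hypothetical and are not the PR's `EndpointOrchestrator`; the sketch also assumes cycles are recorded one at a time.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the health state described in finding 3.
// Assumes recordCycle is called by one thread at a time.
final class OptimisticProbeGate {
    private final int failureThreshold;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    // Optimistic start: the proxy counts as healthy until probes say otherwise.
    private final AtomicBoolean proxyHealthy = new AtomicBoolean(true);

    OptimisticProbeGate(int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
    }

    /** Records the outcome of one probe cycle and returns the resulting health flag. */
    boolean recordCycle(boolean cycleGreen) {
        if (cycleGreen) {
            // A single GREEN cycle clears the failure streak and restores health.
            consecutiveFailures.set(0);
            proxyHealthy.set(true);
        } else if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
            // Only a streak of RED cycles flips the gate to unhealthy.
            proxyHealthy.set(false);
        }
        return proxyHealthy.get();
    }

    boolean isHealthy() {
        return proxyHealthy.get();
    }

    public static void main(String[] args) {
        OptimisticProbeGate gate = new OptimisticProbeGate(2);
        System.out.println(gate.isHealthy());        // before any probe
        System.out.println(gate.recordCycle(false)); // first RED
        System.out.println(gate.recordCycle(false)); // second RED
        System.out.println(gate.recordCycle(true));  // one GREEN
    }
}
```

With a threshold of 2, the gate reports healthy before any probe runs, which is the behavior the changelog wording would need to describe.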
| if (orchestrator == null) { | ||
| return; | ||
| } | ||
| if (!this.hasThinClientReadLocations.get()) { |
🟡 Recommendation — Correctness: hasThinClientReadLocations and getThinClientRegionalEndpoints() can disagree, bypassing probe safety net
if (!this.hasThinClientReadLocations.get()) {
return;
}
Set<URI> endpoints = this.locationCache.getThinClientRegionalEndpoints();
if (endpoints.isEmpty()) {
return;
}
hasThinClientReadLocations is set from the raw databaseAccount.getThinClientReadableLocations() (line 493), while getThinClientRegionalEndpoints() reads from LocationCache.availableReadRegionalRoutingContextsByRegionName, which requires region-name matching between thin-client and gateway locations. If the names don't match (e.g., normalization difference), the thin-client endpoint is silently dropped from the LocationCache map.
When they disagree: hasThinClientReadLocations=true → routing gate passes in useThinClientStoreModel(). But endpoints.isEmpty()=true → no probe fires → proxyHealthy stays at its optimistic default true. The probe safety net is completely bypassed.
Failure scenario: Service returns a thin-client region name that doesn't match the normalized gateway region key. The LocationCache.updateLocationCache() code that sets RegionalRoutingContext.setThinclientRegionalEndpoint() catches and logs the NPE (line 1055-1062) but the endpoint is lost.
Suggestion: Either derive hasThinClientReadLocations from locationCache.getThinClientRegionalEndpoints().isEmpty() (single source of truth), or add a safeguard in triggerThinClientProbeCycle() that marks the probe unhealthy when hasThinClientReadLocations=true but endpoints are empty.
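The second option in the suggestion can be sketched as a small decision function. This is an illustration under assumptions, not the PR's code: `ProbeTriggerDecision` and its `Action` values are hypothetical names, and the real `triggerThinClientProbeCycle()` would act on the orchestrator rather than return an enum.

```java
import java.net.URI;
import java.util.Set;

// Hypothetical helper: decides what the probe trigger should do for one refresh.
final class ProbeTriggerDecision {
    enum Action { SKIP, FORCE_UNHEALTHY, RUN_PROBE }

    private ProbeTriggerDecision() {
    }

    static Action decide(boolean hasThinClientReadLocations, Set<URI> endpoints) {
        if (!hasThinClientReadLocations) {
            // Topology does not advertise thin-client; routing is already gated off.
            return Action.SKIP;
        }
        if (endpoints.isEmpty()) {
            // Topology says thin-client is eligible but nothing resolved to probe:
            // fail closed instead of leaving the optimistic default in place.
            return Action.FORCE_UNHEALTHY;
        }
        return Action.RUN_PROBE;
    }
}
```

Failing closed in the middle branch is the design choice here: the two sources of truth may still disagree, but the disagreement can no longer leave traffic on an unprobed proxy.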
| private boolean useThinClientStoreModel(RxDocumentServiceRequest request) { | ||
| if (!useThinClient | ||
| || !this.globalEndpointManager.hasThinClientReadLocations() | ||
| || !this.globalEndpointManager.isProxyProbeHealthy() |
🟡 Recommendation — Test Coverage: No test proves the routing fallback when probe is unhealthy
|| !this.globalEndpointManager.isProxyProbeHealthy()
This is the most critical behavioral change in the PR: the probe health gates whether data-plane traffic routes to thin-client or falls back to Gateway V1. But no test verifies the routing consequence: that when isProxyProbeHealthy() returns false, useThinClientStoreModel() returns false and getStoreProxy() returns gatewayProxy instead of thinProxy.
The existing tests verify the probe state machine (health flag flips) and the GEM integration (probe fires on refresh), but stop short of proving the downstream routing behavior. If someone later refactors the condition order in useThinClientStoreModel() or accidentally removes this check, no test would catch it.
Suggestion: Add a test (e.g., in a new ThinClientRoutingGateTests or extending ThinClientProbeWiringTests) that:
- Wires a GEM with a stubbed orchestrator returning all-503.
- Drives the probe to unhealthy (cross threshold).
- Asserts `useThinClientStoreModel(documentPointReadRequest)` returns `false`.
- Restores probe to healthy and asserts it returns `true` again.
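One way to make the routing consequence directly testable is to isolate the predicate as a pure function of its inputs. The sketch below is a simplification under assumptions: `ThinClientRoutingGate`, `shouldUseThinClient`, and the `operationEligible` flag (standing in for whatever request-level checks the real predicate performs) are hypothetical, and only the first three conditions are taken from the diff excerpt above.

```java
// Hypothetical, simplified version of the routing predicate for illustration.
final class ThinClientRoutingGate {
    enum Proxy { THIN, GATEWAY }

    private ThinClientRoutingGate() {
    }

    static boolean shouldUseThinClient(boolean useThinClient,
                                       boolean hasThinClientReadLocations,
                                       boolean proxyProbeHealthy,
                                       boolean operationEligible) {
        // Any failed condition sends the request to Gateway V1.
        return useThinClient
            && hasThinClientReadLocations
            && proxyProbeHealthy
            && operationEligible;
    }

    static Proxy selectProxy(boolean useThinClient,
                             boolean hasThinClientReadLocations,
                             boolean proxyProbeHealthy,
                             boolean operationEligible) {
        return shouldUseThinClient(useThinClient, hasThinClientReadLocations,
                proxyProbeHealthy, operationEligible)
            ? Proxy.THIN
            : Proxy.GATEWAY;
    }

    public static void main(String[] args) {
        // Unhealthy probe: falls back to the gateway proxy.
        System.out.println(selectProxy(true, true, false, true));
        // Healthy probe: routes to the thin-client proxy.
        System.out.println(selectProxy(true, true, true, true));
    }
}
```

A test against a function of this shape fails as soon as the probe-health condition is dropped or reordered into a short-circuit that skips it.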
| */ | ||
| public Set<URI> getThinClientRegionalEndpoints() { | ||
| UnmodifiableMap<String, RegionalRoutingContext> byRegion = | ||
| this.locationInfo.availableReadRegionalRoutingContextsByRegionName; |
🟢 Suggestion — Coverage Gap: Only read-region thin-client endpoints are probed
UnmodifiableMap<String, RegionalRoutingContext> byRegion =
this.locationInfo.availableReadRegionalRoutingContextsByRegionName;
This method collects thin-client endpoints exclusively from the read regional routing contexts. However, useThinClientStoreModel() routes writes (point operations, batch) through thin-client too. If a write-only region has a separate thin-client endpoint (stored in availableWriteRegionalRoutingContextsByRegionName), that endpoint is never probed.
Consequence: If the write-region thin-client proxy goes down but read-region probes pass, isProxyHealthy() stays true and write traffic continues routing to the dead proxy.
This may be dormant in practice if Cosmos DB always returns the same regions for both read and write thin-client locations. But the gap exists conceptually, and adding write-region endpoints to the probe set would be a straightforward defense-in-depth improvement.
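The defense-in-depth variant amounts to taking the union of both maps. A minimal sketch, assuming each map can be reduced to region name to thin-client `URI` (the real maps hold `RegionalRoutingContext` values); `ThinClientEndpointCollector` is a hypothetical name:

```java
import java.net.URI;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: gathers the probe set from read and write region maps.
final class ThinClientEndpointCollector {
    private ThinClientEndpointCollector() {
    }

    static Set<URI> collect(Map<String, URI> readByRegion, Map<String, URI> writeByRegion) {
        // A set de-duplicates regions that appear in both maps.
        Set<URI> endpoints = new LinkedHashSet<>();
        addNonNull(endpoints, readByRegion);
        addNonNull(endpoints, writeByRegion);
        return Collections.unmodifiableSet(endpoints);
    }

    private static void addNonNull(Set<URI> target, Map<String, URI> byRegion) {
        for (URI endpoint : byRegion.values()) {
            // Regions without a resolved thin-client endpoint are skipped.
            if (endpoint != null) {
                target.add(endpoint);
            }
        }
    }
}
```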
| } | ||
|
|
||
| private static HttpResponse stubResponse(int status) { | ||
| ByteBuf empty = Unpooled.EMPTY_BUFFER; |
🟢 Suggestion — Test Quality: Unpooled.EMPTY_BUFFER singleton causes silent body-drain failures
ByteBuf empty = Unpooled.EMPTY_BUFFER;
return new HttpResponse() {
...
@Override public Mono<ByteBuf> body() { return Mono.just(empty); }
...
};
Unpooled.EMPTY_BUFFER is a global singleton with refCnt=1. The production code calls buf.release() on it. After the first probe endpoint drains the body, refCnt drops to 0. Every subsequent buf.release() (for additional probe endpoints in the same test, or across tests) throws IllegalReferenceCountException, which is silently swallowed by the t -> {} error handler.
This means the body-drain path is exercised correctly only for the first probe in the first test — all subsequent body drains silently fail. The tests still pass because assertions only check the health flag, not body-drain behavior.
In production, ReactorNettyHttpResponse.body() on an empty response returns Mono.empty() (via ByteBufFlux.fromInbound().aggregate() on an empty stream), so onNext never fires and buf.release() is never called.
Suggestion: Return Mono.empty() from body() to match real HTTP/2 empty-body behavior:
@Override public Mono<ByteBuf> body() { return Mono.empty(); }
| this.lastFailedEndpoints.set(Collections.unmodifiableSet(failedEndpoints)); | ||
|
|
||
| if (cycleGreen) { | ||
| int prior = this.consecutiveFailures.getAndSet(0); |
🟡 Recommendation — Correctness: Asymmetric recovery can cause oscillation under flapping
int prior = this.consecutiveFailures.getAndSet(0);
...
this.proxyHealthy.set(true);
The probe requires N consecutive RED cycles (default 2) to flip to unhealthy, but only 1 GREEN cycle to immediately restore health. This asymmetry creates an oscillation risk if a region is flapping (intermittently returning 200/503).
Scenario: With threshold=2, a flapping region produces: RED → RED → flip unhealthy → GREEN → restore healthy → RED → RED → flip unhealthy → GREEN → restore… The routing target flip-flops at every refresh boundary that happens to land on a GREEN probe.
The Rust Cosmos SDK addresses this with jittered failback — after marking unhealthy, it transitions through a ProbeCandidate state with a recovery cooldown before fully restoring health.
Suggestion: Consider either:
- Requiring M consecutive GREEN cycles to restore (matching the RED threshold), or
- Adding a minimum cooldown duration before re-enabling (e.g., `if (Instant.now().isAfter(lastFailureAt.plus(cooldown)))`)
This would prevent rapid routing oscillation while still allowing recovery.
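The first option (M consecutive GREEN cycles to restore) can be sketched as follows. `SymmetricProbeGate` and both threshold parameters are hypothetical names chosen for illustration; the cooldown option is not shown.

```java
// Hypothetical gate with separate failure and recovery thresholds.
final class SymmetricProbeGate {
    private final int failureThreshold;
    private final int recoveryThreshold;
    private int consecutiveRed;
    private int consecutiveGreen;
    private boolean healthy = true; // optimistic start, as in the PR

    SymmetricProbeGate(int failureThreshold, int recoveryThreshold) {
        if (failureThreshold < 1 || recoveryThreshold < 1) {
            throw new IllegalArgumentException("thresholds must be >= 1");
        }
        this.failureThreshold = failureThreshold;
        this.recoveryThreshold = recoveryThreshold;
    }

    /** Records one probe cycle and returns the resulting health flag. */
    synchronized boolean recordCycle(boolean cycleGreen) {
        if (cycleGreen) {
            consecutiveRed = 0;
            consecutiveGreen++;
            // Recovery needs an unbroken GREEN streak, not a single lucky probe.
            if (!healthy && consecutiveGreen >= recoveryThreshold) {
                healthy = true;
            }
        } else {
            consecutiveGreen = 0;
            consecutiveRed++;
            if (healthy && consecutiveRed >= failureThreshold) {
                healthy = false;
            }
        }
        return healthy;
    }

    public static void main(String[] args) {
        SymmetricProbeGate gate = new SymmetricProbeGate(2, 2);
        boolean[] cycles = {false, false, true, false, true, true};
        for (boolean green : cycles) {
            System.out.println((green ? "GREEN" : "RED") + " -> healthy=" + gate.recordCycle(green));
        }
    }
}
```

With both thresholds at 2, the flapping sequence RED, RED, GREEN, RED keeps the gate unhealthy throughout, because the lone GREEN never completes a recovery streak.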
| logger.debug("Thin-client probe to {} returned status {}", regionalEndpoint, status); | ||
| } | ||
| // Drain body so reactor-netty releases the underlying buffer. | ||
| response.body() |
🟡 Recommendation — Resource Handling: Detached body drain with inert cleanup
response.body()
.doFinally(s -> safeClose(response))
.subscribe(buf -> { if (buf != null) buf.release(); }, t -> { });
return new ProbeResult(regionalEndpoint, ok, "status:" + status);
This creates a detached subscription inside .map(): the ProbeResult is returned to the outer Mono immediately, while the body drain runs independently as an orphaned reactive chain. Two issues:
- `safeClose(response)` is a no-op. `HttpResponse.close()` is an empty method (line 108-110 of `HttpResponse.java`), and `ReactorNettyHttpResponse` does not override it. The `doFinally` provides zero cleanup; it only gives a false sense of safety.
- SDK convention is to compose body consumption into the main chain. Across the codebase, body consumption is always part of the reactive pipeline: `ClientTelemetry.java:232` uses `.flatMap(HttpResponse::bodyAsString)`, `HttpClientUtils.java:42` uses `httpResponse.bodyAsString()`, `ErrorUtils.java:20` chains `bodyAsString()`. This detached subscribe diverges from that pattern.
For an empty probe response, this is likely benign in practice. But if the error handler t -> {} fires (e.g., connection reset mid-body), the error is silently swallowed AND safeClose does nothing — the HTTP/2 stream cleanup is entirely deferred to reactor-netty internals.
Suggestion: Compose the body drain into the returned Mono:
return this.httpClient
.send(request, this.perProbeTimeout)
.flatMap(response -> {
int status = response.statusCode();
boolean ok = status == 200;
return response.bodyAsString()
.defaultIfEmpty("")
.map(ignored -> new ProbeResult(regionalEndpoint, ok, "status:" + status));
})
This matches SDK conventions and ensures the body is drained before the ProbeResult is emitted.
|
✅ Review complete (49:43) Posted 6 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
Deep review findings for PR #49437:
|
Review: thin-client connectivity-probe gate
Deep review focused on Reactor correctness, dangling workers, NPE/
🔴 HIGH-1: Default-on + optimistic start + 5-min probe cadence + no mid-flight fallback = minutes-long outage window
When the proxy fleet is dead but the account/regions are healthy, nothing forces a topology refresh (503 paths in
Forced refreshes (410/Gone, region failover via
Suggestion: a startup
🔴 HIGH-2: Hysteresis state machine is neither single-flight nor ordered → eager and missed failover
Suggestion: an
🟡 MEDIUM: All-or-nothing cycle health is coarse
⚪ Minor
✅ Good: correct wiring-before-
jeet1995
left a comment
Submitting these findings as inline comments for visibility.
| logger.debug("Thin-client probe to {} returned status {}", regionalEndpoint, status); | ||
| } | ||
| // Drain body so reactor-netty releases the underlying buffer. | ||
| response.body() |
[Resource Lifecycle / Memory Leak] — Detached reactive chain (Dangling subscription).
By using .map and a fire-and-forget .subscribe() on response.body(), the buffer draining process escapes the parent Mono lifecycle. If the proxy trickles data or hangs on sending the body, this background task will consume resources indefinitely without backpressure or timeout because it is detached from the send's perProbeTimeout.
Recommendation: Use .flatMap instead to properly chain the asynchronous draining:
.flatMap(response -> {
int status = response.statusCode();
boolean ok = status == 200;
if (!ok) {
logger.debug("Thin-client probe to {} returned status {}", regionalEndpoint, status);
}
return response.body()
.doOnNext(buf -> {
if (buf != null) buf.release();
})
.ignoreElement()
.timeout(this.perProbeTimeout)
.doFinally(s -> safeClose(response))
.thenReturn(new ProbeResult(regionalEndpoint, ok, "status:" + status))
.onErrorResume(t -> Mono.just(new ProbeResult(regionalEndpoint, ok, "status:" + status)));
})
|
||
| #### Other Changes | ||
| * Added HTTP/2 PING keepalive (default ON) for Gateway service endpoints to detect silently-broken connections. - See [PR 49095](https://github.com/Azure/azure-sdk-for-java/pull/49095) | ||
| * Defaulted `COSMOS.THINCLIENT_ENABLED=true` and added an HTTP/2 connectivity-probe (`EndpointOrchestrator`) that gates thin-client (Gateway V2) data-plane routing on per-region probe health; thin-client only activates when probes are green for all regional endpoints across N consecutive topology refresh cycles, otherwise traffic falls back to Gateway V1. |
[Documentation / Consistency]
The wording here implies a pessimistic startup constraint: "only activates when probes are green... across N consecutive topology refresh cycles".
However, EndpointOrchestrator implements an optimistic startup (proxyHealthy defaults to true and trips to false after N RED cycles), and only requires 1 GREEN cycle to restore health.
Should update the changelog to accurately reflect the optimistic startup and fallback logic so customers understand that thin-client activates immediately.
| // We additionally guard against any synchronous throw here so a probe issue | ||
| // can never trip CosmosClient initialization or a topology refresh. | ||
| orchestrator | ||
| .runProbeCycle(endpoints) |
[Resource Lifecycle] — Uncancelled In-Flight Probe Cycles.
The orchestrator.runProbeCycle(endpoints).subscribe(...) returns a Disposable which is not stored or tracked. If GlobalEndpointManager.close() is called while a probe cycle is in-flight, the cycle will not be cancelled.
Recommendation: Store the Disposable in an AtomicReference<Disposable> and explicitly .dispose() it during close(), and also before starting a new cycle to prevent overlapping probes if topology refreshes rapidly.
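The recommendation follows a common pattern: keep the handle of the in-flight work in an `AtomicReference`, swap it on every new cycle, and cancel it on close. The sketch below illustrates the pattern with `java.util.concurrent.Future` standing in for Reactor's `Disposable` so that it compiles without Reactor on the classpath; `InFlightProbeTracker` is a hypothetical name, and in the SDK the calls would be `dispose()` rather than `cancel(true)`.

```java
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical tracker; Future is a stand-in for reactor.core.Disposable.
final class InFlightProbeTracker implements AutoCloseable {
    private final AtomicReference<Future<?>> inFlight = new AtomicReference<>();
    private final AtomicBoolean closed = new AtomicBoolean(false);

    /** Registers a new cycle and cancels the cycle it replaces, if any. */
    void track(Future<?> cycle) {
        Future<?> previous = inFlight.getAndSet(cycle);
        if (previous != null) {
            // Avoid overlapping probes when topology refreshes arrive rapidly.
            previous.cancel(true);
        }
        if (closed.get()) {
            // close() raced with this call; do not leave the new cycle running.
            cycle.cancel(true);
        }
    }

    /** Cancels whatever is still in flight so probe work cannot outlive the owner. */
    @Override
    public void close() {
        closed.set(true);
        Future<?> current = inFlight.getAndSet(null);
        if (current != null) {
            current.cancel(true);
        }
    }
}
```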
…in lifecycle, cancellable in-flight probes

- EndpointOrchestrator: fold body-drain into probe Mono via flatMap+then(perProbeTimeout) so a slow/hanging response body cannot leak resources outside the cycle budget (Copilot #1, deep-review #3, jeet HIGH-2 minor).
- EndpointOrchestrator: add single-flight CAS (cycleInProgress) plus monotonic cycle id; closed-check inside applyCycleResult drops late results so a post-close cycle cannot mutate health state (deep-review #1+#2, jeet HIGH-2).
- EndpointOrchestrator: re-evaluate closed/feature-flag/endpoints at subscription time via Mono.defer so GEM.close() cancellation is honored before any HTTP I/O is issued.
- GlobalEndpointManager: retain probe Disposable in AtomicReference; close() now disposes the in-flight probe subscription so probe work cannot outlive the GEM/CosmosClient (Copilot #2, deep-review #2).
- CHANGELOG: moved entry to unreleased 4.82.0-beta.1, reworded to honestly describe optimistic startup, N=2 RED-to-fallback hysteresis, and Direct-mode/metadata exclusion (Copilot #3, deep-review #4).

Tests: 45 unit tests pass (EndpointOrchestratorTests + ConfigsTests + ThinClientProbeWiringTests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addressed review feedback in d2ec2f8
Thanks for the deep reviews. Pushed fixes for the convergent blocking issues:
1. Body-drain leak (Copilot #1, deep-review #3)
2. Cancellable in-flight probes on close (Copilot #2, deep-review #2)
3. Stale / overlapping cycles (deep-review #1, jeet HIGH-2)
4. CHANGELOG honesty (Copilot #3, deep-review #4)
Tests: 45 unit tests pass (
Deferred (separate PR): the resolved-context NPE risk in
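For readers unfamiliar with the "single-flight CAS plus monotonic cycle id" fix named in item 3, a rough sketch of the idea follows. It is an illustration only: `SingleFlightCycleGuard`, `tryBegin`, and `end` are hypothetical names, and the actual `cycleInProgress` / `applyCycleResult` logic in the PR may differ.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical guard: at most one probe cycle runs at a time, and results
// that arrive after close() are dropped.
final class SingleFlightCycleGuard {
    private final AtomicBoolean cycleInProgress = new AtomicBoolean(false);
    private final AtomicLong cycleIds = new AtomicLong();
    private volatile boolean closed;

    /** Returns a new cycle id, or -1 if a cycle is already running or the guard is closed. */
    long tryBegin() {
        if (closed || !cycleInProgress.compareAndSet(false, true)) {
            return -1L;
        }
        return cycleIds.incrementAndGet();
    }

    /** Ends the given cycle; returns true only if its result may still be applied. */
    boolean end(long cycleId) {
        if (cycleId != cycleIds.get()) {
            return false; // not the active cycle, ignore
        }
        cycleInProgress.set(false);
        return !closed; // late results after close() must not mutate health state
    }

    void close() {
        closed = true;
    }
}
```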
- LocationCache.getThinClientRegionalEndpoints now walks both read and write region endpoint maps so single-master write-region failures still flip the probe gate.
- EndpointOrchestrator.forceUnhealthy(reason) provides a non-HTTP path to flip the gate; GlobalEndpointManager calls it when topology says thin-client is eligible but no regional endpoint resolves.
- Symmetric hysteresis: new COSMOS.THINCLIENT_PROBE_RECOVERY_THRESHOLD (default 1) so operators can require N consecutive GREEN cycles before flipping back to proxy.
- Extracted RxDocumentClientImpl.useThinClientStoreModel(...) body into package-private static shouldUseThinClientStoreModel for direct unit testability; added ThinClientRoutingGateTests covering 9 routing paths.
- EndpointOrchestratorTests.stubResponse now returns Mono.empty() to avoid Unpooled.EMPTY_BUFFER refCnt underflow across multiple probe calls.
- Removed unused locals; added recoveryThresholdRequiresMultipleGreenCycles, forceUnhealthy_flipsGateToRedWithoutRunningProbe, forceUnhealthy_onClosedOrchestrator_isNoOp tests.

All 57 unit tests in the touched files pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Second batch of fixes pushed in 3f1b1be. Per-comment mapping:
Plus three new EndpointOrchestratorTests: recoveryThresholdRequiresMultipleGreenCycles, forceUnhealthy_flipsGateToRedWithoutRunningProbe, forceUnhealthy_onClosedOrchestrator_isNoOp. Validated: |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Third batch addressed in 66fca70 — quick summary:
|
…cleanup, fix gwV2Cto and ThinClient user-agent assertions

- GlobalEndpointManager: convert thin-client probe trigger to a Mono<Void> chained into the topology-refresh reactor pipeline (replaces fire-and-forget subscribe). Removes thinClientProbeDisposable field and its close() handling since cancellation now propagates through the outer subscription.
- EndpointProbeClient/EndpointProbeClientTests/ThinClientProbeWiringTests: replace inline FQNs with imports (java.io.Closeable, java.util.List, java.net.ConnectException, com.azure.cosmos.implementation.http.HttpHeaders).
- ClientConfigDiagnosticsTest: compute gwV2Cto dynamically from Configs.isThinClientEnabled() so assertions remain valid after the default flip to true.
- ConfigsTests: update default-threshold assertions from 2 to 1 to match DEFAULT_THINCLIENT_PROBE_FAILURE_THRESHOLD=1.
- UserAgentContainerTest.UserAgentIntegration: expect '|F4' suffix because the ThinClient feature flag (1 << 2) is now included by default.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Pushed Round 4 / earlier feedback (re-confirmed):
Round 5 (new): Round 6 (new): Round 7 (new):
Verification:
|
|
/azp run java - cosmos - tests |
|
/azp run java - cosmos - spark |
|
/azp run java - cosmos - kafka |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
Summary
Adds an `EndpointOrchestrator` that fans out `POST /connectivity-probe` to every thin-client regional endpoint after each topology refresh. The SDK only routes data-plane traffic through thin-client (Gateway V2) when all regional probes return HTTP 200 across N consecutive refresh cycles; otherwise traffic falls back to Gateway V1 at the next refresh boundary. No mid-flight fallback.

`COSMOS.THINCLIENT_ENABLED` now defaults to `true`. The new probe gate makes that safe by closing thin-client routing automatically if the proxy fleet is unreachable.

Gating caveats
- … `Http2ConnectionConfig` is configured and effectively enabled.
- … `useThinClientStoreModel` predicate.
- … `CosmosClient` initialization or fail a topology refresh.
- `EndpointOrchestrator` implements `Closeable` and is closed from `GlobalEndpointManager.close()`; no further probes are issued after client shutdown.

Configuration
| Property | Default |
|---|---|
| `COSMOS.THINCLIENT_ENABLED` | `true` (was `false`) |
| `COSMOS.THINCLIENT_PROBE_ENABLED` | `true` |
| `COSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD` | `2` |
| `COSMOS.THINCLIENT_PROBE_PATH` | `/connectivity-probe` |

Tests
- … `EndpointOrchestrator` (hysteresis, RED/GREEN flips, no-op gates).
- `Configs` tests for the new properties (parse, fallback, invalid input).
- `ThinClientProbeWiringTests` for GEM integration (probe fires on refresh, healthy default, threshold flip, region discovery via `LocationCache`).
- … `mvn -Punit` on `azure-cosmos-tests`.
- `ThinClientE2ETest` continues to pass against a live multi-region thin-client account.

Changelog
Single entry added under `4.81.0-beta.1` -> Other Changes.
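As a rough illustration of how a knob such as `COSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD` might be read with a fallback, here is a minimal sketch. It does not reproduce the `Configs` accessors from the PR: `ProbeThresholdConfig` is a hypothetical class, falling back to the default on blank, non-numeric, or non-positive input is an assumption, and the default of 2 is the value listed in this description (a later commit in the conversation mentions a default of 1).

```java
// Hypothetical reader for the failure-threshold property; not the PR's Configs class.
final class ProbeThresholdConfig {
    static final String PROPERTY = "COSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD";
    static final int DEFAULT_THRESHOLD = 2; // value from the PR description

    private ProbeThresholdConfig() {
    }

    /** Parses a raw property value, falling back to the default when it is unusable. */
    static int parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_THRESHOLD;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            // Assumption: a threshold below 1 is meaningless, so treat it as invalid.
            return value >= 1 ? value : DEFAULT_THRESHOLD;
        } catch (NumberFormatException e) {
            return DEFAULT_THRESHOLD;
        }
    }

    static int fromSystemProperty() {
        return parse(System.getProperty(PROPERTY));
    }

    public static void main(String[] args) {
        // Run with -DCOSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD=3 to override the default.
        System.out.println(fromSystemProperty());
    }
}
```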