feat: Async Refresh for Regional Access Boundaries #1880

Open
vverman wants to merge 24 commits into googleapis:feat-tb-sa from vverman:regional-access-boundary-update

@vverman
Contributor

@vverman vverman commented Jan 25, 2026

Contains changes for the Regional Access Boundary (RAB) feature, previously called Trust Boundaries.

The salient changes are:

  • Calls to refresh the RAB are now all asynchronous and run in a separate thread.
  • The RAB refresh logic now lives in its own class for easier maintenance.
  • Self-signed JWTs are within scope.
  • Changes to how we trigger a RAB refresh and handle refresh errors.

@vverman vverman requested review from a team January 25, 2026 21:35
@product-auto-label product-auto-label bot added the size: xl Pull request size is extra large. label Jan 25, 2026
@vverman vverman requested review from lqiu96 and nbayati January 25, 2026 21:49
@lqiu96 lqiu96 changed the base branch from feat-tb-sa to feat/agentic-identities-cloudrun February 3, 2026 19:16
@lqiu96 lqiu96 requested a review from a team as a code owner February 3, 2026 19:16
@vverman vverman changed the base branch from feat/agentic-identities-cloudrun to feat-tb-sa February 6, 2026 22:40
@nbayati
Contributor

nbayati commented Mar 12, 2026

Potential minor issue worth checking out:
The use of private static mutable clock fields in RegionalAccessBoundary.java and RegionalAccessBoundaryManager.java is problematic for two main reasons:

  • Test Isolation: Since the clock is static, it is shared across all instances in the JVM. If one test mocks the clock using setClockForTest, it will inadvertently affect all other tests running in parallel or sequentially. This makes the test suite brittle and prone to non-deterministic failures.
  • Architectural Inconsistency: The rest of the library (such as OAuth2Credentials) uses instance-level clocks. Making these new classes use static clocks deviates from the established pattern and limits production flexibility (e.g., if a user wanted to customize the clock for a specific set of credentials).
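
As a rough illustration of the instance-level alternative, here is a minimal sketch. `BoundaryCache` and its nested `Clock` interface are hypothetical names made up for this example; they are not the classes in this PR, and the library's own clock type is not reproduced here. The only point is that each object carries its own clock, so a fake clock in one test cannot leak into another instance.

```java
/** Hypothetical cache entry with an expiry; a stand-in, not the PR's RegionalAccessBoundary. */
final class BoundaryCache {
  /** Minimal clock abstraction (assumption for this sketch). */
  interface Clock {
    long currentTimeMillis();
  }

  // Instance-level: every BoundaryCache owns its clock, nothing is shared JVM-wide.
  private final Clock clock;
  private final long expiresAtMillis;

  BoundaryCache(Clock clock, long ttlMillis) {
    this.clock = clock;
    this.expiresAtMillis = clock.currentTimeMillis() + ttlMillis;
  }

  /** True once this instance's own clock has reached the expiry time. */
  boolean isExpired() {
    return clock.currentTimeMillis() >= expiresAtMillis;
  }
}
```

A test can then pass a controllable clock such as `() -> now[0]` to one instance while another instance keeps using `System::currentTimeMillis`, with no static setter involved.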

Member

@lqiu96 lqiu96 left a comment

Thanks for creating the follow-up testing issue. The parameter is not ideal, but given our time commitments, I am fine with it.

Took another look and I think this looks good. Could someone from AION take another pass to double check this?

@nbayati
Contributor

nbayati commented Mar 18, 2026

Discussed with Pranav offline. To avoid adding the disableRabRefreshForTest flag and to keep the test suite stable, we've decided to keep GOOGLE_AUTH_TRUST_BOUNDARY_ENABLE_EXPERIMENT in this PR. We'll track the removal of that env var and the corresponding test refactor in issue #1898 to keep this PR focused.

@lsirac
Contributor

lsirac commented Mar 19, 2026

Blocking issues:

  • getRequestMetadata(URI) is not fail-open and can crash on RAB lookup. refreshRegionalAccessBoundaryIfExpired(...) is called with no try/catch block.

  • Scheduling failures permanently block RAB refreshes. In RegionalAccessBoundaryManager.java, you acquire the lock through refreshFuture.compareAndSet(null, future) before calling executor.execute() or Thread.start(). That lock is only released inside the finally block of the refreshTask itself. If executor.execute(...) throws a RejectedExecutionException (or thread creation fails), the task never runs, the finally block is never reached, and RAB will never refresh again for that credential.

  • Stale x-allowed-locations headers can survive serialization round-trips: Serialization resets the live RAB manager but preserves the old x-allowed-locations inside cached OAuth request metadata. After deserialization, we start from the serialized metadata, see no current live RAB to overwrite it with, and we keep sending the old header until access token refresh. It's safer to send no header at all vs. a stale header.

  • Static mutable test hooks break test isolation: clock and maxRetryElapsedTimeMillis are declared as private static in RegionalAccessBoundary.java and RegionalAccessBoundaryManager.java and mutated via @VisibleForTesting setters. This will cause race conditions when test suites run in parallel. Use the established library pattern where Clock is an instance-level transient field (e.g. like in OAuth2Credentials).

  • There is inconsistent host scoping: The PR skips RAB injection only for .rep.googleapis.com endpoints. If a RAB is already cached, addRegionalAccessBoundaryToRequestMetadata attaches the header to all hosts.

  • Impersonated credentials ignore iamEndpointOverride on the RAB path. Is this intentional?
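
To make the scheduling-failure point concrete, here is a minimal sketch of one way to release the slot when the task never starts. `SingleFlightRefresher`, `triggerRefresh`, and `isRefreshInFlight` are hypothetical names for illustration, not the PR's RegionalAccessBoundaryManager; only the `compareAndSet(null, future)` idea is taken from the description above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical single-flight refresher (illustration only). */
final class SingleFlightRefresher {
  private final AtomicReference<CompletableFuture<Void>> refreshFuture =
      new AtomicReference<>();

  /** Returns true if this call managed to schedule a refresh. */
  boolean triggerRefresh(Executor executor, Runnable lookup) {
    CompletableFuture<Void> future = new CompletableFuture<>();
    if (!refreshFuture.compareAndSet(null, future)) {
      return false; // another refresh already holds the slot
    }
    Runnable task = () -> {
      try {
        lookup.run();
        future.complete(null);
      } catch (RuntimeException e) {
        future.completeExceptionally(e);
      } finally {
        // Normal release path: only reached if the task actually runs.
        refreshFuture.compareAndSet(future, null);
      }
    };
    try {
      executor.execute(task);
      return true;
    } catch (RuntimeException | Error e) {
      // Scheduling failed (e.g. RejectedExecutionException, or an Error from
      // thread creation): the task's finally block will never run, so the
      // slot has to be released here or no later refresh can start.
      refreshFuture.compareAndSet(future, null);
      future.completeExceptionally(e);
      return false;
    }
  }

  boolean isRefreshInFlight() {
    return refreshFuture.get() != null;
  }
}
```

With this shape, a rejecting executor leaves `isRefreshInFlight()` false, so the next request can try again.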

RAB lookup should be best-effort and non-blocking, and it should not accidentally become synchronous depending on the executor the caller passes in. Let's also make sure we cover all of these with tests as well.
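
A minimal sketch of what "fail-open" could look like at the metadata level, assuming a hypothetical helper `FailOpenMetadata.withBestEffortHeader` and a `Supplier<String>` standing in for the RAB lookup. Neither exists in the library. Only the `x-allowed-locations` header name comes from this thread. The sketch also drops any header already present in the base metadata, on the assumption that sending no header is preferable to a stale one.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper (illustration only). */
final class FailOpenMetadata {
  static final String HEADER = "x-allowed-locations";

  static Map<String, List<String>> withBestEffortHeader(
      Map<String, List<String>> base, Supplier<String> boundaryLookup) {
    Map<String, List<String>> result = new HashMap<>(base);
    // Never carry over a previously cached value: no header beats a stale header.
    result.remove(HEADER);
    try {
      String locations = boundaryLookup.get();
      if (locations != null && !locations.isEmpty()) {
        result.put(HEADER, Collections.singletonList(locations));
      }
    } catch (RuntimeException e) {
      // Fail open: a broken lookup must not break the caller's request,
      // so the metadata is returned without the header.
    }
    return result;
  }
}
```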

@lsirac lsirac added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 19, 2026
vverman added 2 commits March 19, 2026 19:14
…thout a try catch block. 2. Lock acquiral for refreshFuture.compareAndSet(null, future) now fixed. 3. Oauth2Credentials isn't caching RAB which was earlier leading to serialization issues.
@vverman
Contributor Author

vverman commented Mar 20, 2026

Thanks for the catches. I've addressed the first 4 of Leo's comments, with unit tests.

Regarding the remaining 3 points:

  1. The RAB header being sent to regional endpoints shouldn't be an issue as per the group's discussion. IIUC, this isn't a blocking issue.

  2. I believe it is still an open question whether an IAM-overridden RAB endpoint is even possible. Once that is answered, I can implement accordingly.

  3. The Google Credentials already has a synchronous requestMetadata which doesn't accept an executor. The async getRequestMetadata is the one which accepts a user provided executor which is a pool used to execute requests asynchronously. Here I feel we should respect the user's decision and use that pool for our async RAB refresh as well.

@vverman
Contributor Author

vverman commented Mar 20, 2026

Env vars are back in. @nbayati, would appreciate a look!

@vverman vverman requested a review from nbayati March 20, 2026 06:34
@nbayati
Contributor

nbayati commented Mar 24, 2026

> Thanks for the catches, addressed first 4 of Leo's comments with unit testing.
>
> Regarding the rest of the 3 points
>
> 1. The RAB header being sent to regional endpoints shouldn't be an issue as per the group's discussion. IIUC, this isn't a blocking issue.
> 2. IAM & STS endpoints are excluded from RAB scope.
> 3. The Google Credentials already has a synchronous requestMetadata which doesn't accept an executor. The async getRequestMetadata is the one which accepts a user provided executor which is a pool used to execute requests asynchronously. Here I feel we should respect the user's decision and use that pool for our async RAB refresh as well.

Regarding 1, it seems you've fixed this in the "Using per-instance clocks." commit, so I think we're covered here.

Regarding 3, I see the point you are making about respecting the user’s choice for their own async request. However, I want to share a different perspective on side-effects and resource isolation. The RAB lookup is an internal library feature that happens under the hood. The user did not explicitly ask for it when they requested a token. If the user passes an Executor (per the async path), they are expecting that executor to handle their async request. They do not expect the library to piggyback on it for hidden I/O operations (like the RAB lookup network call).

If a user happens to pass a synchronous executor, their main API request will block and wait for a hidden RAB network trip that they didn't even ask for. I think since the RAB lookups are transparent to the users and are essentially happening behind the scenes, their resource consumption should be transparent too. If we use the unmanaged new Thread() fallback (or create a dedicated internal static pool), it ensures that the library never accidentally blocks a user's threads for internal calls.
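
To illustrate the concern, here is a small self-contained sketch. `ExecutorThreadProbe` and the `"rab-refresh"` thread name are made up for this example. It reports which thread ends up running a task handed to an `Executor`: with a direct executor (`Runnable::run`) the task runs on the caller's own thread, so any network call inside it would block the caller, whereas a dedicated thread keeps the work off the caller's thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;

/** Illustration only: reports which thread runs work handed to an Executor. */
final class ExecutorThreadProbe {
  static String threadUsedBy(Executor executor) {
    String[] name = new String[1];
    CountDownLatch done = new CountDownLatch(1);
    executor.execute(() -> {
      name[0] = Thread.currentThread().getName();
      done.countDown();
    });
    try {
      done.await(); // also makes the write to name[0] visible to this thread
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    return name[0];
  }

  public static void main(String[] args) {
    // A "synchronous" executor: runs the task inline on the calling thread.
    Executor direct = Runnable::run;
    // A dedicated daemon thread, similar in spirit to the new Thread() fallback.
    Executor dedicated = task -> {
      Thread t = new Thread(task, "rab-refresh");
      t.setDaemon(true);
      t.start();
    };
    System.out.println("direct:    " + threadUsedBy(direct));
    System.out.println("dedicated: " + threadUsedBy(dedicated));
  }
}
```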

@vverman
Contributor Author

vverman commented Mar 25, 2026

> If the user passes an Executor (per the async path), they are expecting that executor to handle their async request. They do not expect the library to piggyback on it for hidden I/O operations (like the RAB lookup network call).

IIUC: The executor is a thread pool that the user passes to the auth library to do background work. The expectation is that the lib won't spin up its own threads (which is an expensive operation). Using the caller's executor ensures that the library operates inside the sandbox the user defined for us.

> If a user happens to pass a synchronous executor, their main API request will block and wait for a hidden RAB network trip that they didn't even ask for. I think since the RAB lookups are transparent to the users and are essentially happening behind the scenes, their resource consumption should be transparent too. If we use the unmanaged new Thread() fallback (or create a dedicated internal static pool), it ensures that the library never accidentally blocks a user's threads for internal calls.

IIUC: the async getRequestMetadata accepts a RequestMetadataCallback, whose Javadoc says:

* The callback that receives the result of the asynchronous {@link
 * Credentials#getRequestMetadata(java.net.URI, java.util.concurrent.Executor,
 * RequestMetadataCallback)}. Exactly one method should be called.
 *

Which means it is intended to be used as an async method. While the user could pass in a synchronous executor, I believe we shouldn't consider that the expected flow.

TBF, I was previously doing it the way you are suggesting, but @lqiu96 and I had a discussion about user choice with directExecutors and ended up changing it.

@vverman
Contributor Author

vverman commented Mar 27, 2026

The executor is no longer used for the async RAB refresh; we now start a new Thread instead.

@vverman vverman requested review from lqiu96 and nbayati March 27, 2026 21:04
if (cooldownState.compareAndSet(currentCooldownState, next)) {
  LoggingUtils.log(
      LOGGER_PROVIDER,
      Level.INFO,
Member

Thoughts on this: If we want RAB to be transparent to the user, perhaps we should aim for a lower log level?

Perhaps either Debug or Warn? Info may be confusing as users may not have any idea what RAB is.

Contributor Author

Changed it to FINE, so it is still available when users want to debug.

@lsirac
Contributor

lsirac commented Apr 1, 2026

Some more things:

  • We're spawning raw threads via new Thread(). Across a bunch of cred instances this is unbounded. It would be better to have a private executor / pool.
  • You made clock and maxRetryElapsedTimeMillis instance fields, but environmentProvider in RegionalAccessBoundary.java remains a static mutable field mutated via a @VisibleForTesting setter.
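
For the first point, a minimal sketch of what a private bounded pool might look like. `RabRefreshPool`, the thread-name prefix, and the sizes (2 threads, a queue of 100, 30 seconds keep-alive) are all assumptions for illustration, not values from this PR. It keeps the default rejection policy, which throws `RejectedExecutionException` when the queue is full, so the caller still needs to release any in-flight marker when scheduling fails.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical shared pool for background refreshes (illustration only). */
final class RabRefreshPool {
  private static final int MAX_THREADS = 2;      // assumption
  private static final int QUEUE_CAPACITY = 100; // assumption
  private static final Executor POOL = create();

  private RabRefreshPool() {}

  private static Executor create() {
    AtomicInteger counter = new AtomicInteger();
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(
            MAX_THREADS,
            MAX_THREADS,
            30L,
            TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(QUEUE_CAPACITY),
            runnable -> {
              Thread t = new Thread(runnable, "rab-refresh-" + counter.incrementAndGet());
              t.setDaemon(true); // never keep the JVM alive for background refreshes
              return t;
            });
    // Let idle threads exit so an unused pool holds no threads at all.
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }

  /** Shared by all credential instances, so thread count stays bounded. */
  static Executor executor() {
    return POOL;
  }
}
```

Compared with one `new Thread()` per refresh, the number of threads here is capped regardless of how many credential instances exist.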

Labels

do not merge Indicates a pull request not ready for merge, due to either quality or timing. size: xl Pull request size is extra large.

4 participants