Skip to content

Add SPIFFE v2 URI-SAN based principal extraction (DEPEND-89172)#142

Open
Sanju98 wants to merge 3 commits into
linkedin:branch-3.6from
Sanju98:sanju98/spiffe-v2-auth
Open

Add SPIFFE v2 URI-SAN based principal extraction (DEPEND-89172)#142
Sanju98 wants to merge 3 commits into
linkedin:branch-3.6from
Sanju98:sanju98/spiffe-v2-auth

Conversation

@Sanju98
Copy link
Copy Markdown

@Sanju98 Sanju98 commented May 13, 2026

Description

Adds server-side support for SPIFFE-issued client certificates in ZooKeeper's X509 authentication. When configured by an operator, ZK accepts both SPIFFE v1 and v2 URI SANs as workload principals and resolves them to ACL domains via the existing znode-tree group-ACL provider. Aligned with the LISPIFFE-ID spec and the documented acl-tool ... --spiffe "<prefix>/*" wildcard idiom. JIRA: DEPEND-89172.

Main changes (zookeeper-server/src/main/java/.../auth/...):

  • X509AuthenticationUtil.matchAndExtractSpiffeSAN extracts the principal from SPIFFE URI SANs in two supported forms:

    • v2 (spiffe://<td>/v2/<path>) — principal is the full ILM UID (path-after-/v2/), e.g. spiffe://prod.lipki/v2/application/foo-mp/bar-appapplication/foo-mp/bar-app.
    • v1 workload (spiffe://<td>/v1/wl/<app-name>) — principal is just <app-name> (the wl/ type prefix is stripped, matching how legacy authZ handled v1 identities).

    Falls through to existing URN / Subject-DN extraction for:

    • User-identity URIs (/v<N>/user/...) — never promoted to a service principal (applies to both v1 and v2)
    • Non-{v1/wl, v2} paths (e.g. v1/wf/ workflow — not in scope for ZK)
    • URIs containing % in the path (defense-in-depth: URI.getRawPath() + reject % blocks %2F-smuggling and %75ser user-bypass)
    • Malformed URIs
  • X509AuthenticationConfig adds one operator-controlled SPIFFE knob:

    • ssl.x509.spiffe.sanMatchRegex — coarse trust-domain / version gate (recommended: ^spiffe://prod\.lipki/v[12]/.*$). Lazy-loaded via double-checked locking against a dedicated lock object with volatile fields, matching the pattern used by allowedClientIdAsAclDomains and peers in the same class.
  • ZkClientUriDomainMappingHelper recursively walks the znode subtree below each domain. Only leaf znodes are registered as keys (path joined by /), letting multi-segment SPIFFE UIDs be expressed as nested znodes (znode names can't contain /). getDomains(clientUri) does exact-match first, then segment-prefix walk-up on the /-aligned segments. The internal mapping field is volatile and getDomains snapshots the reference once at function entry. X509ZNodeGroupAclProvider's updater lambda calls helper.getDomains(clientId) so the walk-up is reachable on every TLS handshake.

Operator workflow:

Goal Create leaf znode
v2 MP-wide grant `//application/`
v2 app-specific grant `//application//`
v2 app+tag-specific grant `//application///`
v1 workload grant `//`
Legacy URN (unchanged) `//urn:li:servicePrincipal(...;...)`

Tests

The following tests were added/updated:

  • X509SpiffeAuthIntegrationTest (new file, 6 tests) — real BouncyCastle-signed certs through X509AuthenticationProvider.handleAuthentication. Covers: v1/wl happy-path extraction, v2 happy-path extraction, user-identity rejection, non-SPIFFE URI fall-through, plus two defense-in-depth cases (%2F-smuggling rejected, %75ser user-bypass rejected).
  • X509AuthTest (13 tests total; 9 SPIFFE tests added, 4 pre-existing kept) — mock-cert coverage: v1/wl happy path, v2 happy paths (application/... + workload/... forms), v1+v2 user-identity rejection, multi-SAN error path, URN+SPIFFE coexistence (SPIFFE preferred when both present), SPIFFE-configured-but-no-match falls to URN, SPIFFE-not-configured falls to URN.
  • ZkClientUriDomainMappingHelperTest (11 tests total; 9 new, 2 pre-existing kept):
    • testA2_RecursiveZNodeWalkRegistersMultiLevelClientUris — real znode tree exercises leaf-only registration + prefix walk-up
    • testA3_SpiffeCertResolvesThroughZNodeMappingToDomainAuthInfo — end-to-end: real SPIFFE cert → X509ZNodeGroupAclProvider.handleAuthentication → exact-match leaf in znode tree → (x509, <domain>) authInfo
    • testA4_SpiffeCertResolvesViaPrefixWalkUpToDomainAuthInfo — end-to-end via segment-prefix walk-up: real cert with a 4-segment principal resolves through a 2-segment MP-level leaf grant on the production code path
    • testC_* (6 tests) — in-memory getDomains unit tests: exact-match, single-prefix walk-up, multi-segment union, strict segment alignment, URN-style id (no walk-up), null clientUri
  • SpiffeAuthTestUtil (new) — shared test fixtures: BouncyCastle bootstrap, SPIFFE system-property set/clear, real cert builder with URI SANs, stub AcceptAllTrustManager / NoopKeyManager.
  • X509TestHelpers.newCertWithSans (new) — companion to newCert that installs caller-supplied SANs instead of the default localhost SANs.

Result of mvn test on the SPIFFE-related test classes:

```
Tests run: 30, Failures: 0, Errors: 0, Skipped: 0

  • X509AuthTest 13/13
  • X509SpiffeAuthIntegrationTest 6/6
  • ZkClientUriDomainMappingHelperTest 11/11
    ```

Changes that Break Backward Compatibility

None.

  • Until the operator sets ssl.x509.spiffe.sanMatchRegex, SPIFFE extraction is dormant (getSpiffeSanMatchPattern() returns null) and the existing URN / Subject-DN extraction flow is unchanged byte-for-byte.
  • Legacy URN-style ACL entries in /zookeeper/uri-domain-map/<domain>/urn:li:servicePrincipal(...;...) continue to resolve via exact-match in getDomains. The prefix walk-up only fires when the exact-match misses, and only if the client URI contains / — URN strings never do.
  • Three previously-introduced SPIFFE config keys (ssl.x509.spiffe.sanExtractRegex, ssl.x509.spiffe.sanExtractMatcherGroupIndex, ssl.x509.spiffe.clientIdPrefix) are removed. They were added in an earlier iteration of this work and were never set in any deployed configuration (verified against the prod `acm` ensemble's `zoo.cfg`). Under the doc-aligned design the principal is the full ILM UID (v2) or just the app-name (v1/wl), so per-cert extract regex / prefix knobs are no longer meaningful.

🤖 Generated with Claude Code

@Sanju98 Sanju98 requested a review from laxman-ch May 14, 2026 04:13
@Sanju98 Sanju98 self-assigned this May 14, 2026
@Sanju98 Sanju98 requested a review from abhilash1in May 14, 2026 04:13
@Sanju98
Copy link
Copy Markdown
Author

Sanju98 commented May 14, 2026

Code review

Found 2 issues:

  1. getSpiffeSanMatchPattern() uses unsynchronized lazy-init for non-volatile fields, racing on the per-handshake auth path. Other lazy-init fields in the same class (e.g. allowedClientIdAsAclDomains) use the established volatile + synchronized double-checked-lock pattern; the PR even adds volatile to ZkClientUriDomainMappingHelper.clientUriToDomainNames for the same reason, but misses these new fields. Two threads can both observe loaded=false and concurrently compile the regex; worse, a reader can see loaded=true and a stale-null spiffeSanMatchPattern due to JMM reordering.

https://github.com/Sanju98/zookeeper/blob/241d2d49eb262eab30147e513e9350689c3b15d6/zookeeper-server/src/main/java/org/apache/zookeeper/server/auth/X509AuthenticationConfig.java#L110-L115

https://github.com/Sanju98/zookeeper/blob/241d2d49eb262eab30147e513e9350689c3b15d6/zookeeper-server/src/main/java/org/apache/zookeeper/server/auth/X509AuthenticationConfig.java#L259-L268

  1. The new segment-prefix walk-up in ZkClientUriDomainMappingHelper.getDomains() is unreachable from the production auth path. The setDomainAuthUpdater lambda in X509ZNodeGroupAclProvider looks up the domain via clientUriToDomainNames.getOrDefault(clientId, emptySet()) directly on the snapshotted map, bypassing getDomains() entirely. As a result, MP-level prefix grants documented in the PR description and in the ZkClientUriDomainMappingHelper class javadoc (operator registers <root>/<domain>/application/<mp> as a leaf to grant all apps under that MP) will NOT work in production — clients presenting application/<mp>/<some-app>/<tag> receive empty domain authInfo. The new testA3_* test happens to register an exact-match leaf, so it does not exercise the broken path.

https://github.com/Sanju98/zookeeper/blob/241d2d49eb262eab30147e513e9350689c3b15d6/zookeeper-server/src/main/java/org/apache/zookeeper/server/auth/znode/groupacl/X509ZNodeGroupAclProvider.java#L165-L173

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Sanju98 added a commit to Sanju98/zookeeper that referenced this pull request May 14, 2026
…oduction path

Two fixes addressing the code review on PR linkedin#142:

1. X509AuthenticationConfig.getSpiffeSanMatchPattern now uses
   double-checked locking with volatile fields and a dedicated lock
   object, matching the pattern already used by allowedClientIdAsAclDomains
   and other lazy-loaded fields in the same class. The previous lazy-init
   was a data race on the per-handshake hot path.

2. X509ZNodeGroupAclProvider's setDomainAuthUpdater lambda now calls
   helper.getDomains(clientId) instead of the raw map's getOrDefault,
   so the segment-prefix walk-up added to ZkClientUriDomainMappingHelper
   is reachable from the production authentication path. This is the
   znode-tree analogue of LinkedIn's documented acl-tool wildcard idiom
   (`--spiffe "application/<mp>/*"`); without this fix, MP-level prefix
   grants documented in the class javadoc would silently no-op.

Adds testA4_SpiffeCertResolvesViaPrefixWalkUpToDomainAuthInfo: real
SPIFFE cert with a 4-segment principal resolves to an MP-level leaf
grant through the full provider->helper.getDomains path. This test
would have failed before fix linkedin#2 (the exact-match lookup misses the
4-segment principal against a 2-segment registered prefix).

All 30 SPIFFE-related tests pass (X509AuthTest 13 +
X509SpiffeAuthIntegrationTest 6 + ZkClientUriDomainMappingHelperTest 11).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sanju98 and others added 2 commits May 14, 2026 14:39
Adds server-side support for SPIFFE-issued client certificates in X509 auth.
The principal is the ILM UID (path-after-/v2/), aligned with the LinkedIn ILM
v2 design: trust-domain stripped, segment-prefix matching for ACL lookup.

Key changes:
- X509AuthenticationUtil.matchAndExtractSpiffeSAN extracts the ILM UID from
  v2 SPIFFE URIs. v1 SPIFFE URIs and user-identity URIs (/v<N>/user/...) fall
  through to existing URN / Subject-DN extraction.
- X509AuthenticationConfig adds spiffe.sanMatchRegex as the operator-
  controlled trust-domain gate.
- ZkClientUriDomainMappingHelper recursively walks the znode subtree below
  each domain. Only leaf znodes are registered as keys (path joined by '/'),
  letting multi-segment SPIFFE UIDs be expressed as nested znodes (whose
  names can't contain '/'). getDomains does exact-match then segment-prefix
  walk-up. clientUriToDomainNames is volatile with in-method snapshot.
- Defense-in-depth: SPIFFE path extraction uses URI.getRawPath() and rejects
  any path containing '%' to prevent URL-decoding bypass of identity checks.

Tests: 29/29 pass (X509AuthTest 13, X509SpiffeAuthIntegrationTest 6,
ZkClientUriDomainMappingHelperTest 10; includes end-to-end SPIFFE-cert
through X509ZNodeGroupAclProvider with real znode mapping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oduction path

Two fixes addressing the code review on PR linkedin#142:

1. X509AuthenticationConfig.getSpiffeSanMatchPattern now uses
   double-checked locking with volatile fields and a dedicated lock
   object, matching the pattern already used by allowedClientIdAsAclDomains
   and other lazy-loaded fields in the same class. The previous lazy-init
   was a data race on the per-handshake hot path.

2. X509ZNodeGroupAclProvider's setDomainAuthUpdater lambda now calls
   helper.getDomains(clientId) instead of the raw map's getOrDefault,
   so the segment-prefix walk-up added to ZkClientUriDomainMappingHelper
   is reachable from the production authentication path. This is the
   znode-tree analogue of LinkedIn's documented acl-tool wildcard idiom
   (`--spiffe "application/<mp>/*"`); without this fix, MP-level prefix
   grants documented in the class javadoc would silently no-op.

Adds testA4_SpiffeCertResolvesViaPrefixWalkUpToDomainAuthInfo: real
SPIFFE cert with a 4-segment principal resolves to an MP-level leaf
grant through the full provider->helper.getDomains path. This test
would have failed before fix linkedin#2 (the exact-match lookup misses the
4-segment principal against a 2-segment registered prefix).

All 30 SPIFFE-related tests pass (X509AuthTest 13 +
X509SpiffeAuthIntegrationTest 6 + ZkClientUriDomainMappingHelperTest 11).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Sanju98 Sanju98 force-pushed the sanju98/spiffe-v2-auth branch from 242b58b to 45a3776 Compare May 14, 2026 09:13
@Sanju98 Sanju98 changed the base branch from li-dev/base-3.6.3-znode-group-security to branch-3.6 May 14, 2026 09:13
Comment on lines +54 to +64
// User-identity SPIFFE certs (issued to humans, not workloads) must NOT be promoted to a
// service principal, otherwise a user credential would be granted service-level ACL access.
// See LISPIFFE-ID spec: https://github.com/linkedin-multiproduct/gopki/blob/master/LISPIFFE-ID.md
private static final Pattern SPIFFE_USER_IDENTITY_PATH_PATTERN =
Pattern.compile("^/v\\d+/user(/.*)?$");

// Matches LISPIFFE v2 workload paths and captures the ILM UID (the path after "/v2/").
// The canonical ILM v2 principal is the full path-after-v2 (e.g. "application/foo-mp/bar-app");
// ACL matching downstream is segment-prefix on this UID. v1 SPIFFE URIs are deliberately not
// matched here — they are a deprecated design with app-name collision across MPs.
private static final Pattern SPIFFE_V2_PATH_PATTERN = Pattern.compile("^/v2/(.+)$");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in another comment - do add support for both SPIFFE v1 and v2.

SPIFFE URI format: spiffe://<trust-domain/<version>/<unique-identity>

Examples of SPIFFE v1:
spiffe://trust-domain/v1/wl/<app-name>

Therefore parsing of the workload unique-identity will get wl/<app-name>
Remove the prefix wl to get the .

Examples SPIFFE v2:
User: spiffe://trust-domain/v2/user/<userId>
Online application: spiffe://trust-domain/v2/application/<ILM-unique-id>

Flyte Workflow (Although I think, this is not the usecase for ZK).

More details here: https://github.com/linkedin-multiproduct/gopki/blob/master/LISPIFFE-ID.md#2-uri-path

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rgodha — addressed in 4a0533b.

The extractor now accepts both versions:

  • v2 (existing): spiffe://<td>/v2/<path> → principal is the full path-after-/v2/ (the ILM UID, e.g. application/foo-mp/bar-app).
  • v1 workload (new): spiffe://<td>/v1/wl/<app-name> → strip the wl/ type prefix; principal is just <app-name>. Matches how legacy authZ handled v1 identities.

User-identity URIs (/v<N>/user/...) are still rejected for both versions — they must never be promoted to a service principal. Other v1 paths (e.g. v1/wf/ workflow) fall through to URN/DN since they are not in scope for ZK.

Test testRealCertWithSpiffeV1WlUriSanIsExtracted builds a real BouncyCastle-signed cert with a v1/wl SAN and asserts the extracted principal is the bare app-name. All 30 SPIFFE-related tests pass locally.

Per @rgodha's review comment, the extractor now accepts both SPIFFE
versions instead of rejecting v1 outright:

- v2 (existing): spiffe://<td>/v2/<path> → principal is the full path
  after /v2/ (the ILM UID, e.g. "application/foo-mp/bar-app").
- v1 workload (new): spiffe://<td>/v1/wl/<app-name> → strip the "wl/"
  type prefix; principal is just the app-name. This matches how legacy
  authZ handled v1 identities. Other v1 paths (e.g. v1/wf/ workflow)
  still fall through to URN/DN.

User-identity URIs (/v<N>/user/...) continue to be rejected for both
versions — they must never be promoted to a service principal.

The test match regex broadens to ^spiffe://.*/v[12]/.*$ so existing
test scaffolding exercises both versions. testSpiffeV1FallsBackToDn
becomes testSpiffeV1WlAuth (now asserts extraction). The integration
test for v1 likewise flips from fall-back-to-DN to v1/wl extraction.

LISPIFFE-ID spec reference:
https://github.com/linkedin-multiproduct/gopki/blob/master/LISPIFFE-ID.md#2-uri-path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Sanju98 Sanju98 force-pushed the sanju98/spiffe-v2-auth branch from 4c137df to 4a0533b Compare May 15, 2026 04:56
@Sanju98 Sanju98 requested a review from rgodha May 17, 2026 04:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants