feat(tonic-xds): gRFC A29 data-plane TLS connector wiring#2640
Conversation
gu0keno0
left a comment
There was a problem hiding this comment.
I'll continue to look at the other files.
| /// so each connection picks up the latest snapshot. Existing endpoint | ||
| /// channels keep their `EndpointChannel` instance (and any in-flight TLS | ||
| /// session) — only freshly-discovered endpoints see the swapped value. | ||
| pub(crate) type ConnectorSwap<S> = Arc<ArcSwap<Arc<dyn Connector<Service = S> + Send + Sync>>>; |
There was a problem hiding this comment.
Currently loadbalancer just take a Connector object, would it be possible to create a wrapper connector that provide atomic method to swap the inner one? Then we can reuse the same wrapper in LoadBalancer.
There was a problem hiding this comment.
Or the other way is like you commented there, make LoadBalancer call loadfull everytime, I just feel a wrapper might be simpler for callers. Yet it's up to you. If you want to do one way versus the other, maybe change it for LoadBalancer as well.
There was a problem hiding this comment.
yeah wrapping the load call makes sense, I'll do that.
|
|
||
| let connector_swap: ConnectorSwap<EndpointChannel<Channel>> = loop { | ||
| let Some(cluster) = cluster_watch.next().await else { | ||
| return; |
| tokio::select! { | ||
| Some(change) = endpoints.next() => { | ||
| if tx.send(change).await.is_err() { | ||
| return; |
| let manager = EndpointManager::new(Arc::clone(&connector_swap)); | ||
| let mut endpoints = manager.discover_endpoints(cache.watch_endpoints(&cluster_name)); | ||
|
|
||
| loop { | ||
| tokio::select! { | ||
| Some(change) = endpoints.next() => { |
There was a problem hiding this comment.
Note: there's no guarantee that EDS resource name == CDS cluster name, so I think we should change the logic there to allow changing of EDS resource after CDS updates. Usually people don't configure xds to do this, but the protocol totally allows it.
There was a problem hiding this comment.
So the current code actually handles EDS resource name correctly: in resource_manager.rs:228 the call to xds-client::watch is using eds_name. The confusion here is because we store EDS watches by cluster_name as the key (designed this way because of the ClusterDiscovery abstraction: cache updates is tied to a cluster's lifecycle).
here is my analysis of whether we should change to keying the EDS watch by EDS resource name:
pros:
- readability
- sharing of the watch channel if same EDS referenced by many CDS. Diffing is still owned by different tasks so not much save on compute just some storage.
cons: - need another layer of lifecycle management for eds updates, especially for the case "one eds reference is removed but one more cds reference it", so we'll end up having to refcount or reverse mapping. for context grpc-go/grpc-java basically does the refcount design through its gc-based lifecycle management of eds.
let me know what you think, I prefer keep the design as is now, possibly with more documentation to clarify this design.
There was a problem hiding this comment.
Sounds good to me, let's add a comment here about the EDS key name.
| return; | ||
| } | ||
| } | ||
| Some(cluster) = cluster_watch.next() => { |
There was a problem hiding this comment.
I think we should use biased to allow CDS to be handled first? It should be slightly better.
|
Filing the remaining review feedback as a follow-up PR for a cleaner review as they are all style / debug changes:
|
Motivation
Ref: #2444
Closes out gRFC A29 (xDS-Based TLS Security) in
tonic-xds. The cert provider foundation merged in #2593 and #2616 left the connector integration deferred — its pre-reqEndpoint::tls_config_with_verifierlanded in #2612.With that hook available, this PR wires up the per-cluster TLS connector.
Solution
Five commits, each independently reviewable:
file_watcherbackground refresh.Connectoron CDS update. Existing endpoint channels keep their connector; new connections pick up the latest. Invalid CDS updates are logged and the previous-good connector is kept.TlsConnector— for clusters withSome(security), the connector holds the A29 verifier plus an optional identity provider (mTLS).connect()fetches identity per call, assembles aClientTlsConfig, and builds the endpoint.Tests
Sync
refresh_oncetests infile_watcher, synthetic-cert chain tests for SAN matching,build_connectordispatch tests, and a counter-based test asserting identity is fetched perconnect(). Existingchannel.rsintegration tests exercise the cluster-aware discovery path end-to-end on the plaintext.Interop validation
End-to-end mTLS against Istio (Kind +
grpc-agenttemplate + SPIFFE creds)in YutaoMa/tonic-xds-istio-interop@tonic-pr-2640:
Confirmed successful run with logs
request ok request_num=1 … Nshown below. The harness uncovered thetwo parser-interop bugs fixed in
66d77c5a.