Add finalizer management for application credentials#685
Add finalizer management for application credentials#685Deydra71 wants to merge 1 commit intoopenstack-k8s-operators:mainfrom
Conversation
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/61090ad340f246b2a571069f7190fad7 ❌ openstack-k8s-operators-content-provider FAILURE in 8m 49s |
| if k8s_errors.IsNotFound(err) { | ||
| // If the ACID is set but the secret is not found, requeue to let the cache sync | ||
| if instance.Status.ACID != "" { | ||
| return ctrl.Result{RequeueAfter: time.Second * 5}, nil |
There was a problem hiding this comment.
If the ACID is set in status but the Secret is not found, the controller assumes an informer cache lag and requeues every 5 seconds. If the Secret was genuinely deleted (e.g. manual kubectl delete secret), this becomes an infinite requeue loop.
Consider adding a bounded retry - for example, an annotation counter or a timestamp check - and after a threshold (e.g. 30 seconds or N retries), fall through to doRotate = true so the controller self-heals instead of spinning indefinitely.
There was a problem hiding this comment.
We don't actually need the requeue anymore - we suppress reconcile trigger on create events, so this scenario can't happen anymore. I will remove the guard and that fixes the infinite loop possibility, which is valid and I replicated manually the problem when I manually deleted the finalziers.
| // finalizer should only be removed once all nodes across all NodeSets have | ||
| // been redeployed with the new credentials. This depends on per-node secret | ||
| // rotation tracking: https://github.com/openstack-k8s-operators/openstack-operator/pull/1781 | ||
| func hasConsumerFinalizer(secret *corev1.Secret) bool { |
There was a problem hiding this comment.
The substring check strings.Contains(f, "-ac-consumer") could match unrelated finalizers that happen to contain that substring. A tighter check would reduce false-positive risk:
if strings.HasPrefix(f, "openstack.org/") && strings.HasSuffix(f, "-ac-consumer") {
return true
}This still matches all service-operator consumer finalizers (openstack.org/barbican-ac-consumer, openstack.org/cinder-ac-consumer, etc.) while being robust against accidental collisions.
|
|
||
| for i := range secretList.Items { | ||
| s := &secretList.Items[i] | ||
| if protected[s.Name] || hasConsumerFinalizer(s) || !controllerutil.ContainsFinalizer(s, acSecretFinalizer) { |
There was a problem hiding this comment.
The !controllerutil.ContainsFinalizer(s, acSecretFinalizer) guard creates a permanent orphan scenario. If the controller crashes between line 625 (removing the protection finalizer via Update) and line 631 (deleting the Secret), the Secret is left without a finalizer. On the next reconcile, this guard causes the loop to skip it, and it is never cleaned up.
One fix is to remove the ContainsFinalizer pre-condition entirely - if the Secret is not protected by name and has no consumer finalizer, it should be deleted regardless of whether it still carries the protection finalizer:
if protected[s.Name] || hasConsumerFinalizer(s) {
continue
}Alternatively, invert the order: issue the Delete first (the API server won't actually remove the object while the finalizer exists), then remove the finalizer. That way, even on crash between the two operations, the object is already marked for deletion and the next reconcile just removes the finalizer.
There was a problem hiding this comment.
Yes, we don't need that additional check. Old mutable secrets with only the protection finalizers should be safe since the protected is checked before hasConsumerFinalizer if service operator would be late to add consumer finalizer to the secret.
I don't think inversing the order is a good way to go, we would need to handle the Terminating state, because deletion would be blocked.
| @@ -455,26 +728,25 @@ func (r *ApplicationCredentialReconciler) storeACSecret( | |||
|
|
|||
| op, err := controllerutil.CreateOrPatch(ctx, helperObj.GetClient(), secret, func() error { | |||
There was a problem hiding this comment.
Since each rotation produces a unique Secret name (ac-<svc>-<first5>-secret), CreateOrPatch will effectively always Create. However, if the controller crashes after the Create succeeds but before the status is patched, the retry will find the Secret already exists and attempt a Patch - which will try to update .data on an immutable Secret and fail.
Consider replacing this with a plain Create + IsAlreadyExists check:
err := helperObj.GetClient().Create(ctx, secret)
if k8s_errors.IsAlreadyExists(err) {
logger.Info("Immutable AC secret already exists (likely retry), proceeding", "secret", secretName)
return secretName, nil
}
if err != nil {
return "", fmt.Errorf("failed to create immutable AC secret %s: %w", secretName, err)
}| userID string, | ||
| ) error { | ||
| logger := r.GetLogger(ctx) | ||
| serviceName := strings.TrimPrefix(instance.Name, "ac-") |
There was a problem hiding this comment.
This derivation assumes every KeystoneApplicationCredential CR name follows the ac-<service> convention. If a CR is created with a name that doesn't start with ac-, TrimPrefix returns the full name unchanged and the resulting service label / label-based queries silently target the wrong set of Secrets.
Consider either:
- Validating the naming convention via a webhook at admission time, or
- Adding an explicit
serviceNamefield to the Spec so the derivation is unnecessary and the contract is explicit.
There was a problem hiding this comment.
openstack-operator creates AC CRs with a name convention defined in keystone-operator -https://github.com/openstack-k8s-operators/openstack-operator/blob/main/internal/openstack/applicationcredential.go#L113
It's still possible to manually create the AC CR ( we do that for tests bypassing openstack-op), but we want openstack-operator to be the sole creator of AC CR based on config in controlplane CR.
We could create additional api/v1beta1/keystoneapplicationcredential_webhook.go , let's see what other reviewers comment.
mauricioharley
left a comment
There was a problem hiding this comment.
Thanks for the work on this, Veronika. The overall architecture is solid - immutable per-rotation secrets, Keystone-side revocation, and consumer finalizer coordination are the right design choices.
I left five inline comments on areas that could lead to subtle issues in production:
- Infinite requeue loop (line 223) - unbounded retry when a Secret is genuinely deleted
- Fragile substring match (line 564) -
hasConsumerFinalizercould match unrelated finalizers - Orphaned secrets (line 606) - crash between finalizer removal and delete leaves permanently uncleaned secrets
- Immutable secret retry failure (line 729) -
CreateOrPatchfails on retry against an immutable Secret - Implicit naming convention (line 584) -
TrimPrefixsilently misbehaves if CR name doesn't followac-<service>
Please address these before merging.
7af98fc to
45357af
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Deydra71 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
45357af to
9a3a2ad
Compare
|
Note: I used the |
|
/retest |
9a3a2ad to
f70bc06
Compare
Depends-On: openstack-k8s-operators/keystone-operator#685 Signed-off-by: Veronika Fisarova <vfisarov@redhat.com>
| "application-credentials": "true", | ||
| "application-credential-service": serviceName, | ||
| }, | ||
| Finalizers: []string{acSecretFinalizer}, |
There was a problem hiding this comment.
@Deydra71 I see we have both a "producer" finalizer (this line), and a "consumer" finalizer, which is set by the service that uses the secret.
Is it accurate to think that this is useful for rotation purposes? In other words I imagine that the consumer moves the finalizer from the old secret to the new one. At that point the old secret has no consumer finalizer, but keystone needs to keep it alive until the AC is revoked. Then it can explicitly remove the
producer finalizer and delete the secret.
There was a problem hiding this comment.
Yes, the description is correct.
openstack.org/ac-secret-protection is set by keystone op at creation that prevents the secret from being deleted before the AC is revoked in Keystone. The current secret and previous used secret are protected by this finalizer.
openstack.org/<service>-ac-consumer is set by the service operator while it's actively using that secret.
| // would indicate an ACID prefix collision (two different Keystone AC IDs | ||
| // whose first 5 characters are identical, producing the same secret name). | ||
| existing := &corev1.Secret{} | ||
| if getErr := helperObj.GetClient().Get(ctx, types.NamespacedName{Namespace: ac.Namespace, Name: secretName}, existing); getErr != nil { |
There was a problem hiding this comment.
wondering if we should rely on lib-common GetSecret here [1].
I guess we also have an option to get some data directly [2], so you could get .Data[keystonev1.ACIDSecretKey] and compare it here. Not sure this might deserve a dedicated helper but reusing existing functions/patterns might help to keep the code readable.
[1] https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/common/secret/secret.go#L77
[2] https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/common/secret/secret.go#L438
There was a problem hiding this comment.
I didn't use it because the lib-common GetSecret computes hash we don't need here. So I chose the raw get. But I see now that in other operators we generally use it even when the hash is just discarded, so I will do that too too keep the pattern.
| } | ||
|
|
||
| secretList := &corev1.SecretList{} | ||
| if err := helperObj.GetClient().List(ctx, secretList, |
There was a problem hiding this comment.
It would be useful to have a lib-common function to filter secrets by labels, so we can shorten this code and keep here just the relevant logic. It would be also useful for service operators in case we need to list secrets by label.
There was a problem hiding this comment.
We actually already have it - https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/common/secret/secret.go#L97 , thanks!
| logger.Info("Could not get user ID, skipping revocation during AC CR delete", "error", err) | ||
| } else { | ||
| seen := make(map[string]bool) | ||
| for i := range secretList.Items { |
There was a problem hiding this comment.
I see we list secrets many times in this function with the purpose of doing some processing and identify which one should be deleted. I'm wondering if we can make this function more linear and merge some processing steps (e.g. we build a list of secrets we can revoke and we run a massive revocation in a single call, maybe with an helper that wraps revokeKeystoneAC).
I find hard to follow the logic of this function, and you should consider splitting the logic into smaller pieces that flatten the logic.
There was a problem hiding this comment.
There's sadly no available call to revoke multiple ACs in one API call, so the wrapper would jsut loop internally over the list of ACs meant to be revoked, but at the end revoking them one by one anyway.
But with the other changes you sugegsted I think we simplified the reconcileDelete pretty well:
- merged the revocation loop and the finalizer removal loop into one pass
- replaced
client.Listwithoko_secret.GetSecrets(one call) - building the Keystone client once upfront and not nesting the whole revocation logic inside the
keystoneAPI != nilblock
Please let me know if it makes more sense now, or if it's there some additional simplification/sugegstion.
| } | ||
|
|
||
| secretList := &corev1.SecretList{} | ||
| if err := helperObj.GetClient().List(ctx, secretList, |
There was a problem hiding this comment.
Here we do the same iteration (by label) that can be realized through a lib-common helper (see my previous comment on reconcileDelete).
| } | ||
|
|
||
| fresh := &corev1.Secret{} | ||
| if err := helperObj.GetClient().Get(ctx, types.NamespacedName{Namespace: s.Namespace, Name: s.Name}, fresh); err != nil { |
There was a problem hiding this comment.
why do we .Get( a secret (stored in s) that we already have? I'm not sure this logic is required here (L630 - L635)
There was a problem hiding this comment.
At first I thought that we should re Get to guard against stale resourceVersion on Update, because between List on #604 and Update on #636 we call revokeKeystoneAC and in the meantime of that external call some service operator could modify the secret changint the resourceVersion.
However if we already simplify it to use GetSecrets() the List result is going to be fresh, so I don't think it's necessary with the other agreed changes.
There was a problem hiding this comment.
Actually, I think that even if the conflict would happen the reconcile will retry again, because if conflict happens then it means there's some change on the resource and Owns() watch is triggered and enqueue new recopncile
| } | ||
| logger.Info("Removed protection finalizer from AC secret", "secret", s.Name) | ||
| } | ||
| if err := helperObj.GetClient().Delete(ctx, fresh); err != nil && !k8s_errors.IsNotFound(err) { |
There was a problem hiding this comment.
Do we need the Delete? I'm wondering if RemoveFinalizer + Update is enough to see the Secret deleted.
There was a problem hiding this comment.
The secret still has owner reference to the AC CR that exists, so it wouldn't be collected unless the CR is deleted. For security reasons we don't want to keep unused AC secrets laying around, so we want to revoke AC in Keystone when these are fulfilled:
- it's not current
SecretName - it's not
PreviousSecretName - it doesn't have consumer finalizer
| ) error { | ||
| logger := r.GetLogger(ctx) | ||
| serviceName := strings.TrimPrefix(instance.Name, "ac-") | ||
| protected := make(map[string]bool, 2) |
There was a problem hiding this comment.
do you need a protected map here?
I think (just theory) you can remove from L596 to L602 and simply have on L617 (where protected is used):
if s.Name == instance.Status.SecretName || s.Name == instance.Status.PreviousSecretName || hasConsumerFinalizer(s) {
...it might be a cleaner approach as we do not inference (when we read the code) the hidden detail about string comparison.
There was a problem hiding this comment.
Yes, totally agree with you. The map is an overkill
Deleted unused `GetApplicationCredentialFromSecret` function and introduce immutable per-rotation AC secrets with deterministic names, add Keystone-side revocation of unused rotated ACs, and suppress Owns() create events on the secret watch to prevent a race condition caused by stale informer cach and sometimes causing additional AC secret to be created and deleted immediately during rotation. Signed-off-by: Veronika Fisarova <vfisarov@redhat.com> Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
f70bc06 to
da55e5f
Compare
|
@Deydra71: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Jira: OSPRH-28176, OSPRH-27512
Application Credential dev-doc: https://github.com/openstack-k8s-operators/dev-docs/blob/main/application_credentials.md
GetApplicationCredentialFromSecretfunctionapplication-credential-service: barbicanpreviousSecretNameused for tracking previously used AC secretapplication-credential-servicelabel is added to any existing secret that lacks it, making it visible to the label-based deletion. This ensures old secrets are properly revoked and cleaned up on the next rotation or CR deletion, and prevents orphaned protection finalizersEach service operator that consumes an AC secret now places a
openstack.org/<service>-ac-consumerfinalizer on the AC secret it is actively using. This ensures the keystone-operator cannot revoke or clean up secret while a service is still holding a reference to it.NOTE: This PR doesn't incldue changes to tracking services that have credentials deployed on EDPM, that depends on openstack-k8s-operators/openstack-operator#1781
Tested with openstack-k8s-operators/barbican-operator#356
Assisted-by: Claude Opus 4.6 noreply@anthropic.com