
Add finalizer management for application credentials#685

Open
Deydra71 wants to merge 1 commit into openstack-k8s-operators:main from Deydra71:appcred-service-labels

@Deydra71
Contributor

@Deydra71 Deydra71 commented Apr 1, 2026

Jira: OSPRH-28176, OSPRH-27512

Application Credential dev-doc: https://github.com/openstack-k8s-operators/dev-docs/blob/main/application_credentials.md

  • Delete unused GetApplicationCredentialFromSecret function
  • Add a service label to the AC secret for easy discovery, e.g. application-credential-service: barbican
  • Introduce immutable per-rotation AC secrets with deterministic names
  • Add a new status field, previousSecretName, that tracks the previously used AC secret
  • Add Keystone-side revocation of unused rotated ACs: the currently used secret and the previously used secret are protected, while the pre-previous secret has its AC revoked and is deleted
  • Suppress Owns() create events on the secret watch to prevent a race condition in which a stale informer cache sometimes caused an additional AC secret to be created and immediately deleted during app cred rotation
  • Add migration support for old mutable secrets - on the first reconcile after upgrade, the application-credential-service label is added to any existing secret that lacks it, making it visible to the label-based deletion. This ensures old secrets are properly revoked and cleaned up on the next rotation or CR deletion, and prevents orphaned protection finalizers

Each service operator that consumes an AC secret now places an openstack.org/<service>-ac-consumer finalizer on the AC secret it is actively using. This ensures the keystone-operator cannot revoke or clean up the secret while a service is still holding a reference to it.

NOTE: This PR doesn't include changes to tracking services that have credentials deployed on EDPM; that depends on openstack-k8s-operators/openstack-operator#1781

Tested with openstack-k8s-operators/barbican-operator#356

Assisted-by: Claude Opus 4.6 noreply@anthropic.com

@Deydra71 Deydra71 requested review from fmount, stuggi and vakwetu April 1, 2026 07:19
@openshift-ci openshift-ci bot requested review from abays and afaranha April 1, 2026 07:19
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/61090ad340f246b2a571069f7190fad7

openstack-k8s-operators-content-provider FAILURE in 8m 49s
⚠️ keystone-operator-kuttl SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider (non-voting)
⚠️ keystone-operator-tempest SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider (non-voting)

if k8s_errors.IsNotFound(err) {
    // If the ACID is set but the secret is not found, requeue to let the cache sync
    if instance.Status.ACID != "" {
        return ctrl.Result{RequeueAfter: time.Second * 5}, nil

If the ACID is set in status but the Secret is not found, the controller assumes an informer cache lag and requeues every 5 seconds. If the Secret was genuinely deleted (e.g. manual kubectl delete secret), this becomes an infinite requeue loop.

Consider adding a bounded retry - for example, an annotation counter or a timestamp check - and after a threshold (e.g. 30 seconds or N retries), fall through to doRotate = true so the controller self-heals instead of spinning indefinitely.

Contributor Author

We don't actually need the requeue anymore: reconcile triggers on create events are now suppressed, so this scenario can no longer happen. I will remove the guard, which also eliminates the possible infinite loop. The concern is valid; I reproduced the problem by manually deleting the finalizers.

// finalizer should only be removed once all nodes across all NodeSets have
// been redeployed with the new credentials. This depends on per-node secret
// rotation tracking: https://github.com/openstack-k8s-operators/openstack-operator/pull/1781
func hasConsumerFinalizer(secret *corev1.Secret) bool {

The substring check strings.Contains(f, "-ac-consumer") could match unrelated finalizers that happen to contain that substring. A tighter check would reduce false-positive risk:

if strings.HasPrefix(f, "openstack.org/") && strings.HasSuffix(f, "-ac-consumer") {
    return true
}

This still matches all service-operator consumer finalizers (openstack.org/barbican-ac-consumer, openstack.org/cinder-ac-consumer, etc.) while being robust against accidental collisions.
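
To make the difference concrete, here is a small self-contained comparison of the two checks. `Secret` is a stand-in for corev1.Secret that carries only finalizers, and the `example.com/...` finalizer is an invented value used purely to show a false positive.

```go
package main

import "strings"

// Secret is a minimal stand-in for corev1.Secret (assumption for this sketch).
type Secret struct {
	Name       string
	Finalizers []string
}

// hasConsumerFinalizerLoose mirrors the substring check under review.
func hasConsumerFinalizerLoose(s *Secret) bool {
	for _, f := range s.Finalizers {
		if strings.Contains(f, "-ac-consumer") {
			return true
		}
	}
	return false
}

// hasConsumerFinalizerStrict applies the suggested prefix + suffix check.
func hasConsumerFinalizerStrict(s *Secret) bool {
	for _, f := range s.Finalizers {
		if strings.HasPrefix(f, "openstack.org/") && strings.HasSuffix(f, "-ac-consumer") {
			return true
		}
	}
	return false
}
```

A finalizer such as `example.com/legacy-ac-consumer-cleanup` satisfies the loose check but not the strict one, while `openstack.org/barbican-ac-consumer` satisfies both.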


for i := range secretList.Items {
    s := &secretList.Items[i]
    if protected[s.Name] || hasConsumerFinalizer(s) || !controllerutil.ContainsFinalizer(s, acSecretFinalizer) {

The !controllerutil.ContainsFinalizer(s, acSecretFinalizer) guard creates a permanent orphan scenario. If the controller crashes between line 625 (removing the protection finalizer via Update) and line 631 (deleting the Secret), the Secret is left without a finalizer. On the next reconcile, this guard causes the loop to skip it, and it is never cleaned up.

One fix is to remove the ContainsFinalizer pre-condition entirely - if the Secret is not protected by name and has no consumer finalizer, it should be deleted regardless of whether it still carries the protection finalizer:

if protected[s.Name] || hasConsumerFinalizer(s) {
    continue
}

Alternatively, invert the order: issue the Delete first (the API server won't actually remove the object while the finalizer exists), then remove the finalizer. That way, even on crash between the two operations, the object is already marked for deletion and the next reconcile just removes the finalizer.
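
The orphan scenario can be reproduced with a small in-memory model. This is a sketch under assumptions: `Secret` stands in for corev1.Secret, `cleanupCandidates` is a hypothetical helper, and the secret names are made up. The `requireProtectionFinalizer` flag toggles the guard being discussed.

```go
package main

import "strings"

const acSecretFinalizer = "openstack.org/ac-secret-protection"

// Secret is a minimal stand-in for corev1.Secret (assumption for this sketch).
type Secret struct {
	Name       string
	Finalizers []string
}

func hasFinalizer(s Secret, name string) bool {
	for _, f := range s.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

func hasConsumerFinalizer(s Secret) bool {
	for _, f := range s.Finalizers {
		if strings.HasPrefix(f, "openstack.org/") && strings.HasSuffix(f, "-ac-consumer") {
			return true
		}
	}
	return false
}

// cleanupCandidates returns the names of Secrets the cleanup loop would delete.
func cleanupCandidates(secrets []Secret, current, previous string, requireProtectionFinalizer bool) []string {
	var out []string
	for _, s := range secrets {
		if s.Name == current || s.Name == previous || hasConsumerFinalizer(s) {
			continue
		}
		if requireProtectionFinalizer && !hasFinalizer(s, acSecretFinalizer) {
			// The guard under review: a Secret whose finalizer was removed just
			// before a crash is skipped on every later reconcile.
			continue
		}
		out = append(out, s.Name)
	}
	return out
}
```

A Secret that lost its protection finalizer but was never deleted is returned only when `requireProtectionFinalizer` is false, which is the behaviour the first proposed fix is after.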

Contributor Author

Yes, we don't need that additional check. Old mutable secrets that carry only the protection finalizer should be safe, because protected is checked before hasConsumerFinalizer, which covers the case where a service operator is late to add its consumer finalizer to the secret.

I don't think inverting the order is a good way to go: deletion would be blocked by the finalizer, so we would need to handle the Terminating state.

@@ -455,26 +728,25 @@ func (r *ApplicationCredentialReconciler) storeACSecret(

op, err := controllerutil.CreateOrPatch(ctx, helperObj.GetClient(), secret, func() error {

Since each rotation produces a unique Secret name (ac-<svc>-<first5>-secret), CreateOrPatch will effectively always Create. However, if the controller crashes after the Create succeeds but before the status is patched, the retry will find the Secret already exists and attempt a Patch - which will try to update .data on an immutable Secret and fail.

Consider replacing this with a plain Create + IsAlreadyExists check:

err := helperObj.GetClient().Create(ctx, secret)
if k8s_errors.IsAlreadyExists(err) {
    logger.Info("Immutable AC secret already exists (likely retry), proceeding", "secret", secretName)
    return secretName, nil
}
if err != nil {
    return "", fmt.Errorf("failed to create immutable AC secret %s: %w", secretName, err)
}
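
The crash-and-retry path can be exercised without a cluster. In the sketch below, `store` is an in-memory stand-in for the API server and `errAlreadyExists` stands in for the error that k8s_errors.IsAlreadyExists would recognise; both are assumptions made for illustration, as is the helper name `ensureImmutableSecret`.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the API server's AlreadyExists error.
var errAlreadyExists = errors.New("already exists")

// store is an in-memory stand-in for Secret storage, keyed by Secret name.
type store map[string]map[string]string

func (s store) create(name string, data map[string]string) error {
	if _, ok := s[name]; ok {
		return fmt.Errorf("secret %q: %w", name, errAlreadyExists)
	}
	s[name] = data
	return nil
}

// ensureImmutableSecret creates the Secret once. A retry after a crash finds
// it already present and returns the same name without touching its data.
func ensureImmutableSecret(s store, name string, data map[string]string) (string, error) {
	err := s.create(name, data)
	if errors.Is(err, errAlreadyExists) {
		return name, nil
	}
	if err != nil {
		return "", fmt.Errorf("failed to create immutable AC secret %s: %w", name, err)
	}
	return name, nil
}
```

Calling `ensureImmutableSecret` twice with the same name succeeds both times, and the second call leaves the stored data unchanged, so no write is ever attempted against the immutable object.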

userID string,
) error {
logger := r.GetLogger(ctx)
serviceName := strings.TrimPrefix(instance.Name, "ac-")

This derivation assumes every KeystoneApplicationCredential CR name follows the ac-<service> convention. If a CR is created with a name that doesn't start with ac-, TrimPrefix returns the full name unchanged and the resulting service label / label-based queries silently target the wrong set of Secrets.

Consider either:

  • Validating the naming convention via a webhook at admission time, or
  • Adding an explicit serviceName field to the Spec so the derivation is unnecessary and the contract is explicit.
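
Whichever option is chosen, the derivation itself can fail loudly instead of silently. A minimal sketch, where `serviceNameFromCR` is a hypothetical helper rather than an existing function in the operator:

```go
package main

import (
	"fmt"
	"strings"
)

const acPrefix = "ac-"

// serviceNameFromCR derives the service name from a CR name and rejects names
// that do not follow the ac-<service> convention.
func serviceNameFromCR(crName string) (string, error) {
	if !strings.HasPrefix(crName, acPrefix) || len(crName) == len(acPrefix) {
		return "", fmt.Errorf("CR name %q does not follow the %s<service> convention", crName, acPrefix)
	}
	return strings.TrimPrefix(crName, acPrefix), nil
}
```

The same check could back a validating webhook, so that a CR named `barbican` is rejected at admission rather than labelled with the wrong service.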

Contributor Author

openstack-operator creates AC CRs with a naming convention defined in keystone-operator: https://github.com/openstack-k8s-operators/openstack-operator/blob/main/internal/openstack/applicationcredential.go#L113

It's still possible to create the AC CR manually (we do that in tests, bypassing openstack-operator), but we want openstack-operator to be the sole creator of AC CRs, based on the config in the controlplane CR.

We could add an api/v1beta1/keystoneapplicationcredential_webhook.go; let's see what other reviewers say.

@mauricioharley mauricioharley left a comment

Thanks for the work on this, Veronika. The overall architecture is solid - immutable per-rotation secrets, Keystone-side revocation, and consumer finalizer coordination are the right design choices.

I left five inline comments on areas that could lead to subtle issues in production:

  1. Infinite requeue loop (line 223) - unbounded retry when a Secret is genuinely deleted
  2. Fragile substring match (line 564) - hasConsumerFinalizer could match unrelated finalizers
  3. Orphaned secrets (line 606) - crash between finalizer removal and delete leaves permanently uncleaned secrets
  4. Immutable secret retry failure (line 729) - CreateOrPatch fails on retry against an immutable Secret
  5. Implicit naming convention (line 584) - TrimPrefix silently misbehaves if CR name doesn't follow ac-<service>

Please address these before merging.

@openshift-ci
Contributor

openshift-ci bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Deydra71
Once this PR has been reviewed and has the lgtm label, please ask for approval from mauricioharley. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Deydra71 Deydra71 force-pushed the appcred-service-labels branch from 45357af to 9a3a2ad Compare April 7, 2026 10:48
@Deydra71
Contributor Author

Deydra71 commented Apr 7, 2026

Note: I used the /code-review skill available through https://github.com/fmount/openstack-k8s-agent-tools/ on the latest changes

@Deydra71
Contributor Author

Deydra71 commented Apr 8, 2026

/retest

"application-credentials": "true",
"application-credential-service": serviceName,
},
Finalizers: []string{acSecretFinalizer},
Contributor

@Deydra71 I see we have both a "producer" finalizer (this line) and a "consumer" finalizer, which is set by the service that uses the secret.
Is it accurate to think that this is useful for rotation purposes? In other words, I imagine that the consumer moves the finalizer from the old secret to the new one. At that point the old secret has no consumer finalizer, but keystone needs to keep it alive until the AC is revoked. Then it can explicitly remove the producer finalizer and delete the secret.

Contributor Author

Yes, the description is correct.

openstack.org/ac-secret-protection is set by keystone-operator at creation; it prevents the secret from being deleted before the AC is revoked in Keystone. The current secret and the previously used secret are protected by this finalizer.

openstack.org/<service>-ac-consumer is set by the service operator while it's actively using that secret.
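
The hand-off described in this exchange can be sketched as follows. `Secret` is a stand-in for corev1.Secret and `handOff` is a hypothetical helper; adding the finalizer to the new secret before removing it from the old one is an assumption of this sketch, not something the thread specifies.

```go
package main

// Secret is a minimal stand-in for corev1.Secret (assumption for this sketch).
type Secret struct {
	Name       string
	Finalizers []string
}

// consumerFinalizer builds the per-service consumer finalizer name.
func consumerFinalizer(service string) string {
	return "openstack.org/" + service + "-ac-consumer"
}

func addFinalizer(s *Secret, f string) {
	for _, existing := range s.Finalizers {
		if existing == f {
			return
		}
	}
	s.Finalizers = append(s.Finalizers, f)
}

func removeFinalizer(s *Secret, f string) {
	kept := make([]string, 0, len(s.Finalizers))
	for _, existing := range s.Finalizers {
		if existing != f {
			kept = append(kept, existing)
		}
	}
	s.Finalizers = kept
}

// handOff moves a service's consumer finalizer from the old secret to the new
// one. Adding first means the service never holds zero finalizers.
func handOff(oldSecret, newSecret *Secret, service string) {
	f := consumerFinalizer(service)
	addFinalizer(newSecret, f)
	removeFinalizer(oldSecret, f)
}
```

After the hand-off the old secret carries only openstack.org/ac-secret-protection, which is the state in which keystone-operator can revoke the AC, remove its own finalizer and delete the secret.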

// would indicate an ACID prefix collision (two different Keystone AC IDs
// whose first 5 characters are identical, producing the same secret name).
existing := &corev1.Secret{}
if getErr := helperObj.GetClient().Get(ctx, types.NamespacedName{Namespace: ac.Namespace, Name: secretName}, existing); getErr != nil {
Contributor

Wondering if we should rely on lib-common GetSecret here [1].
I guess we also have an option to get some data directly [2], so you could get .Data[keystonev1.ACIDSecretKey] and compare it here. I'm not sure this deserves a dedicated helper, but reusing existing functions and patterns would help keep the code readable.

[1] https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/common/secret/secret.go#L77
[2] https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/common/secret/secret.go#L438

Contributor Author

I didn't use it because lib-common's GetSecret computes a hash that we don't need here, so I chose the raw Get. But I see now that other operators generally use it even when the hash is just discarded, so I will do that too, to keep the pattern.
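
The collision this check guards against follows from the naming scheme. A minimal sketch, assuming the ac-<svc>-<first5>-secret pattern quoted in an earlier review comment; `secretNameFor` and `isPrefixCollision` are hypothetical helpers, and the AC IDs in the usage note are invented.

```go
package main

import "fmt"

// secretNameFor derives the per-rotation secret name from the service name and
// the first five characters of the Keystone AC ID.
func secretNameFor(service, acID string) (string, error) {
	if len(acID) < 5 {
		return "", fmt.Errorf("AC ID %q is shorter than 5 characters", acID)
	}
	return fmt.Sprintf("ac-%s-%s-secret", service, acID[:5]), nil
}

// isPrefixCollision reports whether an existing secret with the same name
// belongs to a different AC. storedACID is the ID read from the existing
// secret's data; an equal ID means a plain retry, not a collision.
func isPrefixCollision(storedACID, newACID string) bool {
	return storedACID != newACID
}
```

Two IDs such as `abcde12345` and `abcde99999` map to the same name, `ac-barbican-abcde-secret`, so comparing the stored ID with the new one is what separates a harmless retry from a genuine collision.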

}

secretList := &corev1.SecretList{}
if err := helperObj.GetClient().List(ctx, secretList,
Contributor

It would be useful to have a lib-common function to filter secrets by label, so we can shorten this code and keep only the relevant logic here. It would also be useful for service operators in case they need to list secrets by label.


    logger.Info("Could not get user ID, skipping revocation during AC CR delete", "error", err)
} else {
    seen := make(map[string]bool)
    for i := range secretList.Items {
Contributor

I see we list secrets many times in this function in order to do some processing and identify which ones should be deleted. I'm wondering if we can make this function more linear and merge some processing steps (e.g. build a list of secrets we can revoke and run the revocation in a single call, maybe with a helper that wraps revokeKeystoneAC).
I find the logic of this function hard to follow; please consider splitting it into smaller pieces that flatten the logic.

Contributor Author

Sadly, there's no API call to revoke multiple ACs at once, so the wrapper would just loop internally over the list of ACs to be revoked, still revoking them one by one.

But with the other changes you suggested, I think we simplified reconcileDelete pretty well:

  • merged the revocation loop and the finalizer removal loop into one pass
  • replaced client.List with oko_secret.GetSecrets (one call)
  • built the Keystone client once upfront instead of nesting the whole revocation logic inside the keystoneAPI != nil block

Please let me know if it makes more sense now, or if you have any additional simplification or suggestion.

}

secretList := &corev1.SecretList{}
if err := helperObj.GetClient().List(ctx, secretList,
Contributor

Here we do the same iteration (by label), which could be done through a lib-common helper (see my previous comment on reconcileDelete).

}

fresh := &corev1.Secret{}
if err := helperObj.GetClient().Get(ctx, types.NamespacedName{Namespace: s.Namespace, Name: s.Name}, fresh); err != nil {
Contributor

Why do we Get a secret (stored in s) that we already have? I'm not sure this logic is required here (L630 - L635).

Contributor Author

At first I thought we should re-Get to guard against a stale resourceVersion on Update: between the List on #604 and the Update on #636 we call revokeKeystoneAC, and during that external call some service operator could modify the secret, changing its resourceVersion.

However, once we simplify this to use GetSecrets(), the List result is going to be fresh, so I don't think the re-Get is necessary with the other agreed changes.

Contributor Author

Actually, I think that even if a conflict happens, the reconcile will be retried: a conflict means the resource has changed, so the Owns() watch is triggered and enqueues a new reconcile.

}
logger.Info("Removed protection finalizer from AC secret", "secret", s.Name)
}
if err := helperObj.GetClient().Delete(ctx, fresh); err != nil && !k8s_errors.IsNotFound(err) {
Contributor

Do we need the Delete? I'm wondering if RemoveFinalizer + Update is enough to see the Secret deleted.

Contributor Author

The secret still has an owner reference to the AC CR, which still exists, so it wouldn't be garbage-collected unless the CR is deleted. For security reasons we don't want to keep unused AC secrets lying around, so we revoke the AC in Keystone when all of these conditions hold:

  • it's not current SecretName
  • it's not PreviousSecretName
  • it doesn't have consumer finalizer

) error {
    logger := r.GetLogger(ctx)
    serviceName := strings.TrimPrefix(instance.Name, "ac-")
    protected := make(map[string]bool, 2)
Contributor

do you need a protected map here?
I think (just theory) you can remove from L596 to L602 and simply have on L617 (where protected is used):

if s.Name == instance.Status.SecretName || s.Name == instance.Status.PreviousSecretName || hasConsumerFinalizer(s) {
...

It might be a cleaner approach, as the reader does not have to infer the hidden detail about the string comparison.

Contributor Author

Yes, totally agree with you. The map is overkill.

Delete the unused `GetApplicationCredentialFromSecret` function, introduce immutable per-rotation AC secrets with deterministic names,
add Keystone-side revocation of unused rotated ACs, and suppress Owns() create events on the secret
watch to prevent a race condition in which a stale informer cache sometimes caused an additional AC secret to be created and immediately deleted during rotation.

Signed-off-by: Veronika Fisarova <vfisarov@redhat.com>
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
@Deydra71 Deydra71 force-pushed the appcred-service-labels branch from f70bc06 to da55e5f Compare April 10, 2026 08:07
@openshift-ci
Contributor

openshift-ci bot commented Apr 10, 2026

@Deydra71: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/functional | da55e5f | link | true | /test functional |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

3 participants