
fix: replace shared retryAttempts global with per-Cluster annotation in deregistration handler#485

Open
mdryaan wants to merge 1 commit into kubeslice:master from mdryaan:fix/cluster-reconciler-global-retry-counter


mdryaan commented May 14, 2026

Description

handleClusterDeletion was using a package-level var retryAttempts = 0 to cap deregistration retries at MAX_CLUSTER_DEREGISTRATION_ATTEMPTS = 3. This had two problems:

  • All Cluster objects shared the same counter. If two clusters were being deleted simultaneously, failures from one exhausted the retry budget for the other.
  • Restarting the operator pod reset the counter to zero, so the "stop after 3 attempts" guarantee didn't actually hold across restarts.

Removed the package-level var and moved attempt tracking into the Cluster object itself using the annotation kubeslice.io/deregistration-attempts. On each failed createDeregisterJob call, the annotation is incremented and patched back via client.MergeFrom. Each Cluster CR now has its own independent counter that survives pod restarts and is unaffected by concurrent deletions of other clusters.
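
For reviewers who want the idea in isolation, here is a minimal sketch of per-object attempt tracking, not the code in this PR. It works on a plain annotations map so it runs without a cluster. The helper names `getAttempts` and `recordFailedAttempt` are hypothetical, and treating a missing or malformed annotation as zero attempts is an assumption. In a real reconciler the updated map would be set on the Cluster object and sent with a merge patch.

```go
package main

import "strconv"

const (
	// Annotation key named in the PR description.
	deregistrationAttemptsAnnotation = "kubeslice.io/deregistration-attempts"
	// Cap quoted in the PR description (MAX_CLUSTER_DEREGISTRATION_ATTEMPTS = 3).
	maxClusterDeregistrationAttempts = 3
)

// getAttempts is a hypothetical helper that reads the attempt count from an
// object's annotations. Assumption: a missing, negative, or non-numeric value
// counts as zero attempts.
func getAttempts(annotations map[string]string) int {
	raw, ok := annotations[deregistrationAttemptsAnnotation]
	if !ok {
		return 0
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 0
	}
	return n
}

// recordFailedAttempt is a hypothetical helper that returns a copy of the
// annotations with the counter incremented by one, plus a flag reporting
// whether the retry budget is now used up. The input map is not modified,
// so the caller still holds the original state to diff against when patching.
func recordFailedAttempt(annotations map[string]string) (map[string]string, bool) {
	updated := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		updated[k] = v
	}
	attempts := getAttempts(annotations) + 1
	updated[deregistrationAttemptsAnnotation] = strconv.Itoa(attempts)
	return updated, attempts >= maxClusterDeregistrationAttempts
}
```

Because the count is stored with each object, two clusters failing at the same time each spend only their own budget, and the count is read back from the object after a restart instead of starting at zero in memory.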

Fixes #484

How Has This Been Tested?

  • make fmt — no formatting changes
  • make vet — no issues
  • make build — builds cleanly

Checklist:

  • The title of the PR states what changed and the related issue number (used for the release note).
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have run go fmt
  • I have updated the helm chart as required by this PR.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have tested it for all user roles.
  • I have added all the required unit test cases.
  • I have verified the E2E test cases with new code changes.
  • I have added all the required E2E test cases.

Does this PR introduce a breaking change?

Cluster deregistration attempt count is now tracked via the annotation
`kubeslice.io/deregistration-attempts` on each Cluster CR instead of an in-memory
counter. Existing clusters mid-deletion will lose their in-flight attempt count on
upgrade and restart from zero — they will get up to 3 fresh attempts.

Signed-off-by: mdryaan <alikhurshid842001@gmail.com>

Development

Successfully merging this pull request may close these issues.

Bug: retryAttempts is a package-level variable — shared across all Cluster objects and survives controller restarts
