Skip to content

fix: deploy csi node daemonset for each wekaclient#2582

Open
assafgi wants to merge 1 commit into
mainfrom
06-09-fix_deploy_csi_node_daemonset_for_each_wekaclient
Open

fix: deploy csi node daemonset for each wekaclient#2582
assafgi wants to merge 1 commit into
mainfrom
06-09-fix_deploy_csi_node_daemonset_for_each_wekaclient

Conversation

@assafgi

@assafgi assafgi commented Jun 9, 2026

Copy link
Copy Markdown
Member

No description provided.

@assafgi assafgi marked this pull request as ready for review June 9, 2026 16:12
Copilot AI review requested due to automatic review settings June 9, 2026 16:12
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@assafgi assafgi requested a review from a team as a code owner June 9, 2026 16:12

assafgi commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has required the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude finished @assafgi's task in 5m 13s —— View job


PR Review: deploy CSI node DaemonSet for each WekaClient

  • Read changed files & diff
  • Analyze for bugs, quality, security, performance
  • Post review feedback

The change moves CSI node DaemonSet ownership from a single shared per-CSI-group DaemonSet to one DaemonSet per WekaClient, deploys/undeploys it in the client reconciler, adds ClientName to the hash spec, and cleans up the old shared DaemonSet. The overall direction is sound and the cleanup of the legacy pod-migration code is a nice simplification. A few issues worth addressing:


🔴 1. DeployCsiNodeDaemonSetForClient uses a non-idempotent Create on a step that runs every reconcile

client_reconciler_loop.go:1451 — the deploy step is configured with SkipStepStateCheck: true, so it executes on every reconcile, but the body calls plain c.Create(ctx, daemonSetSpec):

err = c.Create(ctx, daemonSetSpec)
if err != nil {
    return fmt.Errorf("failed to create CSI node daemonset for client %s: %w", c.wekaClient.Name, err)
}

After the first successful reconcile the DaemonSet exists, so every subsequent reconcile returns AlreadyExists and the step logs an error (swallowed by ContinueOnError). This is permanent error-log spam in steady state, and it relies on the engine not flipping CondCsiNodeDaemonSetDeployed back to false on error — if it does, the downstream UpdateCsiNodeDaemonSet step (which is gated on that condition being true) would stop running and updates would silently break.

The old deployCsiNodeDaemonSet guarded this with the !o.csiNodeDaemonSetExists predicate; that guard was dropped here. Suggest ignoring AlreadyExists (the existing getExistingCsiNodeDaemonSet helper + UpdateCsiNodeDaemonSet already cover the update path):

err = c.Create(ctx, daemonSetSpec)
if err != nil && !apierrors.IsNotFound(err) && !apierrors.IsAlreadyExists(err) {
    return fmt.Errorf("failed to create CSI node daemonset for client %s: %w", c.wekaClient.Name, err)
}

Fix this →


🟠 2. GetCSINodeDaemonSetNameForClient truncation can cause name collisions

daemonset.go:95:

base := strings.ReplaceAll(csiGroupName, ".", "-") + "-csi-node-" + strings.ReplaceAll(clientName, ".", "-")
if len(base) > 63 {
    base = base[:63]
}
base = strings.TrimRight(base, "-")

Two distinct clients whose names share the prefix that survives truncation will map to the same DaemonSet name — they'd fight over one DaemonSet, and undeploy of one would delete the other's. Two concerns:

  • A blind [:63] truncation is collision-prone; prefer appending a short hash of the full name (e.g. first 8 hex of a sha256 of clientName) and truncating the prefix, not the whole string.
  • Why 63 specifically? DaemonSet object names are DNS-1123 subdomains (max 253). The name isn't used as a label value here (only as the object key), so the 63 cap looks overly tight and increases collision risk. If 63 is required for some derived label, that derivation should use the same truncated value to stay consistent.

Fix this →


🟡 3. Per-client DaemonSet owner reference is the operator Deployment, not the WekaClient

DeployCsiNodeDaemonSetForClient sets the owner reference to the operator Deployment (matching the old shared model). Since the DaemonSet is now per-client, setting the owner to the WekaClient CR would give automatic garbage collection as a safety net if the explicit UndeployCsiNodeDaemonSetForClient in finalizeClient is ever skipped or fails. As written, a failed finalize path can orphan the DaemonSet. Worth considering owning it by the client (the finalize-time delete then becomes belt-and-suspenders).


🟢 Minor / nits

  • cleanupOldSharedCsiNodeDaemonSet (deploy_csi.go:439) logs "Deleted old shared CSI node daemonset successfully" even when the Delete raced into a NotFound (it's ignored but still falls through to the success log). Cosmetic.
  • Good call adding ClientName to CsiNodeHashableSpec so the per-client hash diverges and the update path triggers correctly.
  • Nice cleanup removing the legacy cleanupLegacyCsiNodePods polling logic (~150 lines) — the new flow is much easier to follow.

Summary

Solid refactor toward per-client CSI node DaemonSets. The main thing to fix before merge is the non-idempotent Create (#1), which will spam errors every reconcile and is fragile w.r.t. the condition gating the update step. The name-truncation collision (#2) is a correctness risk for clusters with many similarly-named clients. Owner-reference (#3) is a robustness improvement worth considering.

Note: I was unable to run go build/make test in this environment (sandbox restrictions), so I reviewed statically — CI should confirm compilation.
· branch 06-09-fix_deploy_csi_node_daemonset_for_each_wekaclient

@graphite-app

graphite-app Bot commented Jun 9, 2026

Copy link
Copy Markdown

Graphite Automations

"Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR • (06/09/26)

2 reviewers were added to this PR based on Anton Bykov's automation.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes CSI node deployment from a single shared DaemonSet (per CSI group) to a per-WekaClient DaemonSet model, and adjusts CSI deployment/cleanup logic accordingly within the operator reconciliation flow.

Changes:

  • Add per-client CSI node DaemonSet deploy/undeploy handling in the WekaClient reconciliation loop and update the node DaemonSet update path to target the per-client DaemonSet name.
  • Remove CSI node DaemonSet management from the shared DeployCsiOperation and replace it with cleanup of the legacy shared DaemonSet name.
  • Update CSI node DaemonSet naming and hashing to incorporate a “client identifier” input.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
internal/controllers/wekaclient/client_reconciler_loop.go Adds per-client CSI node DaemonSet deploy/undeploy steps and switches update/get paths to per-client naming.
internal/controllers/operations/deploy_csi.go Removes node DaemonSet deploy/undeploy from the shared CSI operation and adds cleanup for the legacy shared DaemonSet.
internal/controllers/operations/csi/daemonset.go Updates CSI node DaemonSet name/hash generation to accept a client identifier and generate per-client names.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1470 to +1473
err = c.Create(ctx, daemonSetSpec)
if err != nil {
return fmt.Errorf("failed to create CSI node daemonset for client %s: %w", c.wekaClient.Name, err)
}
Comment on lines +1455 to +1456
daemonSetSpec, err := csi.NewCsiNodeDaemonSet(ctx, c.GetCSIGroup(), c.wekaClient, c.wekaClient.Name, c.nodes)
if err != nil {
ctx, logger := instrumentation.CreateLogSpan(ctx, "UndeployCsiNodeDaemonSetForClient")
defer logger.End()

nodeDaemonSetName := csi.GetCSINodeDaemonSetNameForClient(c.GetCSIGroup(), c.wekaClient.Name)
Comment on lines 1514 to 1515
targetDaemonSet, err := csi.NewCsiNodeDaemonSet(ctx, c.GetCSIGroup(), c.wekaClient, c.wekaClient.Name, c.nodes)
if err != nil {
defer logger.End()

nodeDaemonSetName := csi.GetCSINodeDaemonSetName(c.GetCSIGroup())
nodeDaemonSetName := csi.GetCSINodeDaemonSetNameForClient(c.GetCSIGroup(), c.wekaClient.Name)
Comment on lines +95 to +103
func GetCSINodeDaemonSetNameForClient(csiGroupName, clientName string) string {
base := strings.ReplaceAll(csiGroupName, ".", "-") + "-csi-node-" + strings.ReplaceAll(clientName, ".", "-")
if len(base) > 63 {
base = base[:63]
}
// Ensure name doesn't end with a hyphen
base = strings.TrimRight(base, "-")
return base
}
Comment on lines 99 to +102
&lifecycle.SimpleStep{
Name: "CleanupLegacyCsiNodePods",
Run: o.cleanupLegacyCsiNodePods,
Predicates: lifecycle.Predicates{
func() bool { return !o.csiNodeDaemonSetExists },
}},
&lifecycle.SimpleStep{
Name: "DeployCsiNodeDaemonSet",
Run: o.deployCsiNodeDaemonSet,
Predicates: lifecycle.Predicates{
func() bool { return !o.csiNodeDaemonSetExists },
}},
Name: "CleanupOldSharedCsiNodeDaemonSet",
Run: o.cleanupOldSharedCsiNodeDaemonSet,
},
@assafgi assafgi force-pushed the 06-09-fix_deploy_csi_node_daemonset_for_each_wekaclient branch from d17f806 to 66c4644 Compare June 9, 2026 16:19
@claude

claude Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude finished @assafgi's task in 1m 23s —— View job


Re-review (post force-push)

The branch was amended at 16:19Z, but the diff is a Graphite restack — the review-relevant code is unchanged from my previous review. None of the prior findings were addressed, so they all still stand:

  • Read current state of changed files
  • Verify findings against current code

🔴 1. Non-idempotent Create still present — the merge blocker

client_reconciler_loop.goDeployCsiNodeDaemonSetForClient step is SkipStepStateCheck: true (runs every reconcile) but the body is still:

err = c.Create(ctx, daemonSetSpec)
if err != nil {
    return fmt.Errorf("failed to create CSI node daemonset for client %s: %w", c.wekaClient.Name, err)
}

Once the DaemonSet exists, every subsequent reconcile returns AlreadyExists → permanent error-log spam, and fragile w.r.t. the CondCsiNodeDaemonSetDeployed gate on the UpdateCsiNodeDaemonSet step. Ignore AlreadyExists:

err = c.Create(ctx, daemonSetSpec)
if err != nil && !apierrors.IsAlreadyExists(err) {
    return fmt.Errorf("failed to create CSI node daemonset for client %s: %w", c.wekaClient.Name, err)
}

Fix this →

🟠 2. GetCSINodeDaemonSetNameForClient still uses blind base[:63]

daemonset.go — clients sharing a 63-char prefix collapse to the same DaemonSet name and fight/delete each other. Append a short sha256 suffix of the full client identifier and truncate the prefix instead. Also note Copilot's namespace point: the name is derived from c.wekaClient.Name only — two same-named WekaClients in different namespaces collide in the operator namespace. Use a namespace-qualified identifier consistently across create/get/update/delete (all four currently pass c.wekaClient.Name).

Fix this →

🟡 3. Owner ref still the operator Deployment

DeployCsiNodeDaemonSetForClient owns the per-client DaemonSet by the operator Deployment. Owning it by the WekaClient CR would give GC as a safety net if UndeployCsiNodeDaemonSetForClient is ever skipped/fails. Still worth considering.

🟢 Note on Copilot's upgrade-downtime concern

cleanupOldSharedCsiNodeDaemonSet runs in the shared DeployCsiOperation (before per-client DaemonSets reconcile). On upgrade this can briefly leave no CSI node DaemonSet until each WekaClient reconciles. Worth confirming whether a short gap is acceptable here or whether cleanup should be deferred until per-client DaemonSets are confirmed present.


Summary

The force-push didn't change any reviewed code — finding #1 (non-idempotent Create) remains the blocker and should be fixed before merge. #2 (truncation + namespace qualification) is a real collision risk. I did not implement changes since this was a review trigger; reply with @claude implement the fixes if you'd like me to apply #1 and #2.
· branch 06-09-fix_deploy_csi_node_daemonset_for_each_wekaclient

Copilot AI review requested due to automatic review settings June 9, 2026 17:26
@assafgi assafgi force-pushed the 06-09-fix_deploy_csi_node_daemonset_for_each_wekaclient branch from 66c4644 to 5daf734 Compare June 9, 2026 17:26
@claude

claude Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude finished @assafgi's task in 1m 34s —— View job


Re-review (fixes applied) ✅

The new push to 5daf734c addresses the findings from my previous reviews. Verified against current code:

  • Read current state of changed files
  • Verify each prior finding against the code

✅ 1. Non-idempotent Createfixed

client_reconciler_loop.go:1470-1477 now treats AlreadyExists as success:

err = c.Create(ctx, daemonSetSpec)
if err != nil {
    if apierrors.IsAlreadyExists(err) {
        logger.Info("Per-client CSI node daemonset already exists, skipping creation", ...)
        return nil
    }
    return fmt.Errorf("failed to create CSI node daemonset for client %s: %w", ...)
}

No more steady-state error spam, and the CondCsiNodeDaemonSetDeployed gate stays true so UpdateCsiNodeDaemonSet keeps running. 👍

✅ 2. Name truncation + namespace qualification — fixed

daemonset.go:98-108 now (a) includes clientNamespace in the name, and (b) appends a sha256 suffix instead of blind truncation:

base := strings.ReplaceAll(csiGroupName, ".", "-") + "-csi-node-" + clientNamespace + "-" + strings.ReplaceAll(clientName, ".", "-")
if len(base) > 63 {
    hash := fmt.Sprintf("%x", sha256.Sum256([]byte(base)))[:8]
    base = base[:63-9] + "-" + hash   // 54 + "-" + 8 = 63
}
base = strings.TrimRight(base, "-")

The hash is derived from the full base (incl. namespace), so same-named clients in different namespaces no longer collide. All four paths — create (NewCsiNodeDaemonSet), get (getExistingCsiNodeDaemonSet:1558), update (UpdateCsiNodeDaemonSet:1518), delete (UndeployCsiNodeDaemonSetForClient:1486) — now pass (Name, Namespace) consistently. ClientName/ClientNamespace were also added to CsiNodeHashableSpec, so the per-client hash diverges correctly. 👍

✅ 4. Upgrade-downtime (Copilot) — addressed

deploy_csi.go:458-473 now defers deleting the legacy shared DaemonSet until the per-client one is confirmed present, closing the gap for the reconciling client.


🟡 Residual notes (non-blocking)

  • Sibling clients in a shared CSI group still have a brief gap on upgrade. cleanupOldSharedCsiNodeDaemonSet checks only this client's per-client DaemonSet (o.wekaClient) before deleting the group-shared DaemonSet. If clients A and B share a CSI group, A finishing first deletes the shared DS while B hasn't yet deployed its own — B's nodes lose the CSI node plugin until B reconciles. Mitigated for the reconciling client, not its siblings. Probably acceptable, but worth confirming.
  • Owner reference (prior minikube 1.32 #3) is still the operator Deployment rather than the WekaClient CR. Explicit UndeployCsiNodeDaemonSetForClient in finalize covers normal teardown; owning by the CR would add GC as a safety net if finalize is ever skipped. Optional robustness improvement.

Summary

All blocking findings from the previous reviews are resolved — the idempotent Create, namespace-qualified + hash-suffixed naming, and deferred legacy cleanup all look correct. No blockers remain. The two residual notes are optional.

Note: go build/make test are blocked by sandbox restrictions in this environment, so this is a static verification — CI should confirm compilation.
· branch 06-09-fix_deploy_csi_node_daemonset_for_each_wekaclient

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants