Skip to content

fix: process ensure nics per node#2585

Open
assafgi wants to merge 1 commit into
06-09-fix_deploy_csi_node_daemonset_for_each_wekaclientfrom
06-10-fix_process_ensure_nics_per_node
Open

fix: process ensure nics per node#2585
assafgi wants to merge 1 commit into
06-09-fix_deploy_csi_node_daemonset_for_each_wekaclientfrom
06-10-fix_process_ensure_nics_per_node

Conversation

@assafgi

@assafgi assafgi commented Jun 10, 2026

Copy link
Copy Markdown
Member

No description provided.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

assafgi commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has required the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

Claude finished @assafgi's task in 1m 12s —— View job


Code Review: process ensure nics per node


This is a clean, sensible refactor. The old flow gated all node patching behind a PollResults barrier that required every container's ExecutionResult to be ready. The new ProcessResult patches each node as soon as its own container reports, and only returns a WaitError at the end if some are still pending. That matches the PR title ("per node") and is a genuine improvement for partial progress. Capturing o.results before returning the WaitError is also a nice touch.

A couple of things worth a look:

🔴 Idempotency check can permanently skip the status update

ensure_nics.go:236-239

if annotations[domain.WEKANICs] == string(nicsBytes) {
    // Already patched (idempotent — avoids unnecessary updates on re-polls)
    return nil
}

The early return short-circuits both the annotation update and the extended-resource status update (Capacity/Allocatable, lines 248-250). Consider this sequence:

  1. Reconcile A: annotation Update succeeds (server now has the annotation), then Status().Update fails (e.g. conflict) → patchNodeNICs returns an error, step fails.
  2. Reconcile B: GetNode returns the node with the annotation already equal to nicsBytes → early return nilthe status Capacity/Allocatable are never written.

So a transient failure on the status update leaves the node permanently without the WEKANICs extended resources, even though the operation reports success. The annotation and the status are written non-atomically, so using only the annotation as the "already patched" sentinel is unsafe.

Suggestion: gate just the annotation Update on the annotation diff, but always reconcile the status capacity/allocatable (which is itself idempotent — writing the same quantity is a no-op-ish update). For example:

needsAnnotation := annotations[domain.WEKANICs] != string(nicsBytes)
if needsAnnotation {
    annotations[domain.WEKANICs] = string(nicsBytes)
    node.Annotations = annotations
    if err := o.client.Update(ctx, node); err != nil {
        return lifecycle.NewWaitError(err)
    }
}
// always ensure extended resources match
node.Status.Capacity[domain.WEKANICs] = ...
node.Status.Allocatable[domain.WEKANICs] = ...

(or compare the existing capacity quantity and skip only when both annotation and capacity already match). Fix this →

🟡 Pre-existing: nil node.Status.Capacity/Allocatable map (lines 248-249)

node.Status.Capacity[domain.WEKANICs] = ... indexes the map without a nil guard. Real nodes always have these populated by the kubelet, so it's unlikely to bite, but since this code is being moved/restructured anyway, a defensive if node.Status.Capacity == nil { ... } (matching the annotation nil-guard above it) would be consistent and panic-safe. Not introduced by this PR — just noting.

🟢 Minor notes

  • The removal of the standalone PollResults step means EnsureNICsOperation now diverges from the sibling operations (sign_drives.go, discover_drives.go, etc.) which still use a separate PollResults. That's fine — it's the point of this change — but the divergence is worth keeping in mind for consistency if those are ever unified.
  • Confirmed there are no remaining references to the deleted EnsureNICsOperation.PollResults, so the build should be clean.

Overall: good change, the incremental per-node patching is the right direction. The idempotency/status-skip interaction is the one item I'd recommend addressing before merge.
· branch 06-10-fix_process_ensure_nics_per_node

@assafgi assafgi marked this pull request as ready for review June 10, 2026 11:41
@assafgi assafgi requested a review from a team as a code owner June 10, 2026 11:41
@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

Claude finished @assafgi's task in 45s —— View job


Code Review: process ensure nics per node

  • Check whether anything changed since my last review
  • Re-confirm the open finding still applies

No new commits since my previous review — HEAD is still 6e0ee7b4 and internal/controllers/operations/ensure_nics.go is unchanged. Exiting early; my earlier review stands.

The one item still worth addressing before merge is the idempotency/status-skip interaction:

🔴 ensure_nics.go:236-239 — the early return nil when the annotation already matches short-circuits the extended-resource status update (Capacity/Allocatable, lines 248-250). If the annotation Update previously succeeded but Status().Update failed, a later reconcile sees the matching annotation and never writes the extended resources, yet reports success. The annotation and status are written non-atomically, so the annotation alone is an unsafe "already patched" sentinel.

Suggested fix: gate only the annotation Update on the annotation diff, and always reconcile the (idempotent) status capacity/allocatable. Fix this →
· branch 06-10-fix_process_ensure_nics_per_node

@graphite-app

graphite-app Bot commented Jun 10, 2026

Copy link
Copy Markdown

Graphite Automations

"Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR • (06/10/26)

2 reviewers were added to this PR based on Anton Bykov's automation.

@assafgi assafgi changed the base branch from 05-31-fix_add_admission_control_to_block_finalizers_removal_not_made_by_the_operator_service_account to graphite-base/2585 June 10, 2026 11:47
@assafgi assafgi force-pushed the 06-10-fix_process_ensure_nics_per_node branch from 6e0ee7b to 7bc18ca Compare June 10, 2026 11:47
@assafgi assafgi force-pushed the graphite-base/2585 branch from c9d3508 to 5daf734 Compare June 10, 2026 11:47
@assafgi assafgi changed the base branch from graphite-base/2585 to 06-09-fix_deploy_csi_node_daemonset_for_each_wekaclient June 10, 2026 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant