NVIDIA-596: Enable dpu healthcheck #2941
@tsorya: This pull request references NVIDIA-596, which is a valid Jira issue. Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.
|
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough
Add DPU node lease configuration: new bootstrap fields and defaults, parse/validate ConfigMap keys, expose string values to templates, inject env vars into ovnkube DaemonSet for DPU modes, pass lease flags from node script to ovnkube, add ConfigMap defaults and tests.
Sequence Diagram(s)
sequenceDiagram
participant ConfigMap as hardware-offload ConfigMap
participant Bootstrap as bootstrapOVNConfig
participant Renderer as template renderer
participant K8sAPI as Kubernetes API (DaemonSet)
participant NodeScript as ovnkube node script
participant ovnkube as ovnkube process
ConfigMap->>Bootstrap: read dpu-node-lease-* keys
Bootstrap->>Bootstrap: parse & validate values
Bootstrap->>Renderer: provide DpuNodeLeaseRenewInterval/DpuNodeLeaseDuration (strings)
Renderer->>K8sAPI: create/update DaemonSet with env vars (conditional on node mode)
K8sAPI->>NodeScript: schedule/run ovnkube node script (on node)
NodeScript->>NodeScript: build dpu_lease_flags from env vars
NodeScript->>ovnkube: invoke ovnkube with ${dpu_lease_flags}
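The last two steps of the diagram (env vars turned into ovnkube flags) can be sketched in Go for illustration. The real logic lives in the shell script in 008-script-lib.yaml, which is not shown here, so this is only a model of the described behavior. The env var and flag names come from the commit message; the `--flag=value` form and the rule that a flag is emitted only for a non-empty variable are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// buildDPULeaseFlags models the flag-building step attributed to the node
// script. Assumption: a flag is emitted only when its environment variable
// is set to a non-empty value, and flags use the --name=value form.
func buildDPULeaseFlags(getenv func(string) string) []string {
	var flags []string
	if v := getenv("OVNKUBE_NODE_LEASE_RENEW_INTERVAL"); v != "" {
		flags = append(flags, "--dpu-node-lease-renew-interval="+v)
	}
	if v := getenv("OVNKUBE_NODE_LEASE_DURATION"); v != "" {
		flags = append(flags, "--dpu-node-lease-duration="+v)
	}
	return flags
}

func main() {
	// Reads the real process environment; prints an empty line if unset.
	fmt.Println(strings.Join(buildDPULeaseFlags(os.Getenv), " "))
}
```

Passing the lookup function as a parameter keeps the sketch testable without mutating the process environment.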
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 8 passed | ❌ 2 failed (2 warnings)
62c31b1 to
b5a3d66
Compare
|
/retest-required |
|
Blocked by k8snetworkplumbingwg/multus-cni#1490 |
|
@tsorya Could you please rebase this PR so that I can build an image to run some pre-merge testing?
1eb0381 to
6b9ed3a
Compare
done |
|
@yingwang-0320: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
   daemon-config.json: |
     {
-      "cniVersion": "0.3.1",
+      "cniVersion": "1.1.0",
Multus was updated to the new version which enables CNI status
Updated the PR description.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tsorya.
|
@tsorya: This pull request references NVIDIA-596, which is a valid Jira issue. Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/network/ovn_kubernetes.go`:
- Around line 1094-1102: If either ovnConfigResult.DpuNodeLeaseRenewInterval or
ovnConfigResult.DpuNodeLeaseDuration is 0 we should normalize both to 0 so the
disable semantics are consistent; update the logic around the current checks to
first detect if either field == 0 and set both
ovnConfigResult.DpuNodeLeaseRenewInterval = 0 and
ovnConfigResult.DpuNodeLeaseDuration = 0, otherwise keep the existing validation
that when both are non-zero and DpuNodeLeaseDuration <=
DpuNodeLeaseRenewInterval you log the warning and reset to
DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT and DPU_NODE_LEASE_DURATION_DEFAULT.
📒 Files selected for processing (3)
- bindata/network/ovn-kubernetes/common/008-script-lib.yaml
- hack/hardware-offload-config.yaml
- pkg/network/ovn_kubernetes.go
✅ Files skipped from review due to trivial changes (1)
- hack/hardware-offload-config.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- bindata/network/ovn-kubernetes/common/008-script-lib.yaml
// Setting either value to 0 disables the DPU health check.
// When both are non-zero, duration must be greater than interval.
if ovnConfigResult.DpuNodeLeaseRenewInterval != 0 && ovnConfigResult.DpuNodeLeaseDuration != 0 &&
    ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval {
    klog.Warningf("dpu-node-lease-duration-in-seconds (%d) must be greater than dpu-node-lease-renew-interval-in-seconds (%d), using defaults",
        ovnConfigResult.DpuNodeLeaseDuration, ovnConfigResult.DpuNodeLeaseRenewInterval)
    ovnConfigResult.DpuNodeLeaseRenewInterval = DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT
    ovnConfigResult.DpuNodeLeaseDuration = DPU_NODE_LEASE_DURATION_DEFAULT
}
Normalize disable semantics when either lease value is 0.
Line 1094 says either value disables health check, but current flow can still render flags when renew interval is non-zero and duration is 0. Please normalize both fields to 0 when either is 0 so behavior is consistent and not dependent on downstream flag parsing behavior.
Proposed fix
-// Setting either value to 0 disables the DPU health check.
-// When both are non-zero, duration must be greater than interval.
-if ovnConfigResult.DpuNodeLeaseRenewInterval != 0 && ovnConfigResult.DpuNodeLeaseDuration != 0 &&
-    ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval {
+// Setting either value to 0 disables the DPU health check.
+if ovnConfigResult.DpuNodeLeaseRenewInterval == 0 || ovnConfigResult.DpuNodeLeaseDuration == 0 {
+    ovnConfigResult.DpuNodeLeaseRenewInterval = 0
+    ovnConfigResult.DpuNodeLeaseDuration = 0
+} else if ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval {
+    // When both are non-zero, duration must be greater than interval.
     klog.Warningf("dpu-node-lease-duration-in-seconds (%d) must be greater than dpu-node-lease-renew-interval-in-seconds (%d), using defaults",
         ovnConfigResult.DpuNodeLeaseDuration, ovnConfigResult.DpuNodeLeaseRenewInterval)
     ovnConfigResult.DpuNodeLeaseRenewInterval = DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT
     ovnConfigResult.DpuNodeLeaseDuration = DPU_NODE_LEASE_DURATION_DEFAULT
 }
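The normalization requested in this review comment can be isolated as a small pure function, which makes the three cases easy to check. This is a minimal sketch of the reviewer's suggestion, not the code as merged: the function name `normalizeLease` and the lowercase constants are hypothetical, and the default values (10 and 40 seconds) are taken from the PR description.

```go
package main

import "fmt"

// Defaults as stated in the PR description (10s renew interval, 40s duration).
// The constant names are illustrative, not the ones used in the repository.
const (
	renewIntervalDefault = 10
	durationDefault      = 40
)

// normalizeLease applies the rules proposed in the review:
//   - if either value is 0, both become 0 (health check disabled);
//   - otherwise, if duration <= renew interval, fall back to the defaults;
//   - otherwise the values are kept as given.
func normalizeLease(renew, duration int) (int, int) {
	switch {
	case renew == 0 || duration == 0:
		return 0, 0
	case duration <= renew:
		return renewIntervalDefault, durationDefault
	default:
		return renew, duration
	}
}

func main() {
	for _, in := range [][2]int{{10, 40}, {10, 0}, {0, 40}, {30, 20}} {
		r, d := normalizeLease(in[0], in[1])
		fmt.Printf("renew=%d duration=%d -> renew=%d duration=%d\n", in[0], in[1], r, d)
	}
}
```

Checking the zero case first means the ordering check only ever sees two non-zero values, which is the consistency the comment asks for.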
|
7db199b to
556c81f
Compare
🧹 Nitpick comments (1)
pkg/network/ovn_kubernetes.go (1)
1077-1092: Consider more specific error messages for different failure cases.
The current warning logs the same message for both parse errors and negative values. A more descriptive message could help operators distinguish between a malformed value (e.g., "abc") and an invalid negative number (e.g., "-5").
♻️ Suggested improvement
 if val, exists := cm.Data["dpu-node-lease-renew-interval-in-seconds"]; exists {
     parsed, err := strconv.Atoi(val)
-    if err == nil && parsed >= 0 {
-        ovnConfigResult.DpuNodeLeaseRenewInterval = parsed
-    } else {
-        klog.Warningf("Invalid dpu-node-lease-renew-interval-in-seconds %q, using default %d", val, DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT)
+    if err != nil {
+        klog.Warningf("dpu-node-lease-renew-interval-in-seconds %q is not a valid integer, using default %d", val, DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT)
+    } else if parsed < 0 {
+        klog.Warningf("dpu-node-lease-renew-interval-in-seconds %q must be non-negative, using default %d", val, DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT)
+    } else {
+        ovnConfigResult.DpuNodeLeaseRenewInterval = parsed
     }
 }
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/network/ovn_kubernetes.go`:
- Around line 1077-1092: The warning messages for parsing
"dpu-node-lease-renew-interval-in-seconds" and
"dpu-node-lease-duration-in-seconds" conflate parse errors and negative-value
validation; update the logic in the block that reads cm.Data for those keys (the
code that sets ovnConfigResult.DpuNodeLeaseRenewInterval and
ovnConfigResult.DpuNodeLeaseDuration) to distinguish cases: if strconv.Atoi
returns an error, log a klog.Warningf indicating the value is malformed (include
the raw val and the corresponding default constant
DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT or DPU_NODE_LEASE_DURATION_DEFAULT), and
if parsed < 0, log a different klog.Warningf stating the value is
negative/invalid (include parsed and the default). Ensure behavior (only assign
when err==nil && parsed>=0) remains unchanged.
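The distinction this nitpick asks for can be sketched as a standalone helper shared by both keys. This is an illustration under assumptions, not the repository's code: the helper name `parseLeaseSeconds` is hypothetical, and it returns the warning text instead of calling klog so that the example has no external dependencies.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseLeaseSeconds parses raw as a non-negative integer number of seconds.
// On failure it returns def together with a warning that says whether the
// value was malformed or negative; on success the warning is empty.
func parseLeaseSeconds(key, raw string, def int) (int, string) {
	parsed, err := strconv.Atoi(raw)
	if err != nil {
		return def, fmt.Sprintf("%s %q is not a valid integer, using default %d", key, raw, def)
	}
	if parsed < 0 {
		return def, fmt.Sprintf("%s %d must be non-negative, using default %d", key, parsed, def)
	}
	return parsed, ""
}

func main() {
	const key = "dpu-node-lease-renew-interval-in-seconds"
	for _, raw := range []string{"15", "abc", "-5"} {
		value, warning := parseLeaseSeconds(key, raw, 10)
		fmt.Printf("input=%q value=%d warning=%q\n", raw, value, warning)
	}
}
```

As in the suggestion, a value is assigned only when parsing succeeds and the result is non-negative; only the wording of the warning changes.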
📒 Files selected for processing (10)
- bindata/network/multus/multus.yaml
- bindata/network/ovn-kubernetes/common/008-script-lib.yaml
- bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
- bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml
- hack/hardware-offload-config.yaml
- pkg/bootstrap/types.go
- pkg/network/kube_proxy_test.go
- pkg/network/ovn_kubernetes.go
- pkg/network/ovn_kubernetes_dpu_host_test.go
- pkg/network/ovn_kubernetes_test.go
✅ Files skipped from review due to trivial changes (2)
- pkg/network/kube_proxy_test.go
- bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
🚧 Files skipped from review as they are similar to previous changes (6)
- hack/hardware-offload-config.yaml
- bindata/network/multus/multus.yaml
- bindata/network/ovn-kubernetes/common/008-script-lib.yaml
- bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml
- pkg/network/ovn_kubernetes_dpu_host_test.go
- pkg/network/ovn_kubernetes_test.go
556c81f to
113ebfa
Compare
Add configurable DPU node lease health monitoring to detect when the DPU-side OVN-Kubernetes component is down or not installed. Without this, pods are scheduled to DPU-accelerated nodes regardless of DPU readiness, causing silent 2-minute CNI ADD timeouts with no visibility or automated remediation.
DPU lease configuration:
- Read dpu-node-lease-renew-interval and dpu-node-lease-duration from the hardware-offload-config ConfigMap (defaults: 10s / 40s).
- Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION env vars into ovnkube-controller for dpu-host/dpu node modes.
- Script-lib translates env vars into --dpu-node-lease-renew-interval and --dpu-node-lease-duration CLI flags for ovnkube-node.
- Setting renew-interval to 0 disables the health check; duration must always be > 0 (required by ovn-kubernetes).
- Lease namespace is derived via downward API (fieldRef).
Jira: https://issues.redhat.com/browse/NVIDIA-596
Made-with: Cursor
Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>

Required to support the CNI STATUS and GC verbs used by the DPU health check to report NetworkReady=false when the DPU lease expires. CNI 1.1.0 is backward compatible with 0.3.1; existing ADD/DEL/CHECK operations are unchanged. Upstream Multus fix merged: k8snetworkplumbingwg/multus-cni#1490.
Jira: https://issues.redhat.com/browse/NVIDIA-616
Made-with: Cursor
Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>
113ebfa to
e8d6ace
Compare
Done |
|
/retest-required |
|
@tsorya: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
NVIDIA-596: Enable DPU healthcheck and bump Multus CNI to 1.1.0
Add configurable DPU node lease health monitoring to detect when the
DPU-side OVN-Kubernetes component is down or not installed. Without
this, pods are scheduled to DPU-accelerated nodes regardless of DPU
readiness, causing silent 2-minute CNI ADD timeouts with no visibility
or automated remediation.
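DPU lease configuration:
- Read dpu-node-lease-renew-interval and dpu-node-lease-duration from the hardware-offload-config ConfigMap (defaults: 10s / 40s).
- Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION env vars into ovnkube-controller for dpu-host/dpu node modes.
- Script-lib translates env vars into --dpu-node-lease-renew-interval and --dpu-node-lease-duration CLI flags for ovnkube-node.
- Setting either value to 0 disables the health check; both values are normalized to 0 when either is 0.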
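The conditional env var injection described in this list can be illustrated with Go's text/template package. This is a sketch under assumptions: the YAML fragment and the `NodeMode` field are hypothetical stand-ins for the real ovnkube-node manifests, which are not shown in this conversation. The template keys DpuNodeLeaseRenewInterval and DpuNodeLeaseDuration and the env var names are the ones named in the walkthrough and commit message.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// envTemplate is a hypothetical fragment, not the manifest from the PR.
// It emits the two env entries only for the dpu-host and dpu node modes.
const envTemplate = `{{- if or (eq .NodeMode "dpu-host") (eq .NodeMode "dpu") }}
- name: OVNKUBE_NODE_LEASE_RENEW_INTERVAL
  value: "{{ .DpuNodeLeaseRenewInterval }}"
- name: OVNKUBE_NODE_LEASE_DURATION
  value: "{{ .DpuNodeLeaseDuration }}"
{{- end }}
`

// renderData carries the values as strings, matching the walkthrough's note
// that string values are exposed to templates. NodeMode is an assumed name.
type renderData struct {
	NodeMode                  string
	DpuNodeLeaseRenewInterval string
	DpuNodeLeaseDuration      string
}

// renderLeaseEnv renders the fragment and trims surrounding whitespace.
func renderLeaseEnv(d renderData) (string, error) {
	tmpl, err := template.New("env").Parse(envTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, d); err != nil {
		return "", err
	}
	return strings.TrimSpace(sb.String()), nil
}

func main() {
	for _, mode := range []string{"dpu-host", "full"} {
		out, err := renderLeaseEnv(renderData{
			NodeMode:                  mode,
			DpuNodeLeaseRenewInterval: "10",
			DpuNodeLeaseDuration:      "40",
		})
		if err != nil {
			panic(err)
		}
		fmt.Printf("mode=%s\n%s\n", mode, out)
	}
}
```

For any mode other than dpu-host or dpu, the fragment renders to an empty string, so the container would receive no lease env vars.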
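Bump Multus CNI API version to 1.1.0:
- Required to support the CNI STATUS and GC verbs used by the DPU health check to report NetworkReady=false when the DPU lease expires.
- CNI 1.1.0 is backward compatible with 0.3.1; existing ADD/DEL/CHECK operations are unchanged.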
Made-with: Cursor
Jira: https://issues.redhat.com/browse/NVIDIA-596