NVIDIA-596: Enable dpu healthcheck #2941

Open
tsorya wants to merge 2 commits into openshift:master from tsorya:jkary-dpu-health-check

@tsorya
Contributor

@tsorya tsorya commented Mar 19, 2026

NVIDIA-596: Enable DPU healthcheck and bump Multus CNI to 1.1.0
Add configurable DPU node lease health monitoring to detect when the
DPU-side OVN-Kubernetes component is down or not installed. Without
this, pods are scheduled to DPU-accelerated nodes regardless of DPU
readiness, causing silent 2-minute CNI ADD timeouts with no visibility
or automated remediation.

DPU lease configuration:

  • Read dpu-node-lease-renew-interval and dpu-node-lease-duration from
    the hardware-offload-config ConfigMap (defaults: 10s / 40s).
  • Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION
    env vars into ovnkube-controller for dpu-host/dpu node modes.
  • Script-lib translates env vars into --dpu-node-lease-renew-interval
    and --dpu-node-lease-duration CLI flags for ovnkube-node.
  • Setting either value to 0 disables the health check; both are
    normalized to 0 when either is 0.
  • Lease namespace is derived via downward API (fieldRef).
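
The env-var-to-flag translation described in the list above can be sketched in Go. This is only an illustration: the real logic lives in the shell script in 008-script-lib.yaml, the helper name `dpuLeaseFlags` is hypothetical, and passing each flag and its value as separate arguments is an assumption. Only the env var names and flag names come from the description.

```go
package main

import (
	"fmt"
	"os"
)

// dpuLeaseFlags is a hypothetical helper that mirrors what the description
// says script-lib does: turn the two lease env var values into ovnkube CLI
// flags. A flag is emitted only when its value is non-empty (assumption).
func dpuLeaseFlags(renewInterval, duration string) []string {
	var flags []string
	if renewInterval != "" {
		flags = append(flags, "--dpu-node-lease-renew-interval", renewInterval)
	}
	if duration != "" {
		flags = append(flags, "--dpu-node-lease-duration", duration)
	}
	return flags
}

func main() {
	// The env vars are only injected for dpu-host/dpu node modes, so in
	// other modes both lookups return "" and no flags are produced.
	flags := dpuLeaseFlags(
		os.Getenv("OVNKUBE_NODE_LEASE_RENEW_INTERVAL"),
		os.Getenv("OVNKUBE_NODE_LEASE_DURATION"),
	)
	fmt.Println(flags)
}
```

With this shape, a node that never receives the env vars starts ovnkube with its existing arguments unchanged.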

Bump Multus CNI API version to 1.1.0:

Made-with: Cursor

Jira: https://issues.redhat.com/browse/NVIDIA-596

Summary by CodeRabbit

  • New Features

    • Added DPU node-lease configuration with configurable renew-interval and duration, applied conditionally for DPU node modes and exposed to node agent.
    • Added default lease values and validation to ensure sane renew/duration settings.
    • Updated Multus CNI spec to v1.1.0.
  • Tests

    • Added tests verifying lease env var rendering and behavior across deployment modes.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 19, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 19, 2026

@tsorya: This pull request references NVIDIA-596 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

NVIDIA-596: pass DPU lease config via env vars on dpu-host/dpu DaemonSets

Add configurable DPU node lease renew interval and duration as env vars on ovnkube-controller, gated to dpu-host/dpu modes. Script-lib builds CLI flags from env vars. Values read from hardware-offload-config ConfigMap with defaults 10s/40s. Setting either to 0 disables the health check. Lease namespace derived via fieldRef.

Jira: https://issues.redhat.com/browse/NVIDIA-596

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Mar 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Add DPU node lease configuration: new bootstrap fields and defaults, parse/validate ConfigMap keys, expose string values to templates, inject env vars into ovnkube DaemonSet for DPU modes, pass lease flags from node script to ovnkube, add ConfigMap defaults and tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Bootstrap & render: pkg/network/ovn_kubernetes.go, pkg/bootstrap/types.go | Add DpuNodeLeaseRenewInterval and DpuNodeLeaseDuration fields and defaults; read/validate hardware-offload-config keys and expose stringified values to template render data. |
| Templates / Manifests: bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml, bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml | Conditionally inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION env vars into ovnkube-controller when .OVN_NODE_MODE is dpu-host or dpu and renew interval ≠ "0". |
| Node script (ovnkube CLI): bindata/network/ovn-kubernetes/common/008-script-lib.yaml | Introduce dpu_lease_flags, append --dpu-node-lease-renew-interval/--dpu-node-lease-duration when env vars set, and include ${dpu_lease_flags} in the ovnkube command args. |
| Default ConfigMap: hack/hardware-offload-config.yaml | Add dpu-node-lease-renew-interval-in-seconds: "10" and dpu-node-lease-duration-in-seconds: "40" to ConfigMap data. |
| Tests & fixtures: pkg/network/.../kube_proxy_test.go, pkg/network/ovn_kubernetes_test.go, pkg/network/ovn_kubernetes_dpu_host_test.go | Update fixtures to include new lease fields with defaults; add tests that render templates and assert presence/absence and exact values of lease env vars for full, dpu-host, and dpu modes. |
| Other manifest: bindata/network/multus/multus.yaml | Bump embedded daemon-config.json cniVersion from "0.3.1" to "1.1.0". |
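
The conditional env var injection summarized for the templates can be approximated with Go's text/template. This is a hedged sketch, not the manifest from the PR: the template body, the render-data key names `DpuNodeLeaseRenewInterval` and `DpuNodeLeaseDuration`, and the helper `renderLeaseEnv` are assumptions. Only `.OVN_NODE_MODE`, the env var names, and the gating condition come from the summary.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// leaseEnvTmpl is an illustrative fragment: the env vars are rendered only
// for dpu-host/dpu modes and only when the renew interval is not "0".
const leaseEnvTmpl = `{{- if and (or (eq .OVN_NODE_MODE "dpu-host") (eq .OVN_NODE_MODE "dpu")) (ne .DpuNodeLeaseRenewInterval "0") }}
- name: OVNKUBE_NODE_LEASE_RENEW_INTERVAL
  value: "{{ .DpuNodeLeaseRenewInterval }}"
- name: OVNKUBE_NODE_LEASE_DURATION
  value: "{{ .DpuNodeLeaseDuration }}"
{{- end }}`

// renderLeaseEnv is a hypothetical helper that renders the fragment for one
// node mode and a pair of stringified lease values.
func renderLeaseEnv(mode, renew, duration string) (string, error) {
	tmpl, err := template.New("lease-env").Parse(leaseEnvTmpl)
	if err != nil {
		return "", err
	}
	data := map[string]string{
		"OVN_NODE_MODE":             mode,
		"DpuNodeLeaseRenewInterval": renew,
		"DpuNodeLeaseDuration":      duration,
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	for _, mode := range []string{"full", "dpu-host", "dpu"} {
		out, err := renderLeaseEnv(mode, "10", "40")
		if err != nil {
			panic(err)
		}
		fmt.Printf("mode=%s rendered=%q\n", mode, out)
	}
}
```

Passing the values as strings matches the summary's note that stringified values are exposed to the template, which keeps the `ne ... "0"` comparison a plain string check.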

Sequence Diagram(s)

sequenceDiagram
    participant ConfigMap as hardware-offload ConfigMap
    participant Bootstrap as bootstrapOVNConfig
    participant Renderer as template renderer
    participant K8sAPI as Kubernetes API (DaemonSet)
    participant NodeScript as ovnkube node script
    participant ovnkube as ovnkube process

    ConfigMap->>Bootstrap: read dpu-node-lease-* keys
    Bootstrap->>Bootstrap: parse & validate values
    Bootstrap->>Renderer: provide DpuNodeLeaseRenewInterval/DpuNodeLeaseDuration (strings)
    Renderer->>K8sAPI: create/update DaemonSet with env vars (conditional on node mode)
    K8sAPI->>NodeScript: schedule/run ovnkube node script (on node)
    NodeScript->>NodeScript: build dpu_lease_flags from env vars
    NodeScript->>ovnkube: invoke ovnkube with ${dpu_lease_flags}

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 8 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Test code violates requirement 4 (Assertion Messages) and requirement 5 (Consistency with Codebase) due to multiple assertions without meaningful failure messages. | Add meaningful failure messages to all bare Gomega assertions in both TestOVNKubernetesLeaseEnvVars and TestDpuLeaseConfig tests to match established codebase patterns. |

✅ Passed checks (8 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'NVIDIA-596: Enable dpu healthcheck' clearly and specifically describes the main objective of this pull request: enabling DPU (Data Processing Unit) health check functionality, which is the primary purpose across all the file changes. |
| Stable And Deterministic Test Names | ✅ Passed | Two new Go test functions with static, deterministic names follow stable naming conventions. Test cases use static string names defined in struct literals with no dynamic data, ensuring consistency across runs. |
| Microshift Test Compatibility | ✅ Passed | PR adds only standard Go unit tests in pkg/network package using testing package, not Ginkgo e2e tests. Tests cover configuration rendering and environment variables, all compatible with MicroShift. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | Tests added are standard Go unit tests using testing.T, not Ginkgo e2e tests. SNO compatibility check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR changes add only DPU lease configuration via environment variables with no scheduling constraints, affinity rules, replica logic, or topology-dependent features. |
| Ote Binary Stdout Contract | ✅ Passed | The PR changes do not violate the OTE Binary Stdout Contract. Modifications use klog.Warningf() for validation logging (writes to stderr), with no direct stdout writes in process-level code. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The new tests render Kubernetes DaemonSet templates and verify environment variables using only fake/mock clients with no external connectivity, confirming IPv6-only and disconnected CI environment compatibility. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from jcaamano and pperiyasamy March 19, 2026 04:15
@tsorya tsorya marked this pull request as draft March 19, 2026 12:04
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2026
@tsorya tsorya force-pushed the jkary-dpu-health-check branch from 62c31b1 to b5a3d66 Compare March 20, 2026 03:45
@tsorya tsorya marked this pull request as ready for review March 20, 2026 03:46
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 20, 2026
@openshift-ci openshift-ci bot requested review from danwinship and pliurh March 20, 2026 03:46
@tsorya
Contributor Author

tsorya commented Mar 20, 2026

/retest-required

@tsorya
Contributor Author

tsorya commented Mar 20, 2026

Blocked by k8snetworkplumbingwg/multus-cni#1490

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2026
@yingwang-0320

@tsorya Could you please help rebase this PR, then I can build an image to run some pre-merge testing.

@tsorya tsorya force-pushed the jkary-dpu-health-check branch from 1eb0381 to 6b9ed3a Compare March 31, 2026 02:12
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2026
@tsorya
Contributor Author

tsorya commented Mar 31, 2026

@tsorya Could you please help rebase this PR, then I can build an image to run some pre-merge testing.

done

@yingwang-0320

/verified by pre-merge testing.
@tsorya I built an image with this PR and ran the CNO and multicast cases; all passed.
However, I can't build an image with both #2941 and #2944 because there's a conflict in the file bindata/network/ovn-kubernetes/common/008-script-lib.yaml

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 31, 2026
@openshift-ci-robot
Contributor

@yingwang-0320: This PR has been marked as verified by pre-merge testing.

Details

In response to this:

/verified by pre-merge testing.
@tsorya I built an image with this PR and ran the CNO and multicast cases; all passed.
However, I can't build an image with both #2941 and #2944 because there's a conflict in the file bindata/network/ovn-kubernetes/common/008-script-lib.yaml

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Comment thread bindata/network/ovn-kubernetes/common/008-script-lib.yaml Outdated
Comment thread bindata/network/ovn-kubernetes/common/008-script-lib.yaml Outdated
Comment thread bindata/network/ovn-kubernetes/common/008-script-lib.yaml Outdated
Comment thread hack/hardware-offload-config.yaml
   daemon-config.json: |
     {
-      "cniVersion": "0.3.1",
+      "cniVersion": "1.1.0",
Contributor

why?

Contributor Author

Multus was updated to the new version, which enables CNI status.

Contributor Author

Updated PR description

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Apr 16, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tsorya
Once this PR has been reviewed and has the lgtm label, please assign tssurya for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 16, 2026

@tsorya: This pull request references NVIDIA-596 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

NVIDIA-596: pass DPU lease config via env vars on dpu-host/dpu DaemonSets

Add configurable DPU node lease renew interval and duration as env vars on ovnkube-controller, gated to dpu-host/dpu modes. Script-lib builds CLI flags from env vars. Values read from hardware-offload-config ConfigMap with defaults 10s/40s. Setting either to 0 disables the health check. Lease namespace derived via fieldRef.

Jira: https://issues.redhat.com/browse/NVIDIA-596

Summary by CodeRabbit

  • New Features

  • Added DPU node lease configuration support with customizable renewal intervals and durations for improved stability in hardware-accelerated networking environments

  • Updated Multus CNI plugin to support specification version 1.1.0

  • Tests

  • Added test coverage for DPU node lease environment variable configuration across different deployment modes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/network/ovn_kubernetes.go`:
- Around line 1094-1102: If either ovnConfigResult.DpuNodeLeaseRenewInterval or
ovnConfigResult.DpuNodeLeaseDuration is 0 we should normalize both to 0 so the
disable semantics are consistent; update the logic around the current checks to
first detect if either field == 0 and set both
ovnConfigResult.DpuNodeLeaseRenewInterval = 0 and
ovnConfigResult.DpuNodeLeaseDuration = 0, otherwise keep the existing validation
that when both are non-zero and DpuNodeLeaseDuration <=
DpuNodeLeaseRenewInterval you log the warning and reset to
DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT and DPU_NODE_LEASE_DURATION_DEFAULT.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9e7b87eb-40e5-48db-92c5-608250f639d9

📥 Commits

Reviewing files that changed from the base of the PR and between 6b9ed3a and 7db199b.

📒 Files selected for processing (3)
  • bindata/network/ovn-kubernetes/common/008-script-lib.yaml
  • hack/hardware-offload-config.yaml
  • pkg/network/ovn_kubernetes.go
✅ Files skipped from review due to trivial changes (1)
  • hack/hardware-offload-config.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • bindata/network/ovn-kubernetes/common/008-script-lib.yaml

Comment thread pkg/network/ovn_kubernetes.go Outdated
Comment on lines +1094 to +1102
		// Setting either value to 0 disables the DPU health check.
		// When both are non-zero, duration must be greater than interval.
		if ovnConfigResult.DpuNodeLeaseRenewInterval != 0 && ovnConfigResult.DpuNodeLeaseDuration != 0 &&
			ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval {
			klog.Warningf("dpu-node-lease-duration-in-seconds (%d) must be greater than dpu-node-lease-renew-interval-in-seconds (%d), using defaults",
				ovnConfigResult.DpuNodeLeaseDuration, ovnConfigResult.DpuNodeLeaseRenewInterval)
			ovnConfigResult.DpuNodeLeaseRenewInterval = DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT
			ovnConfigResult.DpuNodeLeaseDuration = DPU_NODE_LEASE_DURATION_DEFAULT
		}

⚠️ Potential issue | 🟠 Major

Normalize disable semantics when either lease value is 0.

Line 1094 says either value disables health check, but current flow can still render flags when renew interval is non-zero and duration is 0. Please normalize both fields to 0 when either is 0 so behavior is consistent and not dependent on downstream flag parsing behavior.

Proposed fix
-		// Setting either value to 0 disables the DPU health check.
-		// When both are non-zero, duration must be greater than interval.
-		if ovnConfigResult.DpuNodeLeaseRenewInterval != 0 && ovnConfigResult.DpuNodeLeaseDuration != 0 &&
-			ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval {
+		// Setting either value to 0 disables the DPU health check.
+		if ovnConfigResult.DpuNodeLeaseRenewInterval == 0 || ovnConfigResult.DpuNodeLeaseDuration == 0 {
+			ovnConfigResult.DpuNodeLeaseRenewInterval = 0
+			ovnConfigResult.DpuNodeLeaseDuration = 0
+		} else if ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval {
+			// When both are non-zero, duration must be greater than interval.
 			klog.Warningf("dpu-node-lease-duration-in-seconds (%d) must be greater than dpu-node-lease-renew-interval-in-seconds (%d), using defaults",
 				ovnConfigResult.DpuNodeLeaseDuration, ovnConfigResult.DpuNodeLeaseRenewInterval)
 			ovnConfigResult.DpuNodeLeaseRenewInterval = DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT
 			ovnConfigResult.DpuNodeLeaseDuration = DPU_NODE_LEASE_DURATION_DEFAULT
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-		// Setting either value to 0 disables the DPU health check.
-		// When both are non-zero, duration must be greater than interval.
-		if ovnConfigResult.DpuNodeLeaseRenewInterval != 0 && ovnConfigResult.DpuNodeLeaseDuration != 0 &&
-			ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval {
-			klog.Warningf("dpu-node-lease-duration-in-seconds (%d) must be greater than dpu-node-lease-renew-interval-in-seconds (%d), using defaults",
-				ovnConfigResult.DpuNodeLeaseDuration, ovnConfigResult.DpuNodeLeaseRenewInterval)
-			ovnConfigResult.DpuNodeLeaseRenewInterval = DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT
-			ovnConfigResult.DpuNodeLeaseDuration = DPU_NODE_LEASE_DURATION_DEFAULT
-		}
+		// Setting either value to 0 disables the DPU health check.
+		if ovnConfigResult.DpuNodeLeaseRenewInterval == 0 || ovnConfigResult.DpuNodeLeaseDuration == 0 {
+			ovnConfigResult.DpuNodeLeaseRenewInterval = 0
+			ovnConfigResult.DpuNodeLeaseDuration = 0
+		} else if ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval {
+			// When both are non-zero, duration must be greater than interval.
+			klog.Warningf("dpu-node-lease-duration-in-seconds (%d) must be greater than dpu-node-lease-renew-interval-in-seconds (%d), using defaults",
+				ovnConfigResult.DpuNodeLeaseDuration, ovnConfigResult.DpuNodeLeaseRenewInterval)
+			ovnConfigResult.DpuNodeLeaseRenewInterval = DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT
+			ovnConfigResult.DpuNodeLeaseDuration = DPU_NODE_LEASE_DURATION_DEFAULT
+		}
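
To make the normalization semantics easier to check in isolation, the suggested logic can be pulled into a small pure function. This is a sketch under assumptions: `normalizeDpuLease` and the lowercase constant names are hypothetical, the default values 10 and 40 are taken from the PR description, and the klog warning from the real code is omitted to keep the example self-contained.

```go
package main

import "fmt"

// Defaults as stated in the PR description (10s renew interval, 40s duration).
// The constant names here are illustrative, not the ones used in the PR.
const (
	defaultRenewInterval = 10
	defaultDuration      = 40
)

// normalizeDpuLease is a hypothetical standalone version of the suggested
// validation: a zero in either field disables the health check by zeroing
// both; otherwise the duration must exceed the renew interval, or both
// values fall back to the defaults.
func normalizeDpuLease(renewInterval, duration int) (int, int) {
	if renewInterval == 0 || duration == 0 {
		return 0, 0
	}
	if duration <= renewInterval {
		return defaultRenewInterval, defaultDuration
	}
	return renewInterval, duration
}

func main() {
	inputs := [][2]int{{10, 40}, {10, 0}, {0, 40}, {30, 20}}
	for _, in := range inputs {
		r, d := normalizeDpuLease(in[0], in[1])
		fmt.Printf("renew=%d duration=%d -> renew=%d duration=%d\n", in[0], in[1], r, d)
	}
}
```

Because both fields end up as 0 together, a template gate that inspects only the renew interval still sees the disabled state when just the duration was set to 0.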

@danwinship
Contributor

  • please squash the new commit back into the first commit.
  • and update the multus commit to have a commit message explaining why you changed it
  • look into what coderabbit said
  • if you want to undo the renaming ("-in-seconds") you can... I guess there's an argument to be made for being consistent with the ovn-k option name

@tsorya tsorya force-pushed the jkary-dpu-health-check branch from 7db199b to 556c81f Compare April 16, 2026 23:38
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 16, 2026

@tsorya: This pull request references NVIDIA-596 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

NVIDIA-596: pass DPU lease config via env vars on dpu-host/dpu DaemonSets

NVIDIA-596: Enable DPU healthcheck and bump Multus CNI to 1.1.0
Add configurable DPU node lease health monitoring to detect when the
DPU-side OVN-Kubernetes component is down or not installed. Without
this, pods are scheduled to DPU-accelerated nodes regardless of DPU
readiness, causing silent 2-minute CNI ADD timeouts with no visibility
or automated remediation.

DPU lease configuration:

  • Read dpu-node-lease-renew-interval and dpu-node-lease-duration from
    the hardware-offload-config ConfigMap (defaults: 10s / 40s).
  • Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION
    env vars into ovnkube-controller for dpu-host/dpu node modes.
  • Script-lib translates env vars into --dpu-node-lease-renew-interval
    and --dpu-node-lease-duration CLI flags for ovnkube-node.
  • Setting either value to 0 disables the health check; both are
    normalized to 0 when either is 0.
  • Lease namespace is derived via downward API (fieldRef).

Bump Multus CNI API version to 1.1.0:

Made-with: Cursor

Jira: https://issues.redhat.com/browse/NVIDIA-596

Summary by CodeRabbit

  • New Features

  • Added DPU node lease configuration support with customizable renewal intervals and durations for improved stability in hardware-accelerated networking environments

  • Updated Multus CNI plugin to support specification version 1.1.0

  • Tests

  • Added test coverage for DPU node lease environment variable configuration across different deployment modes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 16, 2026

@tsorya: This pull request references NVIDIA-596 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

NVIDIA-596: pass DPU lease config via env vars on dpu-host/dpu DaemonSets

NVIDIA-596: Enable DPU healthcheck and bump Multus CNI to 1.1.0
Add configurable DPU node lease health monitoring to detect when the
DPU-side OVN-Kubernetes component is down or not installed. Without
this, pods are scheduled to DPU-accelerated nodes regardless of DPU
readiness, causing silent 2-minute CNI ADD timeouts with no visibility
or automated remediation.

DPU lease configuration:

  • Read dpu-node-lease-renew-interval and dpu-node-lease-duration from
    the hardware-offload-config ConfigMap (defaults: 10s / 40s).
  • Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION
    env vars into ovnkube-controller for dpu-host/dpu node modes.
  • Script-lib translates env vars into --dpu-node-lease-renew-interval
    and --dpu-node-lease-duration CLI flags for ovnkube-node.
  • Setting either value to 0 disables the health check; both are
    normalized to 0 when either is 0.
  • Lease namespace is derived via downward API (fieldRef).

Bump Multus CNI API version to 1.1.0:

Made-with: Cursor

Jira: https://issues.redhat.com/browse/NVIDIA-596

Summary by CodeRabbit

  • New Features

  • Added DPU node-lease configuration with configurable renew-interval and duration, applied conditionally for DPU node modes and exposed to node agent.

  • Added default lease values and validation to ensure sane renew/duration settings.

  • Updated Multus CNI spec to v1.1.0.

  • Tests

  • Added tests verifying lease env var rendering and behavior across deployment modes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
pkg/network/ovn_kubernetes.go (1)

1077-1092: Consider more specific error messages for different failure cases.

The current warning logs the same message for both parse errors and negative values. A more descriptive message could help operators distinguish between a malformed value (e.g., "abc") and an invalid negative number (e.g., "-5").

♻️ Suggested improvement
 if val, exists := cm.Data["dpu-node-lease-renew-interval-in-seconds"]; exists {
     parsed, err := strconv.Atoi(val)
-    if err == nil && parsed >= 0 {
-        ovnConfigResult.DpuNodeLeaseRenewInterval = parsed
-    } else {
-        klog.Warningf("Invalid dpu-node-lease-renew-interval-in-seconds %q, using default %d", val, DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT)
+    if err != nil {
+        klog.Warningf("dpu-node-lease-renew-interval-in-seconds %q is not a valid integer, using default %d", val, DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT)
+    } else if parsed < 0 {
+        klog.Warningf("dpu-node-lease-renew-interval-in-seconds %q must be non-negative, using default %d", val, DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT)
+    } else {
+        ovnConfigResult.DpuNodeLeaseRenewInterval = parsed
     }
 }
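
The distinction this nitpick asks for can be sketched as a self-contained helper that returns the warning text instead of logging it. `parseLeaseSeconds` is a hypothetical name, and returning the message as a string (rather than calling klog.Warningf as the real code does) is a simplification so the example runs without external dependencies.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseLeaseSeconds is a hypothetical helper: it parses a ConfigMap value
// and returns either the parsed value with an empty warning, or the default
// with a warning that says whether the value was malformed or negative.
func parseLeaseSeconds(key, val string, def int) (int, string) {
	parsed, err := strconv.Atoi(val)
	if err != nil {
		return def, fmt.Sprintf("%s %q is not a valid integer, using default %d", key, val, def)
	}
	if parsed < 0 {
		return def, fmt.Sprintf("%s %q must be non-negative, using default %d", key, val, def)
	}
	return parsed, ""
}

func main() {
	const key = "dpu-node-lease-renew-interval-in-seconds"
	for _, val := range []string{"25", "abc", "-5"} {
		got, warning := parseLeaseSeconds(key, val, 10)
		fmt.Printf("input=%q value=%d warning=%q\n", val, got, warning)
	}
}
```

The assignment behavior is unchanged from the original check: a value is accepted only when it parses and is non-negative.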
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/network/ovn_kubernetes.go` around lines 1077 - 1092, The warning messages
for parsing "dpu-node-lease-renew-interval-in-seconds" and
"dpu-node-lease-duration-in-seconds" conflate parse errors and negative-value
validation; update the logic in the block that reads cm.Data for those keys (the
code that sets ovnConfigResult.DpuNodeLeaseRenewInterval and
ovnConfigResult.DpuNodeLeaseDuration) to distinguish cases: if strconv.Atoi
returns an error, log a klog.Warningf indicating the value is malformed (include
the raw val and the corresponding default constant
DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT or DPU_NODE_LEASE_DURATION_DEFAULT), and
if parsed < 0, log a different klog.Warningf stating the value is
negative/invalid (include parsed and the default). Ensure behavior (only assign
when err==nil && parsed>=0) remains unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 519e2181-fd4f-42bd-83ee-ac609d2937f5

📥 Commits

Reviewing files that changed from the base of the PR and between 7db199b and 556c81f.

📒 Files selected for processing (10)
  • bindata/network/multus/multus.yaml
  • bindata/network/ovn-kubernetes/common/008-script-lib.yaml
  • bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
  • bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml
  • hack/hardware-offload-config.yaml
  • pkg/bootstrap/types.go
  • pkg/network/kube_proxy_test.go
  • pkg/network/ovn_kubernetes.go
  • pkg/network/ovn_kubernetes_dpu_host_test.go
  • pkg/network/ovn_kubernetes_test.go
✅ Files skipped from review due to trivial changes (2)
  • pkg/network/kube_proxy_test.go
  • bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
🚧 Files skipped from review as they are similar to previous changes (6)
  • hack/hardware-offload-config.yaml
  • bindata/network/multus/multus.yaml
  • bindata/network/ovn-kubernetes/common/008-script-lib.yaml
  • bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml
  • pkg/network/ovn_kubernetes_dpu_host_test.go
  • pkg/network/ovn_kubernetes_test.go

@tsorya tsorya force-pushed the jkary-dpu-health-check branch from 556c81f to 113ebfa Compare April 17, 2026 00:20
tsorya added 2 commits April 16, 2026 20:28
Add configurable DPU node lease health monitoring to detect when the
DPU-side OVN-Kubernetes component is down or not installed. Without
this, pods are scheduled to DPU-accelerated nodes regardless of DPU
readiness, causing silent 2-minute CNI ADD timeouts with no visibility
or automated remediation.

DPU lease configuration:
- Read dpu-node-lease-renew-interval and dpu-node-lease-duration from
  the hardware-offload-config ConfigMap (defaults: 10s / 40s).
- Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION
  env vars into ovnkube-controller for dpu-host/dpu node modes.
- Script-lib translates env vars into --dpu-node-lease-renew-interval
  and --dpu-node-lease-duration CLI flags for ovnkube-node.
- Setting renew-interval to 0 disables the health check; duration
  must always be > 0 (required by ovn-kubernetes).
- Lease namespace is derived via downward API (fieldRef).

Jira: https://issues.redhat.com/browse/NVIDIA-596
Made-with: Cursor
Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>
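
The lease rules stated in the commit message above (a renew interval of 0 disables the health check, and the duration must always be greater than 0) can be sketched in Go as follows. This is an illustration under stated assumptions, not the operator's implementation: `leaseConfig`, `validate`, and `healthCheckEnabled` are hypothetical names, and rejecting a negative renew interval is an assumption carried over from the review suggestion.

```go
package main

import (
	"errors"
	"fmt"
)

// leaseConfig is a hypothetical container for the two lease settings.
type leaseConfig struct {
	RenewIntervalSeconds int
	DurationSeconds      int
}

// healthCheckEnabled reports whether the DPU health check would run.
// Per the commit message, a renew interval of 0 disables it.
func (c leaseConfig) healthCheckEnabled() bool {
	return c.RenewIntervalSeconds > 0
}

// validate enforces the constraints from the commit message:
// the duration must be > 0 even when the health check is disabled.
func (c leaseConfig) validate() error {
	if c.RenewIntervalSeconds < 0 {
		// Assumption: negative intervals are rejected.
		return errors.New("renew interval must be non-negative")
	}
	if c.DurationSeconds <= 0 {
		return errors.New("duration must be greater than 0")
	}
	return nil
}

func main() {
	configs := []leaseConfig{
		{RenewIntervalSeconds: 10, DurationSeconds: 40}, // defaults from the commit message
		{RenewIntervalSeconds: 0, DurationSeconds: 40},  // health check disabled
		{RenewIntervalSeconds: 10, DurationSeconds: 0},  // invalid duration
	}
	for _, c := range configs {
		if err := c.validate(); err != nil {
			fmt.Printf("%+v: invalid: %v\n", c, err)
			continue
		}
		fmt.Printf("%+v: health check enabled = %t\n", c, c.healthCheckEnabled())
	}
}
```
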
Required to support the CNI STATUS and GC verbs used by the DPU
health check to report NetworkReady=false when the DPU lease expires.
CNI 1.1.0 is backward compatible with 0.3.1; existing ADD/DEL/CHECK
operations are unchanged.

Upstream Multus fix merged: k8snetworkplumbingwg/multus-cni#1490.

Jira: https://issues.redhat.com/browse/NVIDIA-616
Made-with: Cursor
Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>
@tsorya tsorya force-pushed the jkary-dpu-health-check branch from 113ebfa to e8d6ace Compare April 17, 2026 00:31
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 17, 2026

@tsorya: This pull request references NVIDIA-596 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set.


@tsorya
Contributor Author

tsorya commented Apr 17, 2026

  • please squash the new commit back into the first commit.
  • and update the multus commit to have a commit message explaining why you changed it
  • look into what coderabbit said
  • if you want to undo the renaming ("-in-seconds") you can... I guess there's an argument to be made for being consistent with the ovn-k option name

Done

@tsorya
Contributor Author

tsorya commented Apr 19, 2026

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Apr 19, 2026

@tsorya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/security | e8d6ace | link | false | /test security |
| ci/prow/e2e-aws-ovn-rhcos10-techpreview | e8d6ace | link | false | /test e2e-aws-ovn-rhcos10-techpreview |
| ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw | e8d6ace | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw |


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

4 participants