fix(SREP-4825: Fix stale UpgradeNodeDrainFailedSRE alert blocking upgrade completion) by devppratik · Pull Request #611 · openshift/managed-upgrade-operator

devppratik · 2026-05-10T11:47:36Z

What type of PR is this?

bug

What this PR does / why we need it?

When a node is deleted/replaced during upgrade, the upgradeoperator_node_drain_timeout metric persists in Prometheus, causing the UpgradeNodeDrainFailedSRE alert to continue firing for non-existent nodes. This blocks the ClusterHealthyAfterUpgrade stage indefinitely.

For detailed analysis refer to comment in JIRA

This commit changes the following

NodeKeeper controller now resets the node drain metric when a node deletion is detected during reconciliation, preventing stale metrics from accumulating.
PostUpgradeHealthCheck step resets all node drain metrics after successful upgrade completion, ensuring any edge-case stale metrics are cleaned up.

Both changes use the existing ResetAllMetricNodeDrainFailed() method and only affect node-specific metrics without impacting other alert functionality.

Which Jira/Github issue(s) this PR fixes?

Fixes #SREP-4825

Special notes for your reviewer:

Pre-checks (if applicable):

Tested latest changes against a cluster
Included documentation changes with PR

…rade completion)

openshift-ci-robot · 2026-05-10T11:47:40Z

@devppratik: This pull request references SREP-4825 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What type of PR is this?

bug

What this PR does / why we need it?

When a node is deleted/replaced during upgrade, the upgradeoperator_node_drain_timeout metric persists in Prometheus, causing the UpgradeNodeDrainFailedSRE alert to continue firing for non-existent nodes. This blocks the ClusterHealthyAfterUpgrade stage indefinitely.

This commit changes the following

NodeKeeper controller now resets the node drain metric when a node deletion is detected during reconciliation, preventing stale metrics from accumulating.

PostUpgradeHealthCheck step resets all node drain metrics after successful upgrade completion, ensuring any edge-case stale metrics are cleaned up.

Both changes use the existing ResetAllMetricNodeDrainFailed() method and only affect node-specific metrics without impacting other alert functionality.

Which Jira/Github issue(s) this PR fixes?

Fixes #SREP-4825

Special notes for your reviewer:

Pre-checks (if applicable):

Tested latest changes against a cluster

Included documentation changes with PR

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-05-10T11:47:49Z

Warning

Rate limit exceeded

@devppratik has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 51 minutes and 5 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2130d60a-9c9f-4af9-9afb-1106c29a768a

📥 Commits

Reviewing files that changed from the base of the PR and between 17d1d43 and da2b194.

📒 Files selected for processing (2)

controllers/nodekeeper/nodekeeper_controller.go
controllers/nodekeeper/nodekeeper_controller_test.go

Walkthrough

The PR adds node drain failure metric reset behavior in two places: when a node is deleted during reconciliation in the NodeKeeper controller, and after successful post-upgrade health checks complete. Tests are updated to expect the new metric calls.

Changes

Node Drain Metrics Reset

Layer / File(s)	Summary
Core Implementation `controllers/nodekeeper/nodekeeper_controller.go`, `pkg/upgraders/healthcheckstep.go`	NodeKeeper controller resets `NodeDrainFailed` metric when a target node is deleted (NotFound). PostUpgradeHealthCheck resets all node drain metrics after critical alerts and cluster operators checks pass.
Test Updates `controllers/nodekeeper/nodekeeper_controller_test.go`, `pkg/upgraders/healthcheckstep_test.go`	Test comment added for node deletion metric reset scenario. Mock assertions updated across four post-upgrade health check test cases to expect `ResetAllMetricNodeDrainFailed()` calls.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Test Structure And Quality	⚠️ Warning	Line 240 of nodekeeper_controller_test.go contains only a comment with no test implementation. Also, test assertions lack meaningful failure messages.	Add NotFound test case and meaningful assertion messages to Expect calls following Gomega patterns.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: fixing a stale alert issue that blocks upgrade completion by resetting node drain metrics when nodes are deleted or after successful upgrades.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All Ginkgo test names in modified test files are stable and deterministic. No new test titles were introduced, and existing test names contain only static, descriptive strings with no dynamic content.
Microshift Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added. Only unit tests were modified in controllers/ and pkg/ directories. The check applies only to e2e tests in test/e2e/, so it is not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	E2E tests added don't make explicit multi-node or HA assumptions. They test operator installation, RBAC, UpgradeConfig state, and metrics using standard OpenShift APIs available on SNO.
Topology-Aware Scheduling Compatibility	✅ Passed	PR introduces no scheduling constraints, deployment manifests, or topology assumptions. Changes are metrics handling in controller code only.
Ote Binary Stdout Contract	✅ Passed	PR introduces no OTE Binary Stdout Contract violations. All logging uses logr.Logger properly. No direct stdout writes in process-level code. No changes to main() initialization.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	The PR does not add new Ginkgo e2e tests. Changes are limited to unit tests in controllers/nodekeeper and pkg/upgraders packages. No e2e tests were modified. Check is not applicable.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-05-10T11:47:54Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: devppratik
Once this PR has been reviewed and has the lgtm label, please assign bmeng for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

controllers/nodekeeper/nodekeeper_controller.go (1)

75-80: ⚡ Quick win

Consider logging when metrics client creation fails.

Currently, if NewClient returns an error, the metric reset is silently skipped. While PostUpgradeHealthCheck will eventually clean up all node drain metrics, logging the error would improve observability when debugging why a metric wasn't immediately reset.

📊 Proposed logging enhancement

 if errors.IsNotFound(err) {
     // Node was deleted - reset the metric for this node to prevent stale alerts
     metricsClient, err := r.MetricsClientBuilder.NewClient(r.Client)
-    if err == nil {
+    if err != nil {
+        reqLogger.Error(err, fmt.Sprintf("Failed to create metrics client for deleted node %s, metric reset skipped", request.Name))
+    } else {
         reqLogger.Info(fmt.Sprintf("Node %s deleted, resetting NodeDrainFailed metric", request.Name))
         metricsClient.ResetMetricNodeDrainFailed(request.Name)
     }
     return reconcile.Result{}, nil
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controllers/nodekeeper/nodekeeper_controller.go` around lines 75 - 80, When
creating the metrics client via r.MetricsClientBuilder.NewClient(r.Client) in
the node-deleted branch, the code currently ignores the error; update the block
to log the failure when err != nil before skipping the
ResetMetricNodeDrainFailed call. Specifically, after calling
r.MetricsClientBuilder.NewClient(r.Client) and getting an error, emit a
structured log (e.g., reqLogger.Error(err, "failed to create metrics client for
resetting NodeDrainFailed", "node", request.Name)) so failures to obtain the
client are visible, while keeping the existing successful-path behavior that
calls metricsClient.ResetMetricNodeDrainFailed(request.Name).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/nodekeeper/nodekeeper_controller_test.go`:
- Line 240: Add a test in nodekeeper_controller_test.go that either removes the
stray comment or implements the missing scenario: mock the controller's kube
Client.Get to return a NotFound error (errors.IsNotFound) for the node request
and ensure the metrics client mock receives a call to ResetMetricNodeDrainFailed
with the expected node name; locate the relevant controller logic around
Client.Get and ResetMetricNodeDrainFailed in nodekeeper_controller.go and assert
the metric reset is invoked when Get returns NotFound, or if you prefer not to
add a test, delete the misleading comment.

---

Nitpick comments:
In `@controllers/nodekeeper/nodekeeper_controller.go`:
- Around line 75-80: When creating the metrics client via
r.MetricsClientBuilder.NewClient(r.Client) in the node-deleted branch, the code
currently ignores the error; update the block to log the failure when err != nil
before skipping the ResetMetricNodeDrainFailed call. Specifically, after calling
r.MetricsClientBuilder.NewClient(r.Client) and getting an error, emit a
structured log (e.g., reqLogger.Error(err, "failed to create metrics client for
resetting NodeDrainFailed", "node", request.Name)) so failures to obtain the
client are visible, while keeping the existing successful-path behavior that
calls metricsClient.ResetMetricNodeDrainFailed(request.Name).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 90c91649-0b6d-46bf-97b3-c70e630da936

📥 Commits

Reviewing files that changed from the base of the PR and between eb0dded and 17d1d43.

📒 Files selected for processing (4)

controllers/nodekeeper/nodekeeper_controller.go
controllers/nodekeeper/nodekeeper_controller_test.go
pkg/upgraders/healthcheckstep.go
pkg/upgraders/healthcheckstep_test.go

codecov-commenter · 2026-05-10T12:02:20Z

Codecov Report

❌ Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.75%. Comparing base (eb0dded) to head (da2b194).

Files with missing lines	Patch %	Lines
controllers/nodekeeper/nodekeeper_controller.go	50.00%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #611      +/-   ##
==========================================
+ Coverage   53.59%   53.75%   +0.15%     
==========================================
  Files         123      123              
  Lines        6165     6173       +8     
==========================================
+ Hits         3304     3318      +14     
+ Misses       2668     2662       -6     
  Partials      193      193

Files with missing lines	Coverage Δ
pkg/upgraders/healthcheckstep.go	`81.05% <100.00%> (+0.40%)`	⬆️
controllers/nodekeeper/nodekeeper_controller.go	`54.11% <50.00%> (+3.48%)`	⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

devppratik · 2026-05-20T15:19:48Z

/test validate

devppratik · 2026-05-21T16:41:28Z

/test validate

openshift-ci · 2026-05-21T16:44:49Z

@devppratik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/validate	`da2b194`	link	true	`/test validate`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

fix(SREP-4825: Fix stale UpgradeNodeDrainFailedSRE alert blocking upg…

17d1d43

…rade completion)

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 10, 2026

openshift-ci Bot requested review from bmeng and rbhilare May 10, 2026 11:47

coderabbitai Bot reviewed May 10, 2026

View reviewed changes

Comment thread controllers/nodekeeper/nodekeeper_controller_test.go Outdated

fix coderabbit comments

da2b194

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(SREP-4825: Fix stale UpgradeNodeDrainFailedSRE alert blocking upgrade completion)#611

fix(SREP-4825: Fix stale UpgradeNodeDrainFailedSRE alert blocking upgrade completion)#611
devppratik wants to merge 2 commits into
openshift:masterfrom
devppratik:SREP-4825

devppratik commented May 10, 2026 •

edited

Loading

Uh oh!

openshift-ci-robot commented May 10, 2026 •

edited by atlassian Bot

Loading

What type of PR is this?

What this PR does / why we need it?

Which Jira/Github issue(s) this PR fixes?

Special notes for your reviewer:

Pre-checks (if applicable):

Uh oh!

coderabbitai Bot commented May 10, 2026 •

edited

Loading

Rate limit exceeded

❌ Failed checks (1 warning)

Uh oh!

openshift-ci Bot commented May 10, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

codecov-commenter commented May 10, 2026

Uh oh!

devppratik commented May 20, 2026

Uh oh!

devppratik commented May 21, 2026

Uh oh!

openshift-ci Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

devppratik commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it?

Which Jira/Github issue(s) this PR fixes?

Special notes for your reviewer:

Pre-checks (if applicable):

Uh oh!

openshift-ci-robot commented May 10, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it?

Which Jira/Github issue(s) this PR fixes?

Special notes for your reviewer:

Pre-checks (if applicable):

Uh oh!

coderabbitai Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

openshift-ci Bot commented May 10, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov-commenter commented May 10, 2026

Codecov Report

Uh oh!

devppratik commented May 20, 2026

Uh oh!

devppratik commented May 21, 2026

Uh oh!

openshift-ci Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

devppratik commented May 10, 2026 •

edited

Loading

openshift-ci-robot commented May 10, 2026 •

edited by atlassian Bot

Loading

coderabbitai Bot commented May 10, 2026 •

edited

Loading