Skip to content

fix(SREP-4825: Fix stale UpgradeNodeDrainFailedSRE alert blocking upgrade completion)#611

Open
devppratik wants to merge 2 commits into
openshift:masterfrom
devppratik:SREP-4825
Open

fix(SREP-4825: Fix stale UpgradeNodeDrainFailedSRE alert blocking upgrade completion)#611
devppratik wants to merge 2 commits into
openshift:masterfrom
devppratik:SREP-4825

Conversation

@devppratik
Copy link
Copy Markdown
Contributor

@devppratik devppratik commented May 10, 2026

What type of PR is this?

bug

What this PR does / why we need it?

When a node is deleted/replaced during upgrade, the upgradeoperator_node_drain_timeout metric persists in Prometheus, causing the UpgradeNodeDrainFailedSRE alert to continue firing for non-existent nodes. This blocks the ClusterHealthyAfterUpgrade stage indefinitely.

For detailed analysis refer to comment in JIRA

This commit changes the following

  1. NodeKeeper controller now resets the node drain metric when a node deletion is detected during reconciliation, preventing stale metrics from accumulating.
  2. PostUpgradeHealthCheck step resets all node drain metrics after successful upgrade completion, ensuring any edge-case stale metrics are cleaned up.

Both changes use the existing ResetAllMetricNodeDrainFailed() method and only affect node-specific metrics without impacting other alert functionality.

Which Jira/Github issue(s) this PR fixes?

Fixes #SREP-4825

Special notes for your reviewer:

Pre-checks (if applicable):

  • Tested latest changes against a cluster
  • Included documentation changes with PR

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 10, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 10, 2026

@devppratik: This pull request references SREP-4825 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What type of PR is this?

bug

What this PR does / why we need it?

When a node is deleted/replaced during upgrade, the upgradeoperator_node_drain_timeout metric persists in Prometheus, causing the UpgradeNodeDrainFailedSRE alert to continue firing for non-existent nodes. This blocks the ClusterHealthyAfterUpgrade stage indefinitely.

This commit changes the following

  1. NodeKeeper controller now resets the node drain metric when a node deletion is detected during reconciliation, preventing stale metrics from accumulating.
  2. PostUpgradeHealthCheck step resets all node drain metrics after successful upgrade completion, ensuring any edge-case stale metrics are cleaned up.

Both changes use the existing ResetAllMetricNodeDrainFailed() method and only affect node-specific metrics without impacting other alert functionality.

Which Jira/Github issue(s) this PR fixes?

Fixes #SREP-4825

Special notes for your reviewer:

Pre-checks (if applicable):

  • Tested latest changes against a cluster
  • Included documentation changes with PR

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 10, 2026

Warning

Rate limit exceeded

@devppratik has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 51 minutes and 5 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2130d60a-9c9f-4af9-9afb-1106c29a768a

📥 Commits

Reviewing files that changed from the base of the PR and between 17d1d43 and da2b194.

📒 Files selected for processing (2)
  • controllers/nodekeeper/nodekeeper_controller.go
  • controllers/nodekeeper/nodekeeper_controller_test.go

Walkthrough

The PR adds node drain failure metric reset behavior in two places: when a node is deleted during reconciliation in the NodeKeeper controller, and after successful post-upgrade health checks complete. Tests are updated to expect the new metric calls.

Changes

Node Drain Metrics Reset

Layer / File(s) Summary
Core Implementation
controllers/nodekeeper/nodekeeper_controller.go, pkg/upgraders/healthcheckstep.go
NodeKeeper controller resets NodeDrainFailed metric when a target node is deleted (NotFound). PostUpgradeHealthCheck resets all node drain metrics after critical alerts and cluster operators checks pass.
Test Updates
controllers/nodekeeper/nodekeeper_controller_test.go, pkg/upgraders/healthcheckstep_test.go
Test comment added for node deletion metric reset scenario. Mock assertions updated across four post-upgrade health check test cases to expect ResetAllMetricNodeDrainFailed() calls.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Line 240 of nodekeeper_controller_test.go contains only a comment with no test implementation. Also, test assertions lack meaningful failure messages. Add NotFound test case and meaningful assertion messages to Expect calls following Gomega patterns.
✅ Passed checks (11 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: fixing a stale alert issue that blocks upgrade completion by resetting node drain metrics when nodes are deleted or after successful upgrades.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All Ginkgo test names in modified test files are stable and deterministic. No new test titles were introduced, and existing test names contain only static, descriptive strings with no dynamic content.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests were added. Only unit tests were modified in controllers/ and pkg/ directories. The check applies only to e2e tests in test/e2e/, so it is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed E2E tests added don't make explicit multi-node or HA assumptions. They test operator installation, RBAC, UpgradeConfig state, and metrics using standard OpenShift APIs available on SNO.
Topology-Aware Scheduling Compatibility ✅ Passed PR introduces no scheduling constraints, deployment manifests, or topology assumptions. Changes are metrics handling in controller code only.
Ote Binary Stdout Contract ✅ Passed PR introduces no OTE Binary Stdout Contract violations. All logging uses logr.Logger properly. No direct stdout writes in process-level code. No changes to main() initialization.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The PR does not add new Ginkgo e2e tests. Changes are limited to unit tests in controllers/nodekeeper and pkg/upgraders packages. No e2e tests were modified. Check is not applicable.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from bmeng and rbhilare May 10, 2026 11:47
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: devppratik
Once this PR has been reviewed and has the lgtm label, please assign bmeng for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
controllers/nodekeeper/nodekeeper_controller.go (1)

75-80: ⚡ Quick win

Consider logging when metrics client creation fails.

Currently, if NewClient returns an error, the metric reset is silently skipped. While PostUpgradeHealthCheck will eventually clean up all node drain metrics, logging the error would improve observability when debugging why a metric wasn't immediately reset.

📊 Proposed logging enhancement
 if errors.IsNotFound(err) {
     // Node was deleted - reset the metric for this node to prevent stale alerts
     metricsClient, err := r.MetricsClientBuilder.NewClient(r.Client)
-    if err == nil {
+    if err != nil {
+        reqLogger.Error(err, fmt.Sprintf("Failed to create metrics client for deleted node %s, metric reset skipped", request.Name))
+    } else {
         reqLogger.Info(fmt.Sprintf("Node %s deleted, resetting NodeDrainFailed metric", request.Name))
         metricsClient.ResetMetricNodeDrainFailed(request.Name)
     }
     return reconcile.Result{}, nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controllers/nodekeeper/nodekeeper_controller.go` around lines 75 - 80, When
creating the metrics client via r.MetricsClientBuilder.NewClient(r.Client) in
the node-deleted branch, the code currently ignores the error; update the block
to log the failure when err != nil before skipping the
ResetMetricNodeDrainFailed call. Specifically, after calling
r.MetricsClientBuilder.NewClient(r.Client) and getting an error, emit a
structured log (e.g., reqLogger.Error(err, "failed to create metrics client for
resetting NodeDrainFailed", "node", request.Name)) so failures to obtain the
client are visible, while keeping the existing successful-path behavior that
calls metricsClient.ResetMetricNodeDrainFailed(request.Name).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/nodekeeper/nodekeeper_controller_test.go`:
- Line 240: Add a test in nodekeeper_controller_test.go that either removes the
stray comment or implements the missing scenario: mock the controller's kube
Client.Get to return a NotFound error (errors.IsNotFound) for the node request
and ensure the metrics client mock receives a call to ResetMetricNodeDrainFailed
with the expected node name; locate the relevant controller logic around
Client.Get and ResetMetricNodeDrainFailed in nodekeeper_controller.go and assert
the metric reset is invoked when Get returns NotFound, or if you prefer not to
add a test, delete the misleading comment.

---

Nitpick comments:
In `@controllers/nodekeeper/nodekeeper_controller.go`:
- Around line 75-80: When creating the metrics client via
r.MetricsClientBuilder.NewClient(r.Client) in the node-deleted branch, the code
currently ignores the error; update the block to log the failure when err != nil
before skipping the ResetMetricNodeDrainFailed call. Specifically, after calling
r.MetricsClientBuilder.NewClient(r.Client) and getting an error, emit a
structured log (e.g., reqLogger.Error(err, "failed to create metrics client for
resetting NodeDrainFailed", "node", request.Name)) so failures to obtain the
client are visible, while keeping the existing successful-path behavior that
calls metricsClient.ResetMetricNodeDrainFailed(request.Name).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 90c91649-0b6d-46bf-97b3-c70e630da936

📥 Commits

Reviewing files that changed from the base of the PR and between eb0dded and 17d1d43.

📒 Files selected for processing (4)
  • controllers/nodekeeper/nodekeeper_controller.go
  • controllers/nodekeeper/nodekeeper_controller_test.go
  • pkg/upgraders/healthcheckstep.go
  • pkg/upgraders/healthcheckstep_test.go

Comment thread controllers/nodekeeper/nodekeeper_controller_test.go Outdated
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.75%. Comparing base (eb0dded) to head (da2b194).

Files with missing lines Patch % Lines
controllers/nodekeeper/nodekeeper_controller.go 50.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #611      +/-   ##
==========================================
+ Coverage   53.59%   53.75%   +0.15%     
==========================================
  Files         123      123              
  Lines        6165     6173       +8     
==========================================
+ Hits         3304     3318      +14     
+ Misses       2668     2662       -6     
  Partials      193      193              
Files with missing lines Coverage Δ
pkg/upgraders/healthcheckstep.go 81.05% <100.00%> (+0.40%) ⬆️
controllers/nodekeeper/nodekeeper_controller.go 54.11% <50.00%> (+3.48%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@devppratik
Copy link
Copy Markdown
Contributor Author

/test validate

1 similar comment
@devppratik
Copy link
Copy Markdown
Contributor Author

/test validate

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 21, 2026

@devppratik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/validate da2b194 link true /test validate

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants