fix(SREP-4825: Fix stale UpgradeNodeDrainFailedSRE alert blocking upgrade completion) #611
devppratik wants to merge 2 commits
@devppratik: This pull request references SREP-4825, which is a valid Jira issue.
Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.
Walkthrough
The PR adds node drain failure metric reset behavior in two places: when a node is deleted during reconciliation in the NodeKeeper controller, and after successful post-upgrade health checks complete. Tests are updated to expect the new metric calls.
Changes
Node Drain Metrics Reset
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks | ✅ 11 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: devppratik
Actionable comments posted: 1
🧹 Nitpick comments (1)
controllers/nodekeeper/nodekeeper_controller.go (1)
75-80: ⚡ Quick win
Consider logging when metrics client creation fails.
Currently, if `NewClient` returns an error, the metric reset is silently skipped. While `PostUpgradeHealthCheck` will eventually clean up all node drain metrics, logging the error would improve observability when debugging why a metric wasn't immediately reset.
📊 Proposed logging enhancement
 if errors.IsNotFound(err) {
 	// Node was deleted - reset the metric for this node to prevent stale alerts
 	metricsClient, err := r.MetricsClientBuilder.NewClient(r.Client)
-	if err == nil {
+	if err != nil {
+		reqLogger.Error(err, fmt.Sprintf("Failed to create metrics client for deleted node %s, metric reset skipped", request.Name))
+	} else {
 		reqLogger.Info(fmt.Sprintf("Node %s deleted, resetting NodeDrainFailed metric", request.Name))
 		metricsClient.ResetMetricNodeDrainFailed(request.Name)
 	}
 	return reconcile.Result{}, nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controllers/nodekeeper/nodekeeper_controller.go` around lines 75 - 80, When creating the metrics client via r.MetricsClientBuilder.NewClient(r.Client) in the node-deleted branch, the code currently ignores the error; update the block to log the failure when err != nil before skipping the ResetMetricNodeDrainFailed call. Specifically, after calling r.MetricsClientBuilder.NewClient(r.Client) and getting an error, emit a structured log (e.g., reqLogger.Error(err, "failed to create metrics client for resetting NodeDrainFailed", "node", request.Name)) so failures to obtain the client are visible, while keeping the existing successful-path behavior that calls metricsClient.ResetMetricNodeDrainFailed(request.Name).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@controllers/nodekeeper/nodekeeper_controller_test.go`:
- Line 240: Add a test in nodekeeper_controller_test.go that either removes the
stray comment or implements the missing scenario: mock the controller's kube
Client.Get to return a NotFound error (errors.IsNotFound) for the node request
and ensure the metrics client mock receives a call to ResetMetricNodeDrainFailed
with the expected node name; locate the relevant controller logic around
Client.Get and ResetMetricNodeDrainFailed in nodekeeper_controller.go and assert
the metric reset is invoked when Get returns NotFound, or if you prefer not to
add a test, delete the misleading comment.
---
Nitpick comments:
In `@controllers/nodekeeper/nodekeeper_controller.go`:
- Around line 75-80: When creating the metrics client via
r.MetricsClientBuilder.NewClient(r.Client) in the node-deleted branch, the code
currently ignores the error; update the block to log the failure when err != nil
before skipping the ResetMetricNodeDrainFailed call. Specifically, after calling
r.MetricsClientBuilder.NewClient(r.Client) and getting an error, emit a
structured log (e.g., reqLogger.Error(err, "failed to create metrics client for
resetting NodeDrainFailed", "node", request.Name)) so failures to obtain the
client are visible, while keeping the existing successful-path behavior that
calls metricsClient.ResetMetricNodeDrainFailed(request.Name).
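The missing NotFound scenario requested in the inline comment could be covered roughly as follows. This is a minimal, self-contained sketch using hand-written fakes, not the repository's actual test setup (which may use a mocking framework). `errNotFound`, `nodeGetter`, `metricsResetter`, `reconcileNode`, `fakeGetter`, and `recordingMetrics` are all hypothetical names introduced for illustration; only `ResetMetricNodeDrainFailed` comes from the review above. In the real controller the check is `errors.IsNotFound` on the error returned by `Client.Get`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for the Kubernetes NotFound API
// error that the real controller detects with errors.IsNotFound.
var errNotFound = errors.New("node not found")

// nodeGetter and metricsResetter are simplified, assumed interfaces that
// replace the kube client and the operator's metrics client in this sketch.
type nodeGetter interface {
	GetNode(name string) error
}

type metricsResetter interface {
	ResetMetricNodeDrainFailed(node string)
}

// reconcileNode sketches only the node-deleted branch under review.
func reconcileNode(g nodeGetter, m metricsResetter, name string) error {
	err := g.GetNode(name)
	if errors.Is(err, errNotFound) {
		// Node was deleted: clear its metric so no stale alert remains.
		m.ResetMetricNodeDrainFailed(name)
		return nil
	}
	return err
}

// fakeGetter always reports the node as missing.
type fakeGetter struct{}

func (fakeGetter) GetNode(string) error { return errNotFound }

// recordingMetrics remembers which nodes had their metric reset.
type recordingMetrics struct{ resets []string }

func (r *recordingMetrics) ResetMetricNodeDrainFailed(node string) {
	r.resets = append(r.resets, node)
}

func main() {
	rec := &recordingMetrics{}
	err := reconcileNode(fakeGetter{}, rec, "worker-a")
	// A passing scenario: no error, and exactly one reset for "worker-a".
	fmt.Println(err == nil, rec.resets)
}
```

The assertion of interest is that the reset is called exactly once with the deleted node's name and that reconciliation returns without an error or requeue.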
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 90c91649-0b6d-46bf-97b3-c70e630da936
📒 Files selected for processing (4)
controllers/nodekeeper/nodekeeper_controller.go
controllers/nodekeeper/nodekeeper_controller_test.go
pkg/upgraders/healthcheckstep.go
pkg/upgraders/healthcheckstep_test.go
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #611 +/- ##
==========================================
+ Coverage 53.59% 53.75% +0.15%
==========================================
Files 123 123
Lines 6165 6173 +8
==========================================
+ Hits 3304 3318 +14
+ Misses 2668 2662 -6
Partials 193 193
/test validate
@devppratik: The following test failed, say
What type of PR is this?
bug
What this PR does / why we need it?
When a node is deleted or replaced during an upgrade, the `upgradeoperator_node_drain_timeout` metric persists in Prometheus, causing the UpgradeNodeDrainFailedSRE alert to continue firing for nodes that no longer exist. This blocks the ClusterHealthyAfterUpgrade stage indefinitely.
For a detailed analysis, refer to the comment in Jira.
This commit changes the following
Both changes use the existing ResetAllMetricNodeDrainFailed() method and only affect node-specific metrics without impacting other alert functionality.
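To illustrate why a stale series keeps the alert firing and how a reset clears it, here is a minimal, self-contained sketch. It does not use the operator's real metrics client or Prometheus. `drainMetrics`, `markDrainFailed`, `alertFiring`, and the node names are hypothetical stand-ins, and the alert condition is a simplified assumption ("fires while any per-node series is non-zero"). Only the method names `ResetMetricNodeDrainFailed` and `ResetAllMetricNodeDrainFailed` are taken from this PR's discussion.

```go
package main

import "fmt"

// drainMetrics is a hypothetical in-memory stand-in for the per-node
// drain-failure gauge; it is not the operator's real metrics client.
type drainMetrics struct {
	failed map[string]float64 // node name -> 1 when a drain failure was recorded
}

func newDrainMetrics() *drainMetrics {
	return &drainMetrics{failed: map[string]float64{}}
}

// markDrainFailed (hypothetical) records a drain failure for one node.
func (m *drainMetrics) markDrainFailed(node string) { m.failed[node] = 1 }

// ResetMetricNodeDrainFailed drops the series for a single node.
func (m *drainMetrics) ResetMetricNodeDrainFailed(node string) {
	delete(m.failed, node)
}

// ResetAllMetricNodeDrainFailed drops every per-node series.
func (m *drainMetrics) ResetAllMetricNodeDrainFailed() {
	m.failed = map[string]float64{}
}

// alertFiring mimics an alert rule that fires while any series is non-zero.
func (m *drainMetrics) alertFiring() bool {
	for _, v := range m.failed {
		if v > 0 {
			return true
		}
	}
	return false
}

func main() {
	m := newDrainMetrics()
	m.markDrainFailed("worker-a")

	// worker-a is deleted and replaced, but nothing clears its series,
	// so the alert condition still holds for a node that no longer exists.
	fmt.Println("after node deletion, no reset:", m.alertFiring())

	// Reset on node deletion.
	m.ResetMetricNodeDrainFailed("worker-a")
	fmt.Println("after per-node reset:", m.alertFiring())

	// Reset after a successful post-upgrade health check.
	m.markDrainFailed("worker-b")
	m.ResetAllMetricNodeDrainFailed()
	fmt.Println("after reset-all:", m.alertFiring())
}
```

The per-node reset handles the deletion case as soon as it is observed, while the reset-all after the health check acts as a backstop for any series that was missed.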
Which Jira/Github issue(s) this PR fixes?
Fixes #SREP-4825
Special notes for your reviewer:
Pre-checks (if applicable):