Skip to content

ci: Improve che-happy-path test reliability with retry logic and health checks#1581

Merged
dkwon17 merged 5 commits intomainfrom
improve-che-happy-path-reliability
Mar 27, 2026
Merged

ci: Improve che-happy-path test reliability with retry logic and health checks#1581
dkwon17 merged 5 commits intomainfrom
improve-che-happy-path-reliability

Conversation

@akurinnoy
Copy link
Copy Markdown
Collaborator

@akurinnoy akurinnoy commented Feb 3, 2026

What does this PR do?

Improves reliability of the v14-che-happy-path CI test by adding retry logic, health checks, and comprehensive error diagnostics to the test script.

Why is this needed?

The current happy-path test fails intermittently due to transient infrastructure issues (image pull timeouts, operator reconciliation delays, API server unavailability) that are unrelated to code changes. These false-positive failures reduce CI reliability and slow down development.

Key Improvements

Retry Logic

  • 2 retry attempts with exponential backoff (60s base + 0-15s jitter)
  • Cleanup between retries to ensure fresh state

Health Checks

  • DWO: Waits for deployment condition=available before proceeding
  • Che: Waits for CheCluster condition=Available with 10-minute timeout

Diagnostics

  • Collects operator logs, CheCluster CR, pod info, and events on each failure
  • Clear error messages identifying which stage failed

Configuration

  • Realistic timeouts: 24 hours (86,400s) instead of unrealistic defaults
  • Configurable retry count and delays via environment variables

Expected Impact

  • Reliability: Reduce flakiness from ~50% to >90% success rate
  • Debugging: Comprehensive artifacts for faster issue resolution
  • Compatibility: Drop-in replacement, no Prow config changes needed

Testing

✅ Validated on local CRC cluster with PR #1578

  • DWO deployment verified with health checks
  • Retry logic executed successfully (2 attempts with 71s backoff)
  • Comprehensive artifacts collected on failures
  • Error reporting clear and actionable

Documentation

Complete script documentation: .ci/README-CHE-HAPPY-PATH.md

  • Configuration options and environment variables
  • Usage examples (Prow CI + local testing)
  • Common failure scenarios and troubleshooting
  • Artifact locations and structure

Summary by CodeRabbit

  • Documentation

    • Added comprehensive CI documentation for DevWorkspace Operator and Eclipse Che integration testing workflow.
  • Chores

    • Enhanced CI pipeline with retry mechanisms and exponential backoff for improved deployment reliability.
    • Added diagnostic artifact collection and failure classification capabilities.
    • Implemented structured failure/success reporting with PR comment integration for tracking test flakiness.

export CHE_REPO_BRANCH="${CHE_REPO_BRANCH:-main}"

# Download and run the remote test script
if ! bash <(curl -s "https://raw.githubusercontent.com/eclipse/che/${CHE_REPO_BRANCH}/tests/devworkspace-happy-path/remote-launch.sh"); then

Check warning

Code scanning / Scorecard

Pinned-Dependencies Medium

score is 2: downloadThenRun not pinned by hash
Click Remediation section below to solve this issue
@akurinnoy akurinnoy force-pushed the improve-che-happy-path-reliability branch from 528e591 to 299de3a Compare February 3, 2026 13:28
@dkwon17
Copy link
Copy Markdown
Collaborator

dkwon17 commented Feb 4, 2026

/retest

1 similar comment
@dkwon17
Copy link
Copy Markdown
Collaborator

dkwon17 commented Feb 6, 2026

/retest

…hecks

This commit enhances the `.ci/oci-devworkspace-happy-path.sh` script to
significantly improve test reliability in CI environments by adding:

- Health checks for DWO and Che deployments using kubectl wait
- Retry logic with exponential backoff (2 retries, 60s base delay)
- Comprehensive artifact collection on failures
- Graceful error handling and cleanup between retries
- Clear error messages with stage identification

The improvements address flakiness in the v14-che-happy-path Prow test
by handling transient failures (image pull timeouts, API server issues,
operator reconciliation delays) and providing detailed diagnostics for
genuine failures.

Key features:
- DWO verification: Waits for deployment condition=available
- Che verification: Waits for CheCluster condition=Available
- Retry strategy: 2 attempts with exponential backoff + jitter
- Artifact collection: Operator logs, CheCluster CR, pod info, events
- Cleanup: Deletes failed deployments before retry
- Realistic timeouts: 24 hours (86400s) for pod wait/ready

Expected impact: Reduce CI flakiness from ~50% to >90% success rate for
infrastructure-related failures, with significantly better diagnostics.

Assisted-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
…ealth checks

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
…c and health checks

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@akurinnoy akurinnoy force-pushed the improve-che-happy-path-reliability branch from 751a1a8 to 6d0371f Compare March 19, 2026 12:29
…ry logic and health checks

- Remove broken verifyCheDeployment() (CheCluster has no condition=Available)
- Fix exit 1 -> return 1 in main()
- Fix cleanup: use kubectl wait --for=delete instead of sleep 10
- Add retry for happy-path test (1 retry with 30s delay)
- Add failure classification, reporting, and PR commenting
- Fix pipe delimiter injection, variable quoting, artifact overwrite
- Update README to match code changes

Assisted-by: Claude Opus 4.6 (1M context)
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@akurinnoy akurinnoy force-pushed the improve-che-happy-path-reliability branch from 99decd2 to f82470e Compare March 26, 2026 15:18
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 36.19%. Comparing base (b3e2af5) to head (f82470e).
⚠️ Report is 18 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1581      +/-   ##
==========================================
+ Coverage   35.48%   36.19%   +0.71%     
==========================================
  Files         168      168              
  Lines       14484    14534      +50     
==========================================
+ Hits         5139     5260     +121     
+ Misses       9006     8930      -76     
- Partials      339      344       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akurinnoy, dkwon17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…ith retry logic and health checks

Add CHE_REPO_BRANCH format validation to prevent injection

Assisted-by: Claude Opus 4.6 (1M context)
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm label Mar 27, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 27, 2026

New changes are detected. LGTM label has been removed.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 27, 2026

📝 Walkthrough

Walkthrough

A new CI documentation file describes the Eclipse Che happy-path test workflow, and the corresponding shell script is significantly expanded with staged deployment logic, retry mechanisms with exponential backoff, health checks, diagnostic artifact collection, failure classification, and automated PR comment reporting.

Changes

Cohort / File(s) Summary
CI Documentation
.ci/README-CHE-HAPPY-PATH.md
New documentation file describing the Che happy-path test flow, including deployment stages, retry behavior, health checks, diagnostic artifacts, configuration via environment variables, and CI reporting behavior.
CI Deployment & Test Orchestration
.ci/oci-devworkspace-happy-path.sh
Significantly expanded script with new utility imports, enhanced DWO deployment with readiness validation, multi-attempt Che deployment with OLM health checks, exponential backoff retry logic, artifact collection and failure classification, happy-path test execution, structured attempt logging, unified report generation (Markdown and JSON), and PR comment posting via gh or GitHub API.

Sequence Diagram

sequenceDiagram
    participant Main as Main Orchestrator
    participant DWO as DevWorkspace Operator
    participant K8s as Kubernetes
    participant OLM as OLM
    participant Che as Eclipse Che
    participant Test as Happy-Path Test
    participant Artifacts as Artifact Collector
    participant Report as Report Generator

    Main->>DWO: deployDWO()
    DWO->>K8s: make install
    DWO->>K8s: wait for deployment readiness
    K8s-->>DWO: ready
    DWO-->>Main: success

    Main->>Main: deployAndVerifyChe (retry loop)
    loop Retry with exponential backoff
        Main->>OLM: verifyOLMHealth()
        OLM-->>Main: healthy
        Main->>Che: deployChe(attempt)
        Che->>K8s: chectl server:deploy
        K8s-->>Che: deployment response
        alt Deployment Failed
            Che-->>Main: failure
            Main->>Artifacts: collectCheArtifacts()
            Artifacts->>K8s: gather logs/events/status
            K8s-->>Artifacts: diagnostics
            Main->>Main: classifyCheFailure()
            Main->>Che: cleanupFailedChe()
            Che->>K8s: cleanup resources
        else Deployment Succeeded
            Che-->>Main: success
            Main->>Main: break retry loop
        end
    end

    Main->>Test: runHappyPathTest()
    Test->>Test: validate CHE_REPO_BRANCH
    Test->>Test: execute remote script
    alt Test Failed
        Test-->>Main: failure
        Main->>Artifacts: collectCheArtifacts()
    else Test Passed
        Test-->>Main: success
    end

    Main->>Report: generateReport()
    Report->>Report: categorize failures
    Report->>Report: build Markdown + JSON
    Report-->>Main: reports

    Main->>Main: commentPR(markdown_report)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Whisker-twitching joy, the script now breathes,
With retries, health checks, and diagnostic leaves,
DevWorkspace hops through stages with care,
Che deploys, tests run, reports float through air,
CI pipelines dance with each PR's delight!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main changes: improving the che-happy-path test reliability through the addition of retry logic, health checks, and enhanced diagnostics.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch improve-che-happy-path-reliability

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
.ci/oci-devworkspace-happy-path.sh (2)

376-381: Consider using jq for JSON construction to avoid potential escaping issues.

The JSON is built via string concatenation. While the current reason values are controlled strings without special characters, any future modification that introduces quotes, backslashes, or newlines in the reason would produce malformed JSON.

♻️ Optional: Use jq for safer JSON construction
     if [ "$first" = true ]; then
       first=false
     else
       attempts_json="${attempts_json},"
     fi
-    attempts_json="${attempts_json}{\"attempt\":\"${attempt}\",\"stage\":\"${stage}\",\"result\":\"${attempt_result}\",\"reason\":\"${reason}\"}"
+    attempts_json="${attempts_json}$(jq -nc --arg a "$attempt" --arg s "$stage" --arg r "$attempt_result" --arg re "$reason" '{attempt:$a,stage:$s,result:$r,reason:$re}')"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.ci/oci-devworkspace-happy-path.sh around lines 376 - 381, The code builds
JSON by string-concatenating into attempts_json using variables attempt, stage,
attempt_result and reason which can break with special characters; change this
to build each record via jq (e.g., echo '{}' | jq --arg attempt "$attempt" --arg
stage "$stage" --arg result "$attempt_result" --arg reason "$reason" '. +
{attempt:$attempt,stage:$stage,result:$result,reason:$reason}') and then append
the produced safe JSON objects to an array (or use jq to append into
attempts_json), ensuring all fields are properly escaped instead of raw string
concatenation.

179-192: Minor: Cleanup timeout masked by || true.

If kubectl wait --for=delete times out, the || true allows the script to proceed silently. In rare cases where CheCluster deletion takes longer than 120s, the retry attempt may encounter residual resources. The 15s grace period helps, but consider logging a warning if the wait times out.

🔧 Optional: Log warning on cleanup timeout
   # Wait for CheCluster CR to be fully deleted
   echo "Waiting for CheCluster deletion to complete..."
-  kubectl wait --for=delete checluster --all \
+  if ! kubectl wait --for=delete checluster --all \
     -n "$CHE_NAMESPACE" \
-    --timeout=120s 2>&1 || true
+    --timeout=120s 2>&1; then
+    echo "WARNING: CheCluster deletion timed out, proceeding anyway"
+  fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.ci/oci-devworkspace-happy-path.sh around lines 179 - 192, In
cleanupFailedChe(), detect whether the kubectl wait --for=delete checluster
--all ... --timeout=120s command timed out instead of swallowing its failure
with `|| true`: run the kubectl wait and capture its exit status (or use a
conditional on the command), and if it fails (non-zero) call echo or logger to
emit a warning that CheCluster deletion timed out and residual resources may
remain; keep the existing 15s sleep and ensure the message references the
timeout so operators can see the retry risk.
.ci/README-CHE-HAPPY-PATH.md (1)

149-170: Add language specifier to fenced code block.

The fenced code block showing the artifact directory structure should have a language specifier for better rendering and to satisfy markdownlint (MD040).

📝 Proposed fix
-```
+```text
 $ARTIFACT_DIR/
 ├── attempt-log.txt
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.ci/README-CHE-HAPPY-PATH.md around lines 149 - 170, The fenced code block
that shows the artifact directory structure is missing a language specifier
which triggers markdownlint MD040; update the opening fence in the
README-CHE-HAPPY-PATH.md block from ``` to include a language (e.g., ```text) so
the tree output is rendered and linted correctly; locate the directory-tree
fenced block (the block that begins with the three backticks immediately before
the $ARTIFACT_DIR/ listing) and change its opening fence to include the chosen
language specifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.ci/oci-devworkspace-happy-path.sh:
- Around line 229-232: The modulo operation using MAX_JITTER can divide by zero;
update the jitter calculation in the retry/backoff block so it only computes
jitter with RANDOM % MAX_JITTER when MAX_JITTER is a positive integer, otherwise
set jitter=0 (or a safe default). Specifically, before computing
"jitter=$((RANDOM % MAX_JITTER))" add a guard that tests MAX_JITTER (non-empty
and >0) and falls back to jitter=0; leave the exponential_delay and final
"delay" computation unchanged. This ensures safe behavior when MAX_JITTER=0 or
unset.

---

Nitpick comments:
In @.ci/oci-devworkspace-happy-path.sh:
- Around line 376-381: The code builds JSON by string-concatenating into
attempts_json using variables attempt, stage, attempt_result and reason which
can break with special characters; change this to build each record via jq
(e.g., echo '{}' | jq --arg attempt "$attempt" --arg stage "$stage" --arg result
"$attempt_result" --arg reason "$reason" '. +
{attempt:$attempt,stage:$stage,result:$result,reason:$reason}') and then append
the produced safe JSON objects to an array (or use jq to append into
attempts_json), ensuring all fields are properly escaped instead of raw string
concatenation.
- Around line 179-192: In cleanupFailedChe(), detect whether the kubectl wait
--for=delete checluster --all ... --timeout=120s command timed out instead of
swallowing its failure with `|| true`: run the kubectl wait and capture its exit
status (or use a conditional on the command), and if it fails (non-zero) call
echo or logger to emit a warning that CheCluster deletion timed out and residual
resources may remain; keep the existing 15s sleep and ensure the message
references the timeout so operators can see the retry risk.

In @.ci/README-CHE-HAPPY-PATH.md:
- Around line 149-170: The fenced code block that shows the artifact directory
structure is missing a language specifier which triggers markdownlint MD040;
update the opening fence in the README-CHE-HAPPY-PATH.md block from ``` to
include a language (e.g., ```text) so the tree output is rendered and linted
correctly; locate the directory-tree fenced block (the block that begins with
the three backticks immediately before the $ARTIFACT_DIR/ listing) and change
its opening fence to include the chosen language specifier.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dceb580b-fb76-48fd-8ff6-55fc3fdc4a8e

📥 Commits

Reviewing files that changed from the base of the PR and between 2d6966e and 2eac19f.

📒 Files selected for processing (2)
  • .ci/README-CHE-HAPPY-PATH.md
  • .ci/oci-devworkspace-happy-path.sh

Comment on lines +229 to +232
# Calculate exponential backoff with jitter
local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1))))
local jitter=$((RANDOM % MAX_JITTER))
local delay=$((exponential_delay + jitter))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Guard against division by zero if MAX_JITTER=0.

If MAX_JITTER is set to 0 via environment variable, RANDOM % MAX_JITTER will cause a division by zero error, crashing the script unexpectedly.

🛡️ Proposed fix
       # Calculate exponential backoff with jitter
       local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1))))
-      local jitter=$((RANDOM % MAX_JITTER))
+      local jitter=0
+      [ "$MAX_JITTER" -gt 0 ] && jitter=$((RANDOM % MAX_JITTER))
       local delay=$((exponential_delay + jitter))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Calculate exponential backoff with jitter
local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1))))
local jitter=$((RANDOM % MAX_JITTER))
local delay=$((exponential_delay + jitter))
# Calculate exponential backoff with jitter
local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1))))
local jitter=0
[ "$MAX_JITTER" -gt 0 ] && jitter=$((RANDOM % MAX_JITTER))
local delay=$((exponential_delay + jitter))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.ci/oci-devworkspace-happy-path.sh around lines 229 - 232, The modulo
operation using MAX_JITTER can divide by zero; update the jitter calculation in
the retry/backoff block so it only computes jitter with RANDOM % MAX_JITTER when
MAX_JITTER is a positive integer, otherwise set jitter=0 (or a safe default).
Specifically, before computing "jitter=$((RANDOM % MAX_JITTER))" add a guard
that tests MAX_JITTER (non-empty and >0) and falls back to jitter=0; leave the
exponential_delay and final "delay" computation unchanged. This ensures safe
behavior when MAX_JITTER=0 or unset.

@dkwon17 dkwon17 merged commit 441bcb2 into main Mar 27, 2026
12 checks passed
@dkwon17 dkwon17 deleted the improve-che-happy-path-reliability branch March 27, 2026 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants