ci: Improve che-happy-path test reliability with retry logic and health checks by akurinnoy · Pull Request #1581 · devfile/devworkspace-operator

akurinnoy · 2026-02-03T13:26:52Z

What does this PR do?

Improves reliability of the v14-che-happy-path CI test by adding retry logic, health checks, and comprehensive error diagnostics to the test script.

Why is this needed?

The current happy-path test fails intermittently due to transient infrastructure issues (image pull timeouts, operator reconciliation delays, API server unavailability) that are unrelated to code changes. These false-positive failures reduce CI reliability and slow down development.

Key Improvements

Retry Logic

2 retry attempts with exponential backoff (60s base + 0-15s jitter)
Cleanup between retries to ensure fresh state

Health Checks

DWO: Waits for deployment condition=available before proceeding
Che: Waits for CheCluster condition=Available with 10-minute timeout

Diagnostics

Collects operator logs, CheCluster CR, pod info, and events on each failure
Clear error messages identifying which stage failed

Configuration

Realistic timeouts: 24 hours (86,400s) instead of unrealistic defaults
Configurable retry count and delays via environment variables

Expected Impact

Reliability: Reduce flakiness from ~50% to >90% success rate
Debugging: Comprehensive artifacts for faster issue resolution
Compatibility: Drop-in replacement, no Prow config changes needed

Testing

✅ Validated on local CRC cluster with PR #1578

DWO deployment verified with health checks
Retry logic executed successfully (2 attempts with 71s backoff)
Comprehensive artifacts collected on failures
Error reporting clear and actionable

Documentation

Complete script documentation: .ci/README-CHE-HAPPY-PATH.md

Configuration options and environment variables
Usage examples (Prow CI + local testing)
Common failure scenarios and troubleshooting
Artifact locations and structure

Summary by CodeRabbit

Documentation
- Added comprehensive CI documentation for DevWorkspace Operator and Eclipse Che integration testing workflow.
Chores
- Enhanced CI pipeline with retry mechanisms and exponential backoff for improved deployment reliability.
- Added diagnostic artifact collection and failure classification capabilities.
- Implemented structured failure/success reporting with PR comment integration for tracking test flakiness.

.ci/oci-devworkspace-happy-path.sh

+  export CHE_REPO_BRANCH="${CHE_REPO_BRANCH:-main}"
+
+  # Download and run the remote test script
+  if ! bash <(curl -s "https://raw.githubusercontent.com/eclipse/che/${CHE_REPO_BRANCH}/tests/devworkspace-happy-path/remote-launch.sh"); then


.ci/oci-devworkspace-happy-path.sh

dkwon17 · 2026-02-04T02:23:19Z

/retest

dkwon17 · 2026-02-06T15:14:39Z

/retest

…hecks This commit enhances the `.ci/oci-devworkspace-happy-path.sh` script to significantly improve test reliability in CI environments by adding: - Health checks for DWO and Che deployments using kubectl wait - Retry logic with exponential backoff (2 retries, 60s base delay) - Comprehensive artifact collection on failures - Graceful error handling and cleanup between retries - Clear error messages with stage identification The improvements address flakiness in the v14-che-happy-path Prow test by handling transient failures (image pull timeouts, API server issues, operator reconciliation delays) and providing detailed diagnostics for genuine failures. Key features: - DWO verification: Waits for deployment condition=available - Che verification: Waits for CheCluster condition=Available - Retry strategy: 2 attempts with exponential backoff + jitter - Artifact collection: Operator logs, CheCluster CR, pod info, events - Cleanup: Deletes failed deployments before retry - Realistic timeouts: 24 hours (86400s) for pod wait/ready Expected impact: Reduce CI flakiness from ~50% to >90% success rate for infrastructure-related failures, with significantly better diagnostics. Assisted-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-Authored-By: Oleksii Kurinnyi <okurinny@redhat.com> Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

…ealth checks Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

…c and health checks Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

…ry logic and health checks - Remove broken verifyCheDeployment() (CheCluster has no condition=Available) - Fix exit 1 -> return 1 in main() - Fix cleanup: use kubectl wait --for=delete instead of sleep 10 - Add retry for happy-path test (1 retry with 30s delay) - Add failure classification, reporting, and PR commenting - Fix pipe delimiter injection, variable quoting, artifact overwrite - Update README to match code changes Assisted-by: Claude Opus 4.6 (1M context) Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

codecov · 2026-03-26T16:33:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 36.19%. Comparing base (b3e2af5) to head (f82470e).
⚠️ Report is 18 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1581      +/-   ##
==========================================
+ Coverage   35.48%   36.19%   +0.71%     
==========================================
  Files         168      168              
  Lines       14484    14534      +50     
==========================================
+ Hits         5139     5260     +121     
+ Misses       9006     8930      -76     
- Partials      339      344       +5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

openshift-ci · 2026-03-26T21:09:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akurinnoy, dkwon17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [dkwon17]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

.ci/oci-devworkspace-happy-path.sh

…ith retry logic and health checks Add CHE_REPO_BRANCH format validation to prevent injection Assisted-by: Claude Opus 4.6 (1M context) Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

openshift-ci · 2026-03-27T14:10:17Z

New changes are detected. LGTM label has been removed.

coderabbitai · 2026-03-27T14:10:28Z

📝 Walkthrough

Walkthrough

A new CI documentation file describes the Eclipse Che happy-path test workflow, and the corresponding shell script is significantly expanded with staged deployment logic, retry mechanisms with exponential backoff, health checks, diagnostic artifact collection, failure classification, and automated PR comment reporting.

Changes

Cohort / File(s)	Summary
CI Documentation `.ci/README-CHE-HAPPY-PATH.md`	New documentation file describing the Che happy-path test flow, including deployment stages, retry behavior, health checks, diagnostic artifacts, configuration via environment variables, and CI reporting behavior.
CI Deployment & Test Orchestration `.ci/oci-devworkspace-happy-path.sh`	Significantly expanded script with new utility imports, enhanced DWO deployment with readiness validation, multi-attempt Che deployment with OLM health checks, exponential backoff retry logic, artifact collection and failure classification, happy-path test execution, structured attempt logging, unified report generation (Markdown and JSON), and PR comment posting via `gh` or GitHub API.

Sequence Diagram

sequenceDiagram
    participant Main as Main Orchestrator
    participant DWO as DevWorkspace Operator
    participant K8s as Kubernetes
    participant OLM as OLM
    participant Che as Eclipse Che
    participant Test as Happy-Path Test
    participant Artifacts as Artifact Collector
    participant Report as Report Generator

    Main->>DWO: deployDWO()
    DWO->>K8s: make install
    DWO->>K8s: wait for deployment readiness
    K8s-->>DWO: ready
    DWO-->>Main: success

    Main->>Main: deployAndVerifyChe (retry loop)
    loop Retry with exponential backoff
        Main->>OLM: verifyOLMHealth()
        OLM-->>Main: healthy
        Main->>Che: deployChe(attempt)
        Che->>K8s: chectl server:deploy
        K8s-->>Che: deployment response
        alt Deployment Failed
            Che-->>Main: failure
            Main->>Artifacts: collectCheArtifacts()
            Artifacts->>K8s: gather logs/events/status
            K8s-->>Artifacts: diagnostics
            Main->>Main: classifyCheFailure()
            Main->>Che: cleanupFailedChe()
            Che->>K8s: cleanup resources
        else Deployment Succeeded
            Che-->>Main: success
            Main->>Main: break retry loop
        end
    end

    Main->>Test: runHappyPathTest()
    Test->>Test: validate CHE_REPO_BRANCH
    Test->>Test: execute remote script
    alt Test Failed
        Test-->>Main: failure
        Main->>Artifacts: collectCheArtifacts()
    else Test Passed
        Test-->>Main: success
    end

    Main->>Report: generateReport()
    Report->>Report: categorize failures
    Report->>Report: build Markdown + JSON
    Report-->>Main: reports

    Main->>Main: commentPR(markdown_report)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Whisker-twitching joy, the script now breathes,
With retries, health checks, and diagnostic leaves,
DevWorkspace hops through stages with care,
Che deploys, tests run, reports float through air,
CI pipelines dance with each PR's delight!

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the main changes: improving the che-happy-path test reliability through the addition of retry logic, health checks, and enhanced diagnostics.
Docstring Coverage	✅ Passed	Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch improve-che-happy-path-reliability

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

.ci/oci-devworkspace-happy-path.sh (2)

376-381: Consider using jq for JSON construction to avoid potential escaping issues.

The JSON is built via string concatenation. While the current reason values are controlled strings without special characters, any future modification that introduces quotes, backslashes, or newlines in the reason would produce malformed JSON.

♻️ Optional: Use jq for safer JSON construction

     if [ "$first" = true ]; then
       first=false
     else
       attempts_json="${attempts_json},"
     fi
-    attempts_json="${attempts_json}{\"attempt\":\"${attempt}\",\"stage\":\"${stage}\",\"result\":\"${attempt_result}\",\"reason\":\"${reason}\"}"
+    attempts_json="${attempts_json}$(jq -nc --arg a "$attempt" --arg s "$stage" --arg r "$attempt_result" --arg re "$reason" '{attempt:$a,stage:$s,result:$r,reason:$re}')"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.ci/oci-devworkspace-happy-path.sh around lines 376 - 381, The code builds
JSON by string-concatenating into attempts_json using variables attempt, stage,
attempt_result and reason which can break with special characters; change this
to build each record via jq (e.g., echo '{}' | jq --arg attempt "$attempt" --arg
stage "$stage" --arg result "$attempt_result" --arg reason "$reason" '. +
{attempt:$attempt,stage:$stage,result:$result,reason:$reason}') and then append
the produced safe JSON objects to an array (or use jq to append into
attempts_json), ensuring all fields are properly escaped instead of raw string
concatenation.

179-192: Minor: Cleanup timeout masked by || true.

If kubectl wait --for=delete times out, the || true allows the script to proceed silently. In rare cases where CheCluster deletion takes longer than 120s, the retry attempt may encounter residual resources. The 15s grace period helps, but consider logging a warning if the wait times out.

🔧 Optional: Log warning on cleanup timeout

   # Wait for CheCluster CR to be fully deleted
   echo "Waiting for CheCluster deletion to complete..."
-  kubectl wait --for=delete checluster --all \
+  if ! kubectl wait --for=delete checluster --all \
     -n "$CHE_NAMESPACE" \
-    --timeout=120s 2>&1 || true
+    --timeout=120s 2>&1; then
+    echo "WARNING: CheCluster deletion timed out, proceeding anyway"
+  fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.ci/oci-devworkspace-happy-path.sh around lines 179 - 192, In
cleanupFailedChe(), detect whether the kubectl wait --for=delete checluster
--all ... --timeout=120s command timed out instead of swallowing its failure
with `|| true`: run the kubectl wait and capture its exit status (or use a
conditional on the command), and if it fails (non-zero) call echo or logger to
emit a warning that CheCluster deletion timed out and residual resources may
remain; keep the existing 15s sleep and ensure the message references the
timeout so operators can see the retry risk.

.ci/README-CHE-HAPPY-PATH.md (1)

149-170: Add language specifier to fenced code block.

The fenced code block showing the artifact directory structure should have a language specifier for better rendering and to satisfy markdownlint (MD040).

📝 Proposed fix

-```
+```text
 $ARTIFACT_DIR/
 ├── attempt-log.txt

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.ci/README-CHE-HAPPY-PATH.md around lines 149 - 170, The fenced code block
that shows the artifact directory structure is missing a language specifier
which triggers markdownlint MD040; update the opening fence in the
README-CHE-HAPPY-PATH.md block from ``` to include a language (e.g., ```text) so
the tree output is rendered and linted correctly; locate the directory-tree
fenced block (the block that begins with the three backticks immediately before
the $ARTIFACT_DIR/ listing) and change its opening fence to include the chosen
language specifier.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.ci/oci-devworkspace-happy-path.sh:
- Around line 229-232: The modulo operation using MAX_JITTER can divide by zero;
update the jitter calculation in the retry/backoff block so it only computes
jitter with RANDOM % MAX_JITTER when MAX_JITTER is a positive integer, otherwise
set jitter=0 (or a safe default). Specifically, before computing
"jitter=$((RANDOM % MAX_JITTER))" add a guard that tests MAX_JITTER (non-empty
and >0) and falls back to jitter=0; leave the exponential_delay and final
"delay" computation unchanged. This ensures safe behavior when MAX_JITTER=0 or
unset.

---

Nitpick comments:
In @.ci/oci-devworkspace-happy-path.sh:
- Around line 376-381: The code builds JSON by string-concatenating into
attempts_json using variables attempt, stage, attempt_result and reason which
can break with special characters; change this to build each record via jq
(e.g., echo '{}' | jq --arg attempt "$attempt" --arg stage "$stage" --arg result
"$attempt_result" --arg reason "$reason" '. +
{attempt:$attempt,stage:$stage,result:$result,reason:$reason}') and then append
the produced safe JSON objects to an array (or use jq to append into
attempts_json), ensuring all fields are properly escaped instead of raw string
concatenation.
- Around line 179-192: In cleanupFailedChe(), detect whether the kubectl wait
--for=delete checluster --all ... --timeout=120s command timed out instead of
swallowing its failure with `|| true`: run the kubectl wait and capture its exit
status (or use a conditional on the command), and if it fails (non-zero) call
echo or logger to emit a warning that CheCluster deletion timed out and residual
resources may remain; keep the existing 15s sleep and ensure the message
references the timeout so operators can see the retry risk.

In @.ci/README-CHE-HAPPY-PATH.md:
- Around line 149-170: The fenced code block that shows the artifact directory
structure is missing a language specifier which triggers markdownlint MD040;
update the opening fence in the README-CHE-HAPPY-PATH.md block from ``` to
include a language (e.g., ```text) so the tree output is rendered and linted
correctly; locate the directory-tree fenced block (the block that begins with
the three backticks immediately before the $ARTIFACT_DIR/ listing) and change
its opening fence to include the chosen language specifier.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dceb580b-fb76-48fd-8ff6-55fc3fdc4a8e

📥 Commits

Reviewing files that changed from the base of the PR and between 2d6966e and 2eac19f.

📒 Files selected for processing (2)

.ci/README-CHE-HAPPY-PATH.md
.ci/oci-devworkspace-happy-path.sh

coderabbitai · 2026-03-27T14:15:55Z

.ci/oci-devworkspace-happy-path.sh

+      # Calculate exponential backoff with jitter
+      local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1))))
+      local jitter=$((RANDOM % MAX_JITTER))
+      local delay=$((exponential_delay + jitter))


⚠️ Potential issue | 🟡 Minor

Guard against division by zero if MAX_JITTER=0.

If MAX_JITTER is set to 0 via environment variable, RANDOM % MAX_JITTER will cause a division by zero error, crashing the script unexpectedly.

🛡️ Proposed fix

# Calculate exponential backoff with jitter local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1)))) - local jitter=$((RANDOM % MAX_JITTER)) + local jitter=0 + [ "$MAX_JITTER" -gt 0 ] && jitter=$((RANDOM % MAX_JITTER)) local delay=$((exponential_delay + jitter))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Calculate exponential backoff with jitter

local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1))))

local jitter=$((RANDOM % MAX_JITTER))

local delay=$((exponential_delay + jitter))

# Calculate exponential backoff with jitter

local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1))))

local jitter=0

[ "$MAX_JITTER" -gt 0 ] && jitter=$((RANDOM % MAX_JITTER))

local delay=$((exponential_delay + jitter))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.ci/oci-devworkspace-happy-path.sh around lines 229 - 232, The modulo operation using MAX_JITTER can divide by zero; update the jitter calculation in the retry/backoff block so it only computes jitter with RANDOM % MAX_JITTER when MAX_JITTER is a positive integer, otherwise set jitter=0 (or a safe default). Specifically, before computing "jitter=$((RANDOM % MAX_JITTER))" add a guard that tests MAX_JITTER (non-empty and >0) and falls back to jitter=0; leave the exponential_delay and final "delay" computation unchanged. This ensures safe behavior when MAX_JITTER=0 or unset.

akurinnoy requested review from dkwon17, ibuziuk and rohanKanojia as code owners February 3, 2026 13:26

github-advanced-security bot found potential problems Feb 3, 2026

View reviewed changes

akurinnoy force-pushed the improve-che-happy-path-reliability branch from 528e591 to 299de3a Compare February 3, 2026 13:28

rohanKanojia reviewed Feb 3, 2026

View reviewed changes

.ci/oci-devworkspace-happy-path.sh Show resolved Hide resolved

akurinnoy added 3 commits March 19, 2026 14:28

fixup! Improve Che happy-path test reliability with retry logic and h…

cd5a904

…ealth checks Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

fixup! fixup! Improve Che happy-path test reliability with retry logi…

6d0371f

…c and health checks Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

akurinnoy force-pushed the improve-che-happy-path-reliability branch from 751a1a8 to 6d0371f Compare March 19, 2026 12:29

akurinnoy force-pushed the improve-che-happy-path-reliability branch from 99decd2 to f82470e Compare March 26, 2026 15:18

dkwon17 approved these changes Mar 26, 2026

View reviewed changes

openshift-ci bot assigned dkwon17 Mar 26, 2026

openshift-ci bot added the lgtm label Mar 26, 2026

openshift-ci bot added the approved label Mar 26, 2026

dkwon17 reviewed Mar 27, 2026

View reviewed changes

.ci/oci-devworkspace-happy-path.sh Show resolved Hide resolved

fixup! fixup! fixup! fixup! Improve Che happy-path test reliability w…

2eac19f

…ith retry logic and health checks Add CHE_REPO_BRANCH format validation to prevent injection Assisted-by: Claude Opus 4.6 (1M context) Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

openshift-ci bot removed the lgtm label Mar 27, 2026

coderabbitai bot reviewed Mar 27, 2026

View reviewed changes

dkwon17 merged commit 441bcb2 into main Mar 27, 2026
12 checks passed

dkwon17 deleted the improve-che-happy-path-reliability branch March 27, 2026 15:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: Improve che-happy-path test reliability with retry logic and health checks#1581

ci: Improve che-happy-path test reliability with retry logic and health checks#1581
dkwon17 merged 5 commits intomainfrom
improve-che-happy-path-reliability

akurinnoy commented Feb 3, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

Check warning

Uh oh!

dkwon17 commented Feb 4, 2026

Uh oh!

dkwon17 commented Feb 6, 2026

Uh oh!

codecov bot commented Mar 26, 2026

Uh oh!

openshift-ci bot commented Mar 26, 2026

Uh oh!

Uh oh!

openshift-ci bot commented Mar 27, 2026

Uh oh!

coderabbitai bot commented Mar 27, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

akurinnoy commented Feb 3, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Why is this needed?

Key Improvements

Expected Impact

Testing

Documentation

Summary by CodeRabbit

Uh oh!

Check warning

Uh oh!

dkwon17 commented Feb 4, 2026

Uh oh!

dkwon17 commented Feb 6, 2026

Uh oh!

codecov bot commented Mar 26, 2026

Codecov Report

Uh oh!

openshift-ci bot commented Mar 26, 2026

Uh oh!

Uh oh!

openshift-ci bot commented Mar 27, 2026

Uh oh!

coderabbitai bot commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

akurinnoy commented Feb 3, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 27, 2026 •

edited

Loading