
Commit f82470e

fixup! fixup! fixup! Improve Che happy-path test reliability with retry logic and health checks
- Remove broken verifyCheDeployment() (CheCluster has no condition=Available)
- Fix exit 1 -> return 1 in main()
- Fix cleanup: use kubectl wait --for=delete instead of sleep 10
- Add retry for happy-path test (1 retry with 30s delay)
- Add failure classification, reporting, and PR commenting
- Fix pipe delimiter injection, variable quoting, artifact overwrite
- Update README to match code changes

Assisted-by: Claude Opus 4.6 (1M context)
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
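The script changes themselves are not shown on this page, so the following is only a minimal sketch of three items from the commit message: the single retry with a 30s delay, waiting for deletion instead of sleeping, and `return 1` instead of `exit 1`. The function names, the `RETRY_DELAY` variable, the `eclipse-che` namespace, the `checluster` resource name, and the 120s timeout are all assumptions made for illustration, not names taken from the commit.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; helper names and defaults are hypothetical.

# Run a command; if it fails, wait RETRY_DELAY seconds (default 30)
# and run it exactly one more time. The exit status is that of the last run.
retry_once() {
  local delay="${RETRY_DELAY:-30}"
  if "$@"; then
    return 0
  fi
  echo "WARN: '$*' failed, retrying once in ${delay}s" >&2
  sleep "$delay"
  "$@"
}

# Block until the CheCluster CR is actually gone, rather than sleeping
# for a fixed 10 seconds and hoping cleanup has finished.
# Namespace and timeout defaults are assumptions.
wait_for_checluster_deletion() {
  local namespace="${1:-eclipse-che}" timeout="${2:-120s}"
  kubectl wait --for=delete checluster --all -n "$namespace" --timeout="$timeout"
}

# 'return 1' hands the failure back to the caller instead of terminating
# the whole shell from inside the function, as 'exit 1' would.
run_with_retry() {
  retry_once "$@" || return 1
}
```

With this shape, a call such as `run_with_retry ./happy-path-test.sh` (a placeholder script name) would give the test one extra chance before the failure is reported.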
1 parent 6d0371f commit f82470e

2 files changed

Lines changed: 334 additions & 96 deletions


.ci/README-CHE-HAPPY-PATH.md

Lines changed: 48 additions & 11 deletions
@@ -10,15 +10,14 @@ This script deploys and validates the full DevWorkspace Operator + Eclipse Che s
 ## Features
 
 ### Retry Logic
-- **Max retries**: 2 (3 total attempts)
-- **Exponential backoff**: 60s base delay with 0-15s jitter
-- **Cleanup**: Deletes failed Che deployment before retry
+- **Che deployment**: 2 attempts with exponential backoff (60s base + jitter)
+- **Cleanup**: Waits for CheCluster CR deletion before retry
+- **Happy-path test retry**: 1 retry with 30s delay if Selenium test fails
 
 ### Health Checks
 - **OLM**: Verifies `catalog-operator` and `olm-operator` are available before Che deployment (2-minute timeout each)
 - **DWO**: Waits for `deployment condition=available` (5-minute timeout)
-- **Che**: Waits for `CheCluster condition=Available` (10-minute timeout)
-- **Pods**: Verifies all Che pods are ready
+- **Che**: chectl's built-in readiness checks ensure deployment is healthy
 
 ### Artifact Collection
 On each failure, collects:
@@ -82,14 +81,14 @@ export ARTIFACT_DIR="/tmp/my-test-artifacts"
 
 2. **Deploy Che** (with retry)
 - Runs `chectl server:deploy` with extended timeouts (24h)
-- Waits for CheCluster condition=Available
-- Verifies all pods are ready
+- chectl handles readiness checks internally
 - Collects artifacts on failure
 - Cleans up and retries if needed
 
 3. **Run Happy-Path Test**
 - Downloads test script from Eclipse Che repository
 - Executes Che happy-path workflow
+- Retries once after 30s if test fails
 - Collects artifacts on failure
 
 ## Exit Codes
@@ -102,8 +101,6 @@ export ARTIFACT_DIR="/tmp/my-test-artifacts"
 | Component | Timeout | Purpose |
 |-----------|---------|---------|
 | DWO deployment | 5 minutes | Pod becomes available |
-| CheCluster Available | 10 minutes | Che fully deployed |
-| Che pods ready | 5 minutes | All pods running |
 | chectl pod wait/ready | 24 hours | Generous for slow environments |
 
 ## Common Failures
@@ -123,13 +120,14 @@ export ARTIFACT_DIR="/tmp/my-test-artifacts"
 **Common causes**: Image pull errors, resource constraints, webhook conflicts
 
 ### Che Deployment Timeout
-**Symptoms**: "ERROR: CheCluster did not become available within 10 minutes"
-**Check**: `$ARTIFACT_DIR/che-operator-logs-attempt-*.log`, `$ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml`
+**Symptoms**: "ERROR: chectl server:deploy failed" with timeout-related messages
+**Check**: `$ARTIFACT_DIR/che-operator-logs-attempt-*.log`, `$ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml`, `$ARTIFACT_DIR/chectl-logs-attempt-*/`
 **Common causes**:
 - OLM subscription timeout (check `olm-diagnostics` for subscription state)
 - Database connection issues
 - Image pull failures
 - Operator reconciliation errors
+- chectl timeout waiting for pods/resources to become ready
 
 ### Pod CrashLoopBackOff
 **Symptoms**: "ERROR: chectl server:deploy failed"
@@ -150,6 +148,9 @@ export ARTIFACT_DIR="/tmp/my-test-artifacts"
 After a failed test run:
 ```
 $ARTIFACT_DIR/
+├── attempt-log.txt
+├── failure-report.json
+├── failure-report.md
 ├── devworkspace-controller-info/
 │ ├── <pod-name>-<container>.log
 │ └── events.log
@@ -256,6 +257,42 @@ If OLM subscriptions consistently timeout (visible in `olm-diagnostics-*.yaml`):
 - If no InstallPlan exists, OLM couldn't resolve the subscription
 - If InstallPlan exists but isn't complete, check its status conditions
 
+## CI Failure Reports
+
+The script automatically generates failure reports and posts them as PR comments after each run (both failures and successes with retries). **Do not delete these comments** — they are used to track flakiness patterns across PRs.
+
+### What gets reported
+
+Each report includes a table of all attempts with:
+- **Attempt**: Which attempt number (e.g., `1/2`, `2/2`)
+- **Stage**: Which function failed (`deployChe`, `runHappyPathTest`, etc.)
+- **Result**: `PASSED` or `FAILED`
+- **Reason**: Classified failure reason (e.g., "Che operator reconciliation failure")
+
+### Failure categories
+
+| Category | Meaning | Retryable? |
+|----------|---------|------------|
+| `INFRA` | Infrastructure issue (OLM, image pull, operator reconciliation) | Yes — `/retest` |
+| `TEST` | Test execution issue (Dashboard UI timeout, workspace start) | Maybe |
+| `MIXED` | Both infrastructure and test issues across attempts | Yes — `/retest` |
+| `UNKNOWN` | Could not classify — check artifacts | Investigate |
+
+### Report artifacts
+
+Reports are always saved to `$ARTIFACT_DIR/` regardless of whether PR commenting succeeds:
+- `failure-report.json` — structured data for programmatic analysis
+- `failure-report.md` — human-readable markdown (same as the PR comment)
+- `attempt-log.txt` — raw attempt tracking log
+
+### Why these comments matter
+
+Over time, these reports reveal:
+- Which failure categories are most common
+- Whether flakiness is improving or worsening
+- Which infrastructure components are least reliable
+- Whether retry logic is effective (passed-on-retry patterns)
+
 ## Related Documentation
 
 - [Eclipse Che Documentation](https://eclipse.dev/che/docs/)
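
The failure categories table added to the README (`INFRA`, `TEST`, `MIXED`, `UNKNOWN`) implies a roll-up from per-attempt categories to one overall category. The classification code is not part of this diff, so the sketch below is one plausible reading of the table rather than the script's actual logic; `overall_category` and `retry_advice` are hypothetical names.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical and the
# roll-up rules are inferred from the README table, not from the script.

# Combine the categories of all attempts into one overall category:
# INFRA and TEST both seen -> MIXED; only one seen -> that one; else UNKNOWN.
overall_category() {
  local saw_infra=0 saw_test=0 category
  for category in "$@"; do
    case "$category" in
      INFRA) saw_infra=1 ;;
      TEST)  saw_test=1 ;;
    esac
  done
  if (( saw_infra && saw_test )); then
    echo MIXED
  elif (( saw_infra )); then
    echo INFRA
  elif (( saw_test )); then
    echo TEST
  else
    echo UNKNOWN
  fi
}

# Map an overall category to the "Retryable?" column of the table.
retry_advice() {
  case "$1" in
    INFRA|MIXED) echo "Yes: /retest" ;;
    TEST)        echo "Maybe" ;;
    *)           echo "Investigate" ;;
  esac
}
```

For example, `retry_advice "$(overall_category INFRA TEST)"` would treat an infrastructure failure followed by a test failure as `MIXED` and suggest `/retest`.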
