Commit 6d0371f

fixup! fixup! Improve Che happy-path test reliability with retry logic and health checks
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
1 parent cd5a904 commit 6d0371f

2 files changed

Lines changed: 182 additions & 2 deletions

.ci/README-CHE-HAPPY-PATH.md

Lines changed: 113 additions & 2 deletions
@@ -15,12 +15,15 @@ This script deploys and validates the full DevWorkspace Operator + Eclipse Che s
1515
- **Cleanup**: Deletes failed Che deployment before retry
1616

1717
### Health Checks
18+
- **OLM**: Verifies `catalog-operator` and `olm-operator` are available before Che deployment (2-minute timeout each)
1819
- **DWO**: Waits for `deployment condition=available` (5-minute timeout)
1920
- **Che**: Waits for `CheCluster condition=Available` (10-minute timeout)
2021
- **Pods**: Verifies all Che pods are ready
2122

2223
### Artifact Collection
2324
On each failure, collects:
25+
- OLM diagnostics (Subscription, InstallPlan, CSV, CatalogSource)
26+
- CatalogSource pod logs
2427
- Che operator logs (last 1000 lines)
2528
- CheCluster CR status (full YAML)
2629
- All pod logs from Che namespace
@@ -105,21 +108,43 @@ export ARTIFACT_DIR="/tmp/my-test-artifacts"
105108

106109
## Common Failures
107110

111+
### OLM Infrastructure Not Ready
112+
**Symptoms**: "ERROR: OLM infrastructure is not healthy, cannot proceed with Che deployment"
113+
**Check**: `$ARTIFACT_DIR/olm-diagnostics-attempt-olm-check.yaml` (`collectCheArtifacts` is called with `olm-check` in place of the attempt number)
114+
**Common causes**:
115+
- OLM operators not running (`catalog-operator`, `olm-operator`)
116+
- Cluster provisioning issues during bootstrap
117+
- Resource constraints preventing OLM operator scheduling
118+
**Resolution**: This indicates a fundamental cluster infrastructure issue. Check cluster health and OLM operator logs before retrying.
119+
108120
### DWO Deployment Fails
109121
**Symptoms**: "ERROR: DWO controller is not ready"
110122
**Check**: `$ARTIFACT_DIR/devworkspace-controller-info/`
111123
**Common causes**: Image pull errors, resource constraints, webhook conflicts
112124

113125
### Che Deployment Timeout
114126
**Symptoms**: "ERROR: CheCluster did not become available within 10 minutes"
115-
**Check**: `$ARTIFACT_DIR/che-operator-logs-attempt-*.log`
116-
**Common causes**: Database connection issues, image pull failures, operator reconciliation errors
127+
**Check**: `$ARTIFACT_DIR/che-operator-logs-attempt-*.log`, `$ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml`
128+
**Common causes**:
129+
- OLM subscription timeout (check `olm-diagnostics` for subscription state)
130+
- Database connection issues
131+
- Image pull failures
132+
- Operator reconciliation errors
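
To read the subscription state out of the collected artifacts without opening each file, something like the following may help. It is a minimal sketch: `subscription_states` is a hypothetical helper name, and it assumes the diagnostics files hold `kubectl get subscription -o yaml` output, where the state shows up as an indented `state:` line. Any other key named exactly `state:` in the file would match as well.

```bash
# Hypothetical helper: print the Subscription state recorded in each
# olm-diagnostics artifact. Assumes the files contain `kubectl ... -o yaml`
# output, so the state appears as a line of the form `state: <value>`.
subscription_states() {
  local dir="${1:-${ARTIFACT_DIR:-.}}"
  local file
  for file in "$dir"/olm-diagnostics-attempt-*.yaml; do
    # Skip the unexpanded glob when no artifact files exist.
    [ -e "$file" ] || continue
    # Print "<file>: <state>" for every state field, or a note if none exists.
    awk -v name="$(basename "$file")" \
      '$1 == "state:" { print name ": " $2; found = 1 }
       END { if (!found) print name ": no state recorded" }' "$file"
  done
}
```

Calling `subscription_states "$ARTIFACT_DIR"` prints one line per recorded state. A file with no state at all usually means the Subscription never got a status, which points at OLM rather than at the Che operator.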
117133

118134
### Pod CrashLoopBackOff
119135
**Symptoms**: "ERROR: chectl server:deploy failed"
120136
**Check**: `$ARTIFACT_DIR/eclipse-che-info/` for pod logs
121137
**Common causes**: Configuration errors, resource limits, TLS certificate issues
122138

139+
### OLM Subscription Stuck
140+
**Symptoms**: Subscription timeout after 120 seconds with no resources created
141+
**Check**: `$ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml`, `$ARTIFACT_DIR/catalogsource-logs-attempt-*.log`
142+
**Common causes**:
143+
- CatalogSource pod image not pulled, or pod not running
144+
- InstallPlan not created (subscription cannot resolve dependencies)
145+
- Cluster resource exhaustion preventing operator pod scheduling
146+
**Resolution**: Check the OLM operator logs and the CatalogSource pod status. See the "Advanced Troubleshooting" section for monitoring and alternative deployment options.
147+
123148
## Artifact Locations
124149

125150
After a failed test run:
@@ -135,6 +160,10 @@ $ARTIFACT_DIR/
135160
├── che-operator-logs-attempt-2.log
136161
├── checluster-status-attempt-1.yaml
137162
├── checluster-status-attempt-2.yaml
163+
├── olm-diagnostics-attempt-1.yaml
164+
├── olm-diagnostics-attempt-2.yaml
165+
├── catalogsource-logs-attempt-1.log
166+
├── catalogsource-logs-attempt-2.log
138167
├── chectl-logs-attempt-1/
139168
└── chectl-logs-attempt-2/
140169
```
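
Because every per-attempt artifact carries an `-attempt-N` suffix, the most recent attempt can be found from the file names alone. The helper below is a sketch under that naming assumption; `latest_attempt` is an illustrative name and is not part of the test script.

```bash
# Hypothetical helper: print the highest numeric attempt that left artifacts.
# Non-numeric suffixes (for example an "olm-check" run) are ignored.
latest_attempt() {
  local dir="${1:-${ARTIFACT_DIR:-.}}"
  ls "$dir" 2>/dev/null \
    | sed -n 's/.*-attempt-\([0-9][0-9]*\).*/\1/p' \
    | sort -n \
    | tail -n 1
}
```

For example, `ls -d "$ARTIFACT_DIR"/*-attempt-"$(latest_attempt)"*` lists only the artifacts from the final attempt, which is usually the first place to look.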
@@ -146,9 +175,91 @@ $ARTIFACT_DIR/
146175
- `chectl` - Eclipse Che CLI (v7.114.0+)
147176
- `jq` - JSON processor (for chectl)
148177

178+
## Advanced Troubleshooting
179+
180+
### OLM Infrastructure Issues
181+
182+
If you experience persistent OLM subscription timeouts (see `olm-diagnostics-*.yaml` artifacts):
183+
184+
#### Option 1: OLM Health Check (Implemented)
185+
The script now verifies OLM infrastructure health before deploying Che:
186+
- Checks `catalog-operator` is available
187+
- Checks `olm-operator` is available
188+
- Verifies `openshift-marketplace` is accessible
189+
190+
If OLM is unhealthy, the test fails fast with diagnostic artifacts instead of waiting through timeouts.
191+
192+
#### Option 2: Monitor Subscription Progress (Advanced)
193+
For debugging stuck subscriptions, you can add active monitoring to detect zero-progress scenarios earlier:
194+
195+
```bash
196+
# Example: Monitor subscription state every 10 seconds
197+
elapsed=0
while [ "$elapsed" -lt 300 ]; do
198+
state=$(kubectl get subscription eclipse-che -n eclipse-che \
199+
-o jsonpath='{.status.state}' 2>/dev/null)
200+
echo "[$elapsed/300s] Subscription state: ${state:-unknown}"
201+
if [ "$state" = "AtLatestKnown" ]; then
202+
break
203+
fi
204+
sleep 10
205+
elapsed=$((elapsed + 10))
206+
done
207+
```
208+
209+
This helps identify whether subscriptions are progressing slowly vs. completely stuck.
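
To make the slow-versus-stuck distinction explicit, the polling loop can also count how many consecutive polls returned the same state. The following is a sketch, not the script's implementation: the function name, the `POLL_INTERVAL` variable, and the return codes are assumptions made for illustration.

```bash
# Sketch: poll the Subscription state and give up early when it stops changing.
# Returns 0 once the state is AtLatestKnown, 2 when the state is identical for
# `max_idle` consecutive polls (likely stuck), and 1 on overall timeout.
watch_subscription() {
  local name="$1" namespace="$2" timeout="${3:-300}" max_idle="${4:-6}"
  local interval="${POLL_INTERVAL:-10}"
  local elapsed=0 idle=0 state="" previous="<none>"

  while [ "$elapsed" -lt "$timeout" ]; do
    state=$(kubectl get subscription "$name" -n "$namespace" \
      -o jsonpath='{.status.state}' 2>/dev/null)
    echo "[${elapsed}/${timeout}s] Subscription state: ${state:-unknown}"

    if [ "$state" = "AtLatestKnown" ]; then
      return 0
    fi

    # Count consecutive polls in which the state did not change.
    if [ "$state" = "$previous" ]; then
      idle=$((idle + 1))
    else
      idle=0
    fi
    if [ "$idle" -ge "$max_idle" ]; then
      echo "No change for $idle polls; subscription looks stuck"
      return 2
    fi

    previous="$state"
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

With the defaults, `watch_subscription eclipse-che eclipse-che` reports a stuck subscription after roughly a minute of no change instead of waiting out the full 300 seconds. A state that keeps changing resets the counter, so slow but real progress is not cut short.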
210+
211+
#### Option 3: Skip OLM Installation (Alternative Approach)
212+
For CI environments with persistent OLM issues, consider deploying the Che operator directly instead of via OLM:
213+
214+
```bash
215+
# --installer=operator uses direct YAML deployment
chectl server:deploy \
216+
--installer=operator \
217+
-p openshift \
218+
--batch \
219+
--telemetry=off \
220+
--skip-devworkspace-operator \
221+
--chenamespace="$CHE_NAMESPACE"
222+
```
223+
224+
**Trade-offs**:
225+
- ✅ Bypasses OLM infrastructure entirely
226+
- ✅ More reliable in resource-constrained CI environments
227+
- ❌ Doesn't test OLM integration path (used by production OperatorHub)
228+
- ❌ May miss OLM-specific issues
229+
230+
**When to use**: Temporary workaround for CI infrastructure issues while OLM problems are being resolved.
231+
232+
### Subscription Timeout Issues
233+
234+
If OLM subscriptions consistently time out (visible in `olm-diagnostics-*.yaml`):
235+
236+
1. **Check OLM operator logs**:
237+
```bash
238+
kubectl logs -n openshift-operator-lifecycle-manager \
239+
deployment/catalog-operator --tail=100
240+
kubectl logs -n openshift-operator-lifecycle-manager \
241+
deployment/olm-operator --tail=100
242+
```
243+
244+
2. **Verify CatalogSource pod is running**:
245+
```bash
246+
kubectl get pods -n openshift-marketplace \
247+
-l olm.catalogSource=eclipse-che
248+
kubectl logs -n openshift-marketplace \
249+
-l olm.catalogSource=eclipse-che
250+
```
251+
252+
3. **Check InstallPlan creation**:
253+
```bash
254+
kubectl get installplan -n eclipse-che -o yaml
255+
```
256+
- If no InstallPlan exists, OLM couldn't resolve the subscription
257+
- If InstallPlan exists but isn't complete, check its status conditions
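
The full YAML can be long, so a one-line summary per InstallPlan is often enough for a first look. The sketch below assumes the standard OLM fields `.status.phase` and `.spec.approved`; `installplan_summary` is a hypothetical helper name.

```bash
# Sketch: print name, phase, and approval flag for each InstallPlan.
# An empty result is reported as "no InstallPlan"; note that the same message
# appears if kubectl itself fails (for example, no cluster access).
installplan_summary() {
  local namespace="${1:-eclipse-che}"
  local out
  out=$(kubectl get installplan -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.spec.approved}{"\n"}{end}' \
    2>/dev/null)
  if [ -z "$out" ]; then
    echo "No InstallPlan in $namespace: OLM has not resolved the subscription"
    return 1
  fi
  printf '%s\n' "$out"
}
```

A phase other than `Complete`, or an approval flag of `false`, narrows the problem to that InstallPlan, whose status conditions can then be inspected in the full YAML.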
258+
149259
## Related Documentation
150260

151261
- [Eclipse Che Documentation](https://eclipse.dev/che/docs/)
152262
- [chectl GitHub Repository](https://github.com/che-incubator/chectl)
263+
- [OLM Troubleshooting Guide](https://olm.operatorframework.io/docs/troubleshooting/)
153264
- [DevWorkspace Operator README](../README.md)
154265
- [Contributing Guidelines](../CONTRIBUTING.md)

.ci/oci-devworkspace-happy-path.sh

Lines changed: 69 additions & 0 deletions
@@ -66,6 +66,43 @@ deployDWO() {
6666
return 0
6767
}
6868

69+
# Generated by Claude Sonnet 4.5
70+
verifyOLMHealth() {
71+
echo "======== Verifying OLM Infrastructure ========"
72+
73+
# Check catalog-operator is available
74+
echo "Checking catalog-operator..."
75+
if ! kubectl wait --for=condition=available deployment/catalog-operator \
76+
-n openshift-operator-lifecycle-manager \
77+
--timeout=120s 2>&1; then
78+
echo "ERROR: catalog-operator is not ready"
79+
kubectl get deployment/catalog-operator \
80+
-n openshift-operator-lifecycle-manager -o yaml || true
81+
return 1
82+
fi
83+
84+
# Check olm-operator is available
85+
echo "Checking olm-operator..."
86+
if ! kubectl wait --for=condition=available deployment/olm-operator \
87+
-n openshift-operator-lifecycle-manager \
88+
--timeout=120s 2>&1; then
89+
echo "ERROR: olm-operator is not ready"
90+
kubectl get deployment/olm-operator \
91+
-n openshift-operator-lifecycle-manager -o yaml || true
92+
return 1
93+
fi
94+
95+
# Verify marketplace is accessible
96+
echo "Checking openshift-marketplace..."
97+
if ! kubectl get catalogsources -n openshift-marketplace &>/dev/null; then
98+
echo "ERROR: Cannot access CatalogSources in openshift-marketplace"
99+
return 1
100+
fi
101+
102+
echo "✅ OLM infrastructure is healthy"
103+
return 0
104+
}
105+
69106
deployChe() {
70107
echo "======== Deploying Eclipse Che (attempt $1/$MAX_RETRIES) ========"
71108

@@ -166,6 +203,30 @@ collectCheArtifacts() {
166203
echo "Collecting CheCluster status to $checluster_status"
167204
kubectl get checluster -n "$CHE_NAMESPACE" -o yaml > "$checluster_status" 2>&1 || true
168205

206+
# Collect OLM-specific diagnostics
207+
local olm_diagnostics="${ARTIFACT_DIR}/olm-diagnostics-attempt-${attempt}.yaml"
208+
echo "Collecting OLM diagnostics to $olm_diagnostics"
209+
{
210+
echo "=== Subscription ==="
211+
kubectl get subscription -n "$CHE_NAMESPACE" -o yaml 2>&1 || echo "No subscriptions found"
212+
echo ""
213+
echo "=== InstallPlan ==="
214+
kubectl get installplan -n "$CHE_NAMESPACE" -o yaml 2>&1 || echo "No installplans found"
215+
echo ""
216+
echo "=== ClusterServiceVersion ==="
217+
kubectl get csv -n "$CHE_NAMESPACE" -o yaml 2>&1 || echo "No CSVs found"
218+
echo ""
219+
echo "=== CatalogSource ==="
220+
kubectl get catalogsource -n openshift-marketplace -o yaml 2>&1 || echo "Cannot access catalogsources"
221+
} > "$olm_diagnostics" 2>&1 || true
222+
223+
# Collect CatalogSource pod logs
224+
local catalogsource_logs="${ARTIFACT_DIR}/catalogsource-logs-attempt-${attempt}.log"
225+
echo "Collecting CatalogSource pod logs to $catalogsource_logs"
226+
kubectl logs -n openshift-marketplace \
227+
-l olm.catalogSource=eclipse-che \
228+
--tail=1000 > "$catalogsource_logs" 2>&1 || true
229+
169230
# Collect chectl server logs
170231
echo "Collecting chectl server logs"
171232
chectl server:logs -n "$CHE_NAMESPACE" -d "${ARTIFACT_DIR}/chectl-logs-attempt-${attempt}" 2>&1 || true
@@ -186,6 +247,14 @@ cleanupFailedChe() {
186247
deployAndVerifyChe() {
187248
local attempt
188249

250+
# Verify OLM infrastructure health before attempting Che deployment
251+
if ! verifyOLMHealth; then
252+
echo "❌ ERROR: OLM infrastructure is not healthy, cannot proceed with Che deployment"
253+
echo "Collecting OLM diagnostics..."
254+
collectCheArtifacts "olm-check"
255+
return 1
256+
fi
257+
189258
for attempt in $(seq 1 $MAX_RETRIES); do
190259
echo ""
191260
echo "========================================"
