diff --git a/.ci/README-CHE-HAPPY-PATH.md b/.ci/README-CHE-HAPPY-PATH.md new file mode 100644 index 000000000..2f698a976 --- /dev/null +++ b/.ci/README-CHE-HAPPY-PATH.md @@ -0,0 +1,302 @@ +# Che Happy-Path Test + +**Script**: `.ci/oci-devworkspace-happy-path.sh` +**Purpose**: Integration test validating DevWorkspace Operator with Eclipse Che deployment + +## Overview + +This script deploys and validates the full DevWorkspace Operator + Eclipse Che stack on OpenShift, ensuring the happy-path user workflow succeeds. It's used in the `v14-che-happy-path` Prow CI test. + +## Features + +### Retry Logic +- **Che deployment**: 2 attempts with exponential backoff (60s base + jitter) +- **Cleanup**: Waits for CheCluster CR deletion before retry +- **Happy-path test retry**: 1 retry with 30s delay if Selenium test fails + +### Health Checks +- **OLM**: Verifies `catalog-operator` and `olm-operator` are available before Che deployment (2-minute timeout each) +- **DWO**: Waits for `deployment condition=available` (5-minute timeout) +- **Che**: chectl's built-in readiness checks ensure deployment is healthy + +### Artifact Collection +On each failure, collects: +- OLM diagnostics (Subscription, InstallPlan, CSV, CatalogSource) +- CatalogSource pod logs +- Che operator logs (last 1000 lines) +- CheCluster CR status (full YAML) +- All pod logs from Che namespace +- Kubernetes events +- chectl server logs + +### Error Handling +- Graceful error handling with stage-specific messages +- Progress indicators: "Attempt 1/2", "Retrying in 71s..." 
+- No crash on failures
+
+## Configuration
+
+Environment variables (all optional except `DEVWORKSPACE_OPERATOR`):
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `CHE_NAMESPACE` | `eclipse-che` | Namespace for Che deployment |
+| `MAX_RETRIES` | `2` | Maximum retry attempts |
+| `BASE_DELAY` | `60` | Base delay in seconds for exponential backoff |
+| `MAX_JITTER` | `15` | Maximum jitter in seconds |
+| `ARTIFACT_DIR` | `/tmp/dwo-e2e-artifacts` | Directory for diagnostic artifacts |
+| `DEVWORKSPACE_OPERATOR` | (required) | DWO image to deploy |
+
+## Usage
+
+### In Prow CI
+
+The script is called automatically by the `v14-che-happy-path` Prow job. Prow sets `DEVWORKSPACE_OPERATOR` based on the context:
+
+**For PR checks** (testing PR code):
+```bash
+export DEVWORKSPACE_OPERATOR="quay.io/devfile/devworkspace-controller:pr-${PR_NUMBER}-${COMMIT_SHA}"
+./.ci/oci-devworkspace-happy-path.sh
+```
+
+**For periodic/nightly runs** (testing main branch):
+```bash
+export DEVWORKSPACE_OPERATOR="quay.io/devfile/devworkspace-controller:next"
+./.ci/oci-devworkspace-happy-path.sh
+```
+
+### Local Testing
+```bash
+export DEVWORKSPACE_OPERATOR="quay.io/youruser/devworkspace-controller:your-tag"
+export ARTIFACT_DIR="/tmp/my-test-artifacts"
+./.ci/oci-devworkspace-happy-path.sh
+```
+
+## Test Flow
+
+1. **Deploy DWO**
+   - Runs `make install`
+   - Waits for controller deployment to be available
+   - Collects artifacts if deployment fails
+
+2. **Deploy Che** (with retry)
+   - Runs `chectl server:deploy` with extended timeouts (24h)
+   - chectl handles readiness checks internally
+   - Collects artifacts on failure
+   - Cleans up and retries if needed
+
+3.
**Run Happy-Path Test** + - Downloads test script from Eclipse Che repository + - Executes Che happy-path workflow + - Retries once after 30s if test fails + - Collects artifacts on failure + +## Exit Codes + +- `0`: Success - All stages completed +- `1`: Failure - Check `$ARTIFACT_DIR` for diagnostics + +## Timeouts + +| Component | Timeout | Purpose | +|-----------|---------|---------| +| DWO deployment | 5 minutes | Pod becomes available | +| chectl pod wait/ready | 24 hours | Generous for slow environments | + +## Common Failures + +### OLM Infrastructure Not Ready +**Symptoms**: "ERROR: OLM infrastructure is not healthy, cannot proceed with Che deployment" +**Check**: `$ARTIFACT_DIR/olm-diagnostics-olm-check.yaml` +**Common causes**: +- OLM operators not running (`catalog-operator`, `olm-operator`) +- Cluster provisioning issues during bootstrap +- Resource constraints preventing OLM operator scheduling +**Resolution**: This indicates a fundamental cluster infrastructure issue. Check cluster health and OLM operator logs before retrying. 
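The resolution steps above can be scripted as a small standalone probe. A minimal sketch (the deployment names and namespace are the standard OpenShift OLM ones; the `KUBECTL` variable and `check_olm_health` function are illustrative, not part of the CI script):

```shell
# Minimal OLM health probe mirroring the script's pre-deploy gate.
# KUBECTL is parameterized only so the probe can be dry-run or stubbed.
KUBECTL="${KUBECTL:-kubectl}"

check_olm_health() {
  local d
  for d in catalog-operator olm-operator; do
    if ! "$KUBECTL" wait --for=condition=available "deployment/$d" \
        -n openshift-operator-lifecycle-manager --timeout=120s; then
      echo "OLM deployment $d is not available" >&2
      return 1
    fi
  done
  echo "OLM control plane is healthy"
}
```

If either deployment never becomes `Available`, retrying the Che test is pointless until the cluster itself is fixed.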
+ +### DWO Deployment Fails +**Symptoms**: "ERROR: DWO controller is not ready" +**Check**: `$ARTIFACT_DIR/devworkspace-controller-info/` +**Common causes**: Image pull errors, resource constraints, webhook conflicts + +### Che Deployment Timeout +**Symptoms**: "ERROR: chectl server:deploy failed" with timeout-related messages +**Check**: `$ARTIFACT_DIR/che-operator-logs-attempt-*.log`, `$ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml`, `$ARTIFACT_DIR/chectl-logs-attempt-*/` +**Common causes**: +- OLM subscription timeout (check `olm-diagnostics` for subscription state) +- Database connection issues +- Image pull failures +- Operator reconciliation errors +- chectl timeout waiting for pods/resources to become ready + +### Pod CrashLoopBackOff +**Symptoms**: "ERROR: chectl server:deploy failed" +**Check**: `$ARTIFACT_DIR/eclipse-che-info/` for pod logs +**Common causes**: Configuration errors, resource limits, TLS certificate issues + +### OLM Subscription Stuck +**Symptoms**: Subscription timeout after 120 seconds with no resources created +**Check**: `$ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml`, `$ARTIFACT_DIR/catalogsource-logs-attempt-*.log` +**Common causes**: +- CatalogSource pod not pulling/running +- InstallPlan not created (subscription cannot resolve dependencies) +- Cluster resource exhaustion preventing operator pod scheduling +**Resolution**: Check OLM operator logs and CatalogSource pod status. See "Advanced Troubleshooting" section for monitoring and alternative deployment options. 
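Because the `olm-diagnostics-attempt-*.yaml` artifact is a plain-text dump, a first-pass triage can run entirely offline. A sketch (the `triage_olm_dump` helper is hypothetical; the `state:` values are standard OLM `Subscription.status.state` strings):

```shell
# Offline triage of an olm-diagnostics dump collected by the CI script.
triage_olm_dump() {
  local dump="$1"
  if grep -q "state: AtLatestKnown" "$dump"; then
    echo "subscription resolved - look elsewhere"
  elif grep -q "state: UpgradePending" "$dump"; then
    echo "installplan pending - check installplan conditions"
  elif grep -q "No subscriptions found" "$dump"; then
    echo "subscription never created - check chectl/OLM logs"
  else
    echo "unknown - inspect dump manually"
  fi
}
```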
+
+## Artifact Locations
+
+After a failed test run:
+```
+$ARTIFACT_DIR/
+├── attempt-log.txt
+├── failure-report.json
+├── failure-report.md
+├── devworkspace-controller-info/
+│   ├── <pod-name>-<container-name>.log
+│   └── events.log
+├── eclipse-che-info/
+│   ├── <pod-name>-<container-name>.log
+│   └── events.log
+├── che-operator-logs-attempt-1.log
+├── che-operator-logs-attempt-2.log
+├── checluster-status-attempt-1.yaml
+├── checluster-status-attempt-2.yaml
+├── olm-diagnostics-attempt-1.yaml
+├── olm-diagnostics-attempt-2.yaml
+├── catalogsource-logs-attempt-1.log
+├── catalogsource-logs-attempt-2.log
+├── chectl-logs-attempt-1/
+└── chectl-logs-attempt-2/
+```
+
+## Dependencies
+
+- `kubectl` - Kubernetes CLI
+- `oc` - OpenShift CLI (for log collection)
+- `chectl` - Eclipse Che CLI (v7.114.0+)
+- `jq` - JSON processor (for chectl)
+
+## Advanced Troubleshooting
+
+### OLM Infrastructure Issues
+
+If you experience persistent OLM subscription timeouts (see `olm-diagnostics-*.yaml` artifacts):
+
+#### Option 1: OLM Health Check (Implemented)
+The script verifies OLM infrastructure health before deploying Che:
+- Checks `catalog-operator` is available
+- Checks `olm-operator` is available
+- Verifies `openshift-marketplace` is accessible
+
+If OLM is unhealthy, the test fails fast with diagnostic artifacts instead of waiting through timeouts.
+
+#### Option 2: Monitor Subscription Progress (Advanced)
+For debugging stuck subscriptions, you can add active monitoring to detect zero-progress scenarios earlier:
+
+```bash
+# Example: Monitor subscription state every 10 seconds
+elapsed=0
+while [ $elapsed -lt 300 ]; do
+  state=$(kubectl get subscription eclipse-che -n eclipse-che \
+    -o jsonpath='{.status.state}' 2>/dev/null)
+  echo "[$elapsed/300s] Subscription state: ${state:-unknown}"
+  if [ "$state" = "AtLatestKnown" ]; then
+    break
+  fi
+  sleep 10
+  elapsed=$((elapsed + 10))
+done
+```
+
+This helps identify whether subscriptions are progressing slowly vs. completely stuck.
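To tell "progressing slowly" apart from "completely stuck" more decisively, the same polling idea can watch for InstallPlan creation, since a subscription that never produces an InstallPlan has made zero progress. A sketch (the `watch_olm_progress` helper is illustrative, not part of the CI script):

```shell
# Poll for InstallPlan creation to detect a zero-progress subscription.
# Returns 0 as soon as any InstallPlan exists, 1 if none appears in time.
watch_olm_progress() {
  local ns="${1:-eclipse-che}" timeout="${2:-300}" interval=10 elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if [ -n "$(kubectl get installplan -n "$ns" -o name 2>/dev/null)" ]; then
      echo "InstallPlan created after ${elapsed}s - subscription is progressing"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "No InstallPlan after ${timeout}s - subscription appears stuck"
  return 1
}
```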
+
+#### Option 3: Skip OLM Installation (Alternative Approach)
+For CI environments with persistent OLM issues, consider deploying the Che operator directly instead of via OLM:
+
+```bash
+# --installer=operator uses direct YAML deployment instead of OLM
+chectl server:deploy \
+  --installer=operator \
+  -p openshift \
+  --batch \
+  --telemetry=off \
+  --skip-devworkspace-operator \
+  --chenamespace="$CHE_NAMESPACE"
+```
+
+**Trade-offs**:
+- ✅ Bypasses OLM infrastructure entirely
+- ✅ More reliable in resource-constrained CI environments
+- ❌ Doesn't test OLM integration path (used by production OperatorHub)
+- ❌ May miss OLM-specific issues
+
+**When to use**: Temporary workaround for CI infrastructure issues while OLM problems are being resolved.
+
+### Subscription Timeout Issues
+
+If OLM subscriptions consistently time out (visible in `olm-diagnostics-*.yaml`):
+
+1. **Check OLM operator logs**:
+   ```bash
+   kubectl logs -n openshift-operator-lifecycle-manager \
+     deployment/catalog-operator --tail=100
+   kubectl logs -n openshift-operator-lifecycle-manager \
+     deployment/olm-operator --tail=100
+   ```
+
+2. **Verify the CatalogSource pod is running**:
+   ```bash
+   kubectl get pods -n openshift-marketplace \
+     -l olm.catalogSource=eclipse-che
+   kubectl logs -n openshift-marketplace \
+     -l olm.catalogSource=eclipse-che
+   ```
+
+3. **Check InstallPlan creation**:
+   ```bash
+   kubectl get installplan -n eclipse-che -o yaml
+   ```
+   - If no InstallPlan exists, OLM couldn't resolve the subscription
+   - If an InstallPlan exists but isn't complete, check its status conditions
+
+## CI Failure Reports
+
+The script automatically generates failure reports and posts them as PR comments after each run (both failures and successes with retries). **Do not delete these comments** — they are used to track flakiness patterns across PRs.
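Each attempt is also recorded in a TAB-delimited `attempt-log.txt` (fields: attempt, stage, result, reason — see `recordAttempt` in the script), so historical runs can be summarized offline. A sketch (the `summarize_attempts` helper is hypothetical):

```shell
# Summarize FAILED attempts per stage from an attempt-log.txt
# (TAB-delimited records: attempt, stage, result, reason).
summarize_attempts() {
  awk -F'\t' '$3 == "FAILED" { n[$2]++ }
              END { for (s in n) printf "%s: %d failure(s)\n", s, n[s] }' "$1"
}
```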
+ +### What gets reported + +Each report includes a table of all attempts with: +- **Attempt**: Which attempt number (e.g., `1/2`, `2/2`) +- **Stage**: Which function failed (`deployChe`, `runHappyPathTest`, etc.) +- **Result**: `PASSED` or `FAILED` +- **Reason**: Classified failure reason (e.g., "Che operator reconciliation failure") + +### Failure categories + +| Category | Meaning | Retryable? | +|----------|---------|------------| +| `INFRA` | Infrastructure issue (OLM, image pull, operator reconciliation) | Yes — `/retest` | +| `TEST` | Test execution issue (Dashboard UI timeout, workspace start) | Maybe | +| `MIXED` | Both infrastructure and test issues across attempts | Yes — `/retest` | +| `UNKNOWN` | Could not classify — check artifacts | Investigate | + +### Report artifacts + +Reports are always saved to `$ARTIFACT_DIR/` regardless of whether PR commenting succeeds: +- `failure-report.json` — structured data for programmatic analysis +- `failure-report.md` — human-readable markdown (same as the PR comment) +- `attempt-log.txt` — raw attempt tracking log + +### Why these comments matter + +Over time, these reports reveal: +- Which failure categories are most common +- Whether flakiness is improving or worsening +- Which infrastructure components are least reliable +- Whether retry logic is effective (passed-on-retry patterns) + +## Related Documentation + +- [Eclipse Che Documentation](https://eclipse.dev/che/docs/) +- [chectl GitHub Repository](https://github.com/che-incubator/chectl) +- [OLM Troubleshooting Guide](https://olm.operatorframework.io/docs/troubleshooting/) +- [DevWorkspace Operator README](../README.md) +- [Contributing Guidelines](../CONTRIBUTING.md) diff --git a/.ci/oci-devworkspace-happy-path.sh b/.ci/oci-devworkspace-happy-path.sh index 7b2167534..b3f5113a0 100755 --- a/.ci/oci-devworkspace-happy-path.sh +++ b/.ci/oci-devworkspace-happy-path.sh @@ -1,6 +1,6 @@ #!/bin/bash # -# Copyright (c) 2019-2025 Red Hat, Inc. 
+# Copyright (c) 2019-2026 Red Hat, Inc.
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
@@ -14,10 +14,6 @@
 # limitations under the License.
 #
-
-#!/usr/bin/env bash
-# exit immediately when a command fails
-set -e
 # only exit with zero if all commands of the pipeline exit successfully
 set -o pipefail
 # error on unset variables
@@ -25,29 +21,533 @@
 set -u
 # print each command before executing it
 set -x
 
+# Source common utilities
+source "$(dirname "$0")/common.sh"
+
 # ENV used by PROW ci
 export CI="openshift"
 # Pod created by openshift ci don't have user. Using this envs should avoid errors with git user.
 export GIT_COMMITTER_NAME="CI BOT"
 export GIT_COMMITTER_EMAIL="ci_bot@notused.com"
 
+# Che configuration (overridable; defaults match the README)
+export CHE_NAMESPACE="${CHE_NAMESPACE:-eclipse-che}"
+export MAX_RETRIES="${MAX_RETRIES:-2}"
+export BASE_DELAY="${BASE_DELAY:-60}"
+export MAX_JITTER="${MAX_JITTER:-15}"
+
+# Artifact directory for logs
+export ARTIFACT_DIR="${ARTIFACT_DIR:-/tmp/dwo-e2e-artifacts}"
+mkdir -p "${ARTIFACT_DIR}"
+
+# Failure tracking
+ATTEMPT_LOG="${ARTIFACT_DIR}/attempt-log.txt"
+: > "$ATTEMPT_LOG"
+
 deployDWO() {
+  echo "======== Deploying DevWorkspace Operator ========"
   export NAMESPACE="devworkspace-controller"
   export DWO_IMG="${DEVWORKSPACE_OPERATOR}"
-  make install
+
+  if ! make install; then
+    echo "ERROR: Failed to deploy DevWorkspace Operator"
+    bumpPodsInfo "$NAMESPACE"
+    return 1
+  fi
+
+  echo "======== Verifying DWO deployment ========"
+  # Wait for DWO controller to be ready
+  if !
kubectl wait --for=condition=available deployment/devworkspace-controller-manager \ + -n "$NAMESPACE" \ + --timeout=300s; then + echo "ERROR: DWO controller is not ready" + bumpPodsInfo "$NAMESPACE" + return 1 + fi + + echo "✅ DevWorkspace Operator deployed successfully" + return 0 +} + +# Generated by Claude Sonnet 4.5 +verifyOLMHealth() { + echo "======== Verifying OLM Infrastructure ========" + + # Check catalog-operator is available + echo "Checking catalog-operator..." + if ! kubectl wait --for=condition=available deployment/catalog-operator \ + -n openshift-operator-lifecycle-manager \ + --timeout=120s 2>&1; then + echo "ERROR: catalog-operator is not ready" + kubectl get deployment/catalog-operator \ + -n openshift-operator-lifecycle-manager -o yaml || true + return 1 + fi + + # Check olm-operator is available + echo "Checking olm-operator..." + if ! kubectl wait --for=condition=available deployment/olm-operator \ + -n openshift-operator-lifecycle-manager \ + --timeout=120s 2>&1; then + echo "ERROR: olm-operator is not ready" + kubectl get deployment/olm-operator \ + -n openshift-operator-lifecycle-manager -o yaml || true + return 1 + fi + + # Verify marketplace is accessible + echo "Checking openshift-marketplace..." + if ! kubectl get catalogsources -n openshift-marketplace &>/dev/null; then + echo "ERROR: Cannot access CatalogSources in openshift-marketplace" + return 1 + fi + + echo "✅ OLM infrastructure is healthy" + return 0 } deployChe() { - chectl server:deploy \ + echo "======== Deploying Eclipse Che (attempt $1/$MAX_RETRIES) ========" + + if ! 
chectl server:deploy \
     -p openshift \
     --batch \
     --telemetry=off \
     --skip-devworkspace-operator \
-    --k8spodwaittimeout=6000000 \
-    --k8spodreadytimeout=6000000
+    --chenamespace="$CHE_NAMESPACE" \
+    --k8spodwaittimeout=86400000 \
+    --k8spodreadytimeout=86400000; then
+    echo "ERROR: chectl server:deploy failed"
+    return 1
+  fi
+
+  echo "✅ chectl server:deploy completed"
+  return 0
+}
+
+# Generated by Claude Sonnet 4.5
+collectCheArtifacts() {
+  local attempt=$1
+  echo "======== Collecting Che artifacts (attempt $attempt) ========"
+
+  # Collect pod info from Che namespace
+  bumpPodsInfo "$CHE_NAMESPACE" || true
+
+  # Collect Che operator logs
+  local che_operator_logs="${ARTIFACT_DIR}/che-operator-logs-attempt-${attempt}.log"
+  echo "Collecting Che operator logs to $che_operator_logs"
+  kubectl logs -n "$CHE_NAMESPACE" \
+    -l app.kubernetes.io/component=che-operator \
+    --tail=1000 > "$che_operator_logs" 2>&1 || true
+
+  # Collect CheCluster CR status
+  local checluster_status="${ARTIFACT_DIR}/checluster-status-attempt-${attempt}.yaml"
+  echo "Collecting CheCluster status to $checluster_status"
+  kubectl get checluster -n "$CHE_NAMESPACE" -o yaml > "$checluster_status" 2>&1 || true
+
+  # Collect OLM-specific diagnostics
+  local olm_diagnostics="${ARTIFACT_DIR}/olm-diagnostics-attempt-${attempt}.yaml"
+  echo "Collecting OLM diagnostics to $olm_diagnostics"
+  {
+    echo "=== Subscription ==="
+    kubectl get subscription -n "$CHE_NAMESPACE" -o yaml 2>&1 || echo "No subscriptions found"
+    echo ""
+    echo "=== InstallPlan ==="
+    kubectl get installplan -n "$CHE_NAMESPACE" -o yaml 2>&1 || echo "No installplans found"
+    echo ""
+    echo "=== ClusterServiceVersion ==="
+    kubectl get csv -n "$CHE_NAMESPACE" -o yaml 2>&1 || echo "No CSVs found"
+    echo ""
+    echo "=== CatalogSource ==="
+    kubectl get catalogsource -n openshift-marketplace -o yaml 2>&1 || echo "Cannot access catalogsources"
+  } > "$olm_diagnostics" 2>&1 || true
+
+  # Collect CatalogSource pod logs
+  local
catalogsource_logs="${ARTIFACT_DIR}/catalogsource-logs-attempt-${attempt}.log" + echo "Collecting CatalogSource pod logs to $catalogsource_logs" + kubectl logs -n openshift-marketplace \ + -l olm.catalogSource=eclipse-che \ + --tail=1000 > "$catalogsource_logs" 2>&1 || true + + # Collect chectl server logs + echo "Collecting chectl server logs" + chectl server:logs -n "$CHE_NAMESPACE" -d "${ARTIFACT_DIR}/chectl-logs-attempt-${attempt}" 2>&1 || true + + echo "Artifact collection completed" +} + +# Generated by Claude Sonnet 4.5 +cleanupFailedChe() { + echo "======== Cleaning up failed Che deployment ========" + chectl server:delete -n "$CHE_NAMESPACE" --yes 2>&1 || true + + # Wait for CheCluster CR to be fully deleted + echo "Waiting for CheCluster deletion to complete..." + kubectl wait --for=delete checluster --all \ + -n "$CHE_NAMESPACE" \ + --timeout=120s 2>&1 || true + + # Grace period for remaining finalizers + sleep 15 +} + +# Generated by Claude Sonnet 4.5 +deployAndVerifyChe() { + local attempt + + # Verify OLM infrastructure health before attempting Che deployment + if ! verifyOLMHealth; then + echo "❌ ERROR: OLM infrastructure is not healthy, cannot proceed with Che deployment" + echo "Collecting OLM diagnostics..." 
+ collectCheArtifacts "olm-check" + recordAttempt "0/0" "verifyOLMHealth" "FAILED" "OLM infrastructure not ready" + return 1 + fi + + for attempt in $(seq 1 "$MAX_RETRIES"); do + echo "" + echo "========================================" + echo "Che Deployment Attempt $attempt/$MAX_RETRIES" + echo "========================================" + + # Try to deploy Che + if deployChe "$attempt"; then + echo "✅ Eclipse Che deployed successfully on attempt $attempt" + recordAttempt "$attempt/$MAX_RETRIES" "deployChe" "PASSED" "-" + return 0 + fi + + # Deployment failed — collect artifacts and classify + echo "❌ Che deployment failed on attempt $attempt" + collectCheArtifacts "$attempt" + local reason + reason=$(classifyCheFailure "$attempt") + recordAttempt "$attempt/$MAX_RETRIES" "deployChe" "FAILED" "$reason" + + # If not the last attempt, clean up and retry + if [ "$attempt" -lt "$MAX_RETRIES" ]; then + # Calculate exponential backoff with jitter + local exponential_delay=$((BASE_DELAY * (2 ** (attempt - 1)))) + local jitter=$((RANDOM % MAX_JITTER)) + local delay=$((exponential_delay + jitter)) + + echo "Cleaning up failed deployment..." + cleanupFailedChe + + echo "Retrying in ${delay} seconds..." + sleep "$delay" + fi + done + + echo "❌ ERROR: Che deployment failed after $MAX_RETRIES attempts" + return 1 +} + +# Generated by Claude Sonnet 4.5 +runHappyPathTest() { + local attempt="${1:-1}" + echo "======== Running Che Happy Path Test (attempt $attempt) ========" + export CHE_REPO_BRANCH="${CHE_REPO_BRANCH:-main}" + if ! [[ "$CHE_REPO_BRANCH" =~ ^[a-zA-Z0-9._/-]+$ ]]; then + echo "ERROR: Invalid CHE_REPO_BRANCH format: $CHE_REPO_BRANCH. Alphanumeric, dots, hyphens and slashes only." + return 1 + fi + + # Download and run the remote test script + if ! 
bash <(curl -s "https://raw.githubusercontent.com/eclipse/che/${CHE_REPO_BRANCH}/tests/devworkspace-happy-path/remote-launch.sh"); then + echo "ERROR: Happy path test failed" + + # Collect artifacts on test failure + echo "Collecting artifacts after test failure..." + collectCheArtifacts "test-attempt-${attempt}" + + return 1 + fi + + echo "✅ Happy path test completed successfully" + return 0 +} + +# Record an attempt result to the tracking log +# Format: TAB-delimited to avoid conflicts with reason text +# Usage: recordAttempt +recordAttempt() { + printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" >> "$ATTEMPT_LOG" } -deployDWO -deployChe -export CHE_REPO_BRANCH="main" -bash <(curl -s https://raw.githubusercontent.com/eclipse/che/${CHE_REPO_BRANCH}/tests/devworkspace-happy-path/remote-launch.sh) +# Classify a Che deployment failure by analyzing collected artifacts +# Returns the reason string via stdout +classifyCheFailure() { + local attempt=$1 + local checluster_file="${ARTIFACT_DIR}/checluster-status-attempt-${attempt}.yaml" + local olm_file="${ARTIFACT_DIR}/olm-diagnostics-attempt-${attempt}.yaml" + local che_logs="${ARTIFACT_DIR}/che-operator-logs-attempt-${attempt}.log" + + # Check CheCluster for reconciliation failure + if [ -f "$checluster_file" ] && grep -q "InstallOrUpdateFailed" "$checluster_file" 2>/dev/null; then + echo "Che operator reconciliation failure (InstallOrUpdateFailed)" + return + fi + + # Check for image pull errors in pod events + local events_file="${ARTIFACT_DIR}/${CHE_NAMESPACE}-info/events.log" + if [ -f "$events_file" ] && grep -qi "ImagePullBackOff\|ErrImagePull\|Failed to pull image" "$events_file" 2>/dev/null; then + echo "Image pull failure in ${CHE_NAMESPACE} namespace" + return + fi + + # Check for OLM subscription issues + if [ -f "$olm_file" ] && grep -q "state: AtLatestKnown" "$olm_file" 2>/dev/null; then + : # Subscription resolved, issue is elsewhere + elif [ -f "$olm_file" ] && grep -q "No subscriptions found" "$olm_file" 
2>/dev/null; then + echo "OLM subscription not created" + return + fi + + # Check operator logs for specific errors + if [ -f "$che_logs" ] && grep -qi "timeout\|deadline exceeded" "$che_logs" 2>/dev/null; then + echo "Che operator timeout during reconciliation" + return + fi + + echo "Unknown - check artifacts for details" +} + +# Classify a happy-path test failure +classifyTestFailure() { + local events_file="${ARTIFACT_DIR}/${CHE_NAMESPACE}-info/events.log" + local pods_dir="${ARTIFACT_DIR}/${CHE_NAMESPACE}-info" + + # Check for dashboard loader timeout (most common) + if [ -d "$pods_dir" ] && grep -rqi "main-page-loader.*still visible\|TimeoutError.*loader" "$pods_dir" 2>/dev/null; then + echo "Dashboard UI loader timeout" + return + fi + + # Check for workspace start timeout + if [ -d "$pods_dir" ] && grep -rqi "workspace.*timeout\|TS_SELENIUM_START_WORKSPACE_TIMEOUT" "$pods_dir" 2>/dev/null; then + echo "Workspace start timeout" + return + fi + + # Check for authentication failure + if [ -d "$pods_dir" ] && grep -rqi "login.*failed\|OAuth.*error\|authentication" "$pods_dir" 2>/dev/null; then + echo "Authentication/OAuth failure" + return + fi + + echo "Test execution failure - check test pod logs" +} + +# Determine failure category from reason string +categorizeFailure() { + local reason="$1" + case "$reason" in + *"reconciliation"*|*"Image pull"*|*"OLM"*|*"operator timeout"*) + echo "INFRA" + ;; + *"Dashboard"*|*"Workspace start"*|*"Authentication"*|*"Test execution"*) + echo "TEST" + ;; + *) + echo "UNKNOWN" + ;; + esac +} + +# Generate the failure report as markdown and JSON +generateReport() { + local result="$1" # PASSED or FAILED + local report_md="${ARTIFACT_DIR}/failure-report.md" + local report_json="${ARTIFACT_DIR}/failure-report.json" + + # Build the attempts table from the log + local attempts_md="" + local attempts_json="[" + local first=true + local has_infra=false + local has_test=false + + while IFS=$'\t' read -r attempt stage attempt_result 
reason; do + [ -z "$attempt" ] && continue + attempts_md="${attempts_md}| ${attempt} | ${stage} | ${attempt_result} | ${reason} |\n" + + if [ "$first" = true ]; then + first=false + else + attempts_json="${attempts_json}," + fi + attempts_json="${attempts_json}{\"attempt\":\"${attempt}\",\"stage\":\"${stage}\",\"result\":\"${attempt_result}\",\"reason\":\"${reason}\"}" + + local failure_cat + failure_cat=$(categorizeFailure "$reason") + [ "$failure_cat" = "INFRA" ] && has_infra=true + [ "$failure_cat" = "TEST" ] && has_test=true + done < "$ATTEMPT_LOG" + attempts_json="${attempts_json}]" + + # Determine overall category + local category="UNKNOWN" + if [ "$has_infra" = true ] && [ "$has_test" = true ]; then + category="MIXED" + elif [ "$has_infra" = true ]; then + category="INFRA" + elif [ "$has_test" = true ]; then + category="TEST" + fi + + # Determine retryability + local retryable="false" + [ "$category" = "INFRA" ] || [ "$category" = "MIXED" ] && retryable="true" + + # Build Prow job link if available + local prow_link="" + if [ -n "${PROW_JOB_ID:-}" ]; then + prow_link="https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/devfile_devworkspace-operator/${PULL_NUMBER:-unknown}/pull-ci-devfile-devworkspace-operator-main-v14-che-happy-path/${BUILD_ID:-unknown}" + fi + + # Write JSON report + cat > "$report_json" </dev/null; then + echo "**Note:** Passed on retry — transient issue" + fi + fi + + if [ -n "$prow_link" ]; then + echo "" + echo "[View logs and artifacts](${prow_link})" + fi + + echo "" + echo "---" + echo "📊 Automated CI failure report — Do not delete. These comments track" + echo "flakiness patterns across PRs. 
Learn more" + } > "$report_md" + + echo "Report written to ${report_md} and ${report_json}" +} + +# Post the failure report as a PR comment via GitHub API +commentPR() { + local report_md="${ARTIFACT_DIR}/failure-report.md" + + # Only comment on PRs (not periodic jobs) + if [ -z "${PULL_NUMBER:-}" ]; then + echo "Not a PR job (PULL_NUMBER not set), skipping PR comment" + return 0 + fi + + if [ ! -f "$report_md" ]; then + echo "No report file found, skipping PR comment" + return 0 + fi + + local repo="${REPO_OWNER:-devfile}/${REPO_NAME:-devworkspace-operator}" + local body + body=$(cat "$report_md") + + # Try gh CLI first, fall back to curl + if command -v gh &>/dev/null; then + echo "Posting failure report to PR #${PULL_NUMBER} via gh CLI..." + gh pr comment "${PULL_NUMBER}" \ + --repo "$repo" \ + --body "$body" 2>&1 || echo "WARNING: Failed to post PR comment via gh" + elif [ -n "${GITHUB_TOKEN:-}" ]; then + echo "Posting failure report to PR #${PULL_NUMBER} via GitHub API..." + local escaped_body + escaped_body=$(echo "$body" | python3 -c 'import sys,json; print(json.dumps(sys.stdin.read()))' 2>/dev/null || echo "\"report generation failed\"") + curl -s -X POST \ + -H "Authorization: token ${GITHUB_TOKEN}" \ + -H "Accept: application/vnd.github.v3+json" \ + "https://api.github.com/repos/${repo}/issues/${PULL_NUMBER}/comments" \ + -d "{\"body\": ${escaped_body}}" 2>&1 || echo "WARNING: Failed to post PR comment via API" + else + echo "No GitHub credentials available (no gh CLI, no GITHUB_TOKEN), skipping PR comment" + echo "Report saved to: ${report_md}" + fi +} + +# Main execution +# Generated by Claude Sonnet 4.5 +main() { + # Deploy DWO + if ! deployDWO; then + echo "❌ FAILED: DevWorkspace Operator deployment" + recordAttempt "1/1" "deployDWO" "FAILED" "DWO controller deployment failed" + return 1 + fi + + # Deploy Che with retry logic + if ! 
deployAndVerifyChe; then + echo "❌ FAILED: Eclipse Che deployment" + return 1 + fi + + # Run the happy path test (with 1 retry) + if ! runHappyPathTest 1; then + local test_reason + test_reason=$(classifyTestFailure) + recordAttempt "1/2" "runHappyPathTest" "FAILED" "$test_reason" + + echo "⚠️ Happy path test failed on first attempt, retrying..." + sleep 30 + if ! runHappyPathTest 2; then + test_reason=$(classifyTestFailure) + recordAttempt "2/2" "runHappyPathTest" "FAILED" "$test_reason" + echo "❌ FAILED: Happy path test execution (2 attempts)" + return 1 + fi + recordAttempt "2/2" "runHappyPathTest" "PASSED" "-" + else + recordAttempt "1/1" "runHappyPathTest" "PASSED" "-" + fi + + echo "" + echo "✅ SUCCESS: All tests passed!" + return 0 +} + +# Run main function +main +main_result=$? + +# Generate and post failure report (always, even on success for retry tracking) +if [ "$main_result" -eq 0 ]; then + generateReport "PASSED" +else + generateReport "FAILED" +fi +commentPR + +exit $main_result