Che Happy-Path Test

Script: .ci/oci-devworkspace-happy-path.sh
Purpose: Integration test validating the DevWorkspace Operator with an Eclipse Che deployment

Overview

This script deploys and validates the full DevWorkspace Operator + Eclipse Che stack on OpenShift, ensuring the happy-path user workflow succeeds. It's used in the v14-che-happy-path Prow CI test.

Features

Retry Logic

  • Che deployment: 2 attempts with exponential backoff (60s base + jitter); see the sketch after this list
  • Cleanup: Waits for CheCluster CR deletion before retry
  • Happy-path test retry: 1 retry with 30s delay if Selenium test fails
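
A minimal sketch of this retry pattern; the function names are placeholders, not the script's actual helpers:

# Illustrative retry loop with exponential backoff and jitter.
attempt=1
while true; do
  if deploy_che; then break; fi   # placeholder for the real deployment step
  if [ "$attempt" -ge "${MAX_RETRIES:-2}" ]; then
    echo "Che deployment failed after ${MAX_RETRIES:-2} attempts"; exit 1
  fi
  cleanup_che   # placeholder: delete the CheCluster CR and wait for it to disappear
  # Exponential backoff: base delay doubles per attempt, plus random jitter
  delay=$(( ${BASE_DELAY:-60} * (1 << (attempt - 1)) + RANDOM % ${MAX_JITTER:-15} ))
  echo "Attempt ${attempt}/${MAX_RETRIES:-2} failed. Retrying in ${delay}s..."
  sleep "$delay"
  attempt=$((attempt + 1))
done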

Health Checks

  • OLM: Verifies catalog-operator and olm-operator are available before Che deployment (2-minute timeout each)
  • DWO: Waits for deployment condition=available (5-minute timeout); see the example after this list
  • Che: chectl's built-in readiness checks ensure deployment is healthy
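
The DWO readiness check maps to a standard kubectl wait. A minimal sketch, assuming the usual DWO deployment name and namespace (verify against the script):

# Wait up to 5 minutes for the DWO controller deployment to become Available.
# The deployment name and namespace below are assumptions for illustration.
kubectl wait deployment/devworkspace-controller-manager \
  -n devworkspace-controller \
  --for=condition=Available \
  --timeout=300s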

Artifact Collection

On each failure, the script collects the following (a sketch of typical collection commands follows the list):

  • OLM diagnostics (Subscription, InstallPlan, CSV, CatalogSource)
  • CatalogSource pod logs
  • Che operator logs (last 1000 lines)
  • CheCluster CR status (full YAML)
  • All pod logs from Che namespace
  • Kubernetes events
  • chectl server logs
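
A sketch of the kind of commands behind this collection; the resource names and file paths are illustrative and may differ from the script's exact invocations:

# Illustrative collection commands; exact resource names and paths may differ.
mkdir -p "$ARTIFACT_DIR/eclipse-che-info"
kubectl logs deployment/che-operator -n "$CHE_NAMESPACE" --tail=1000 \
  > "$ARTIFACT_DIR/che-operator-logs-attempt-1.log" 2>/dev/null || true
kubectl get checluster -n "$CHE_NAMESPACE" -o yaml \
  > "$ARTIFACT_DIR/checluster-status-attempt-1.yaml" 2>/dev/null || true
kubectl get events -n "$CHE_NAMESPACE" --sort-by=.lastTimestamp \
  > "$ARTIFACT_DIR/eclipse-che-info/events.log" 2>/dev/null || true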

Error Handling

  • Graceful error handling with stage-specific messages
  • Progress indicators: "Attempt 1/2", "Retrying in 71s..."
  • Failures are handled and reported instead of crashing the script

Configuration

Environment variables (all optional except DEVWORKSPACE_OPERATOR; a sketch of how the defaults are typically applied follows the table):

| Variable | Default | Description |
| --- | --- | --- |
| CHE_NAMESPACE | eclipse-che | Namespace for Che deployment |
| MAX_RETRIES | 2 | Maximum retry attempts |
| BASE_DELAY | 60 | Base delay in seconds for exponential backoff |
| MAX_JITTER | 15 | Maximum jitter in seconds |
| ARTIFACT_DIR | /tmp/dwo-e2e-artifacts | Directory for diagnostic artifacts |
| DEVWORKSPACE_OPERATOR | (required) | DWO image to deploy |
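
A minimal sketch of how these defaults are typically applied in a shell script; the variable names come from the table above, but the script's actual handling may differ:

# Fall back to the documented defaults when unset; fail fast on the required variable.
export CHE_NAMESPACE="${CHE_NAMESPACE:-eclipse-che}"
export MAX_RETRIES="${MAX_RETRIES:-2}"
export BASE_DELAY="${BASE_DELAY:-60}"
export MAX_JITTER="${MAX_JITTER:-15}"
export ARTIFACT_DIR="${ARTIFACT_DIR:-/tmp/dwo-e2e-artifacts}"
: "${DEVWORKSPACE_OPERATOR:?DEVWORKSPACE_OPERATOR must be set to the DWO image to deploy}"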

Usage

In Prow CI

The script is called automatically by the v14-che-happy-path Prow job. Prow sets DEVWORKSPACE_OPERATOR based on the context:

For PR checks (testing PR code):

export DEVWORKSPACE_OPERATOR="quay.io/devfile/devworkspace-controller:pr-${PR_NUMBER}-${COMMIT_SHA}"
./.ci/oci-devworkspace-happy-path.sh

For periodic/nightly runs (testing main branch):

export DEVWORKSPACE_OPERATOR="quay.io/devfile/devworkspace-controller:next"
./.ci/oci-devworkspace-happy-path.sh

Local Testing

export DEVWORKSPACE_OPERATOR="quay.io/youruser/devworkspace-controller:your-tag"
export ARTIFACT_DIR="/tmp/my-test-artifacts"
./.ci/oci-devworkspace-happy-path.sh

Test Flow

  1. Deploy DWO

    • Runs make install
    • Waits for controller deployment to be available
    • Collects artifacts if deployment fails
  2. Deploy Che (with retry)

    • Runs chectl server:deploy with extended timeouts (24h)
    • chectl handles readiness checks internally
    • Collects artifacts on failure
    • Cleans up and retries if needed
  3. Run Happy-Path Test

    • Downloads test script from Eclipse Che repository
    • Executes Che happy-path workflow
    • Retries once after 30s if test fails
    • Collects artifacts on failure

Exit Codes

  • 0: Success - All stages completed
  • 1: Failure - Check $ARTIFACT_DIR for diagnostics

Timeouts

| Component | Timeout | Purpose |
| --- | --- | --- |
| DWO deployment | 5 minutes | Pod becomes available |
| chectl pod wait/ready | 24 hours | Generous for slow environments |

Common Failures

OLM Infrastructure Not Ready

Symptoms: "ERROR: OLM infrastructure is not healthy, cannot proceed with Che deployment" Check: $ARTIFACT_DIR/olm-diagnostics-olm-check.yaml Common causes:

  • OLM operators not running (catalog-operator, olm-operator)
  • Cluster provisioning issues during bootstrap
  • Resource constraints preventing OLM operator scheduling

Resolution: This indicates a fundamental cluster infrastructure issue. Check cluster health and OLM operator logs before retrying.

DWO Deployment Fails

Symptoms: "ERROR: DWO controller is not ready" Check: $ARTIFACT_DIR/devworkspace-controller-info/ Common causes: Image pull errors, resource constraints, webhook conflicts

Che Deployment Timeout

Symptoms: "ERROR: chectl server:deploy failed" with timeout-related messages Check: $ARTIFACT_DIR/che-operator-logs-attempt-*.log, $ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml, $ARTIFACT_DIR/chectl-logs-attempt-*/ Common causes:

  • OLM subscription timeout (check olm-diagnostics for subscription state)
  • Database connection issues
  • Image pull failures
  • Operator reconciliation errors
  • chectl timeout waiting for pods/resources to become ready
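
To inspect the OLM-related causes above directly on the cluster rather than through the collected artifacts, the standard checks are (the subscription name matches the monitoring example in the Advanced Troubleshooting section; adjust the namespace if Che is deployed elsewhere):

# Inspect the Subscription and any CSV in the Che namespace
kubectl describe subscription eclipse-che -n eclipse-che
kubectl get csv -n eclipse-che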

Pod CrashLoopBackOff

Symptoms: "ERROR: chectl server:deploy failed" Check: $ARTIFACT_DIR/eclipse-che-info/ for pod logs Common causes: Configuration errors, resource limits, TLS certificate issues

OLM Subscription Stuck

Symptoms: Subscription timeout after 120 seconds with no resources created
Check: $ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml, $ARTIFACT_DIR/catalogsource-logs-attempt-*.log
Common causes:

  • CatalogSource pod not pulling/running
  • InstallPlan not created (subscription cannot resolve dependencies)
  • Cluster resource exhaustion preventing operator pod scheduling

Resolution: Check OLM operator logs and CatalogSource pod status. See the "Advanced Troubleshooting" section for monitoring and alternative deployment options.

Artifact Locations

After a failed test run:

$ARTIFACT_DIR/
├── attempt-log.txt
├── failure-report.json
├── failure-report.md
├── devworkspace-controller-info/
│   ├── <pod-name>-<container>.log
│   └── events.log
├── eclipse-che-info/
│   ├── <pod-name>-<container>.log
│   └── events.log
├── che-operator-logs-attempt-1.log
├── che-operator-logs-attempt-2.log
├── checluster-status-attempt-1.yaml
├── checluster-status-attempt-2.yaml
├── olm-diagnostics-attempt-1.yaml
├── olm-diagnostics-attempt-2.yaml
├── catalogsource-logs-attempt-1.log
├── catalogsource-logs-attempt-2.log
├── chectl-logs-attempt-1/
└── chectl-logs-attempt-2/

Dependencies

  • kubectl - Kubernetes CLI
  • oc - OpenShift CLI (for log collection)
  • chectl - Eclipse Che CLI (v7.114.0+)
  • jq - JSON processor (for chectl)
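
Before a local run, a quick pre-flight check that these CLIs are available can save a failed attempt (a minimal sketch; the script itself may or may not perform this check):

# Verify the required CLIs are on PATH before starting
for tool in kubectl oc chectl jq; do
  command -v "$tool" >/dev/null || { echo "Missing dependency: $tool"; exit 1; }
done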

Advanced Troubleshooting

OLM Infrastructure Issues

If you experience persistent OLM subscription timeouts (see olm-diagnostics-*.yaml artifacts):

Option 1: OLM Health Check (Implemented)

The script now verifies OLM infrastructure health before deploying Che:

  • Checks catalog-operator is available
  • Checks olm-operator is available
  • Verifies openshift-marketplace is accessible

If OLM is unhealthy, the test fails fast with diagnostic artifacts instead of waiting through timeouts.
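
A minimal sketch of what such a check can look like, using the OLM deployments named above (the script's exact commands and error handling may differ):

# Fail fast if the core OLM operators are not Available within 2 minutes each
for deploy in catalog-operator olm-operator; do
  kubectl wait deployment/"$deploy" \
    -n openshift-operator-lifecycle-manager \
    --for=condition=Available --timeout=120s || exit 1
done
# Confirm the marketplace namespace is reachable
kubectl get namespace openshift-marketplace >/dev/null || exit 1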

Option 2: Monitor Subscription Progress (Advanced)

For debugging stuck subscriptions, you can add active monitoring to detect zero-progress scenarios earlier:

# Example: Monitor subscription state every 10 seconds
elapsed=0
while [ $elapsed -lt 300 ]; do
  state=$(kubectl get subscription eclipse-che -n eclipse-che \
    -o jsonpath='{.status.state}' 2>/dev/null)
  echo "[$elapsed/300s] Subscription state: ${state:-unknown}"
  if [ "$state" = "AtLatestKnown" ]; then
    break
  fi
  sleep 10
  elapsed=$((elapsed + 10))
done

This helps identify whether subscriptions are progressing slowly vs. completely stuck.

Option 3: Skip OLM Installation (Alternative Approach)

For CI environments with persistent OLM issues, consider deploying Che operator directly instead of via OLM:

# --installer=operator uses direct YAML deployment instead of OLM
chectl server:deploy \
  --installer=operator \
  -p openshift \
  --batch \
  --telemetry=off \
  --skip-devworkspace-operator \
  --chenamespace="$CHE_NAMESPACE"

Trade-offs:

  • ✅ Bypasses OLM infrastructure entirely
  • ✅ More reliable in resource-constrained CI environments
  • ❌ Doesn't test OLM integration path (used by production OperatorHub)
  • ❌ May miss OLM-specific issues

When to use: Temporary workaround for CI infrastructure issues while OLM problems are being resolved.

Subscription Timeout Issues

If OLM subscriptions consistently timeout (visible in olm-diagnostics-*.yaml):

  1. Check OLM operator logs:

    kubectl logs -n openshift-operator-lifecycle-manager \
      deployment/catalog-operator --tail=100
    kubectl logs -n openshift-operator-lifecycle-manager \
      deployment/olm-operator --tail=100
  2. Verify CatalogSource pod is running:

    kubectl get pods -n openshift-marketplace \
      -l olm.catalogSource=eclipse-che
    kubectl logs -n openshift-marketplace \
      -l olm.catalogSource=eclipse-che
  3. Check InstallPlan creation:

    kubectl get installplan -n eclipse-che -o yaml
    • If no InstallPlan exists, OLM couldn't resolve the subscription
    • If InstallPlan exists but isn't complete, check its status conditions
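
If an InstallPlan does exist, its approval flag and phase can be read at a glance (illustrative jsonpath; adjust the namespace if Che is deployed elsewhere):

# One line per InstallPlan: name, approved flag, and current phase
kubectl get installplan -n eclipse-che \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.approved}{"\t"}{.status.phase}{"\n"}{end}'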

CI Failure Reports

The script automatically generates failure reports and posts them as PR comments after each run (both failures and successes with retries). Do not delete these comments — they are used to track flakiness patterns across PRs.

What gets reported

Each report includes a table of all attempts with:

  • Attempt: Which attempt number (e.g., 1/2, 2/2)
  • Stage: Which function failed (deployChe, runHappyPathTest, etc.)
  • Result: PASSED or FAILED
  • Reason: Classified failure reason (e.g., "Che operator reconciliation failure")

Failure categories

| Category | Meaning | Retryable? |
| --- | --- | --- |
| INFRA | Infrastructure issue (OLM, image pull, operator reconciliation) | Yes — /retest |
| TEST | Test execution issue (Dashboard UI timeout, workspace start) | Maybe |
| MIXED | Both infrastructure and test issues across attempts | Yes — /retest |
| UNKNOWN | Could not classify — check artifacts | Investigate |

Report artifacts

Reports are always saved to $ARTIFACT_DIR/ regardless of whether PR commenting succeeds:

  • failure-report.json — structured data for programmatic analysis
  • failure-report.md — human-readable markdown (same as the PR comment)
  • attempt-log.txt — raw attempt tracking log

Why these comments matter

Over time, these reports reveal:

  • Which failure categories are most common
  • Whether flakiness is improving or worsening
  • Which infrastructure components are least reliable
  • Whether retry logic is effective (passed-on-retry patterns)

Related Documentation