feat(pipeline): add post-deploy health verification (VERIFYING status) #46

Open
microbluey wants to merge 3 commits into wellCh4n:main from microbluey:feat/post-deploy-health-verification

@microbluey microbluey commented May 31, 2026

What

Deploys and rollbacks were previously marked SUCCEEDED as soon as the K8s server-side apply returned, before the StatefulSet rollout actually became ready. A bad image (ImagePullBackOff) or pods that never start still showed success. This is the platform-wide follow-up deliberately split out of #45 (image-only rollback).

Approach

Introduce a VERIFYING status between DEPLOYING and SUCCEEDED. After the artifact is applied, the pipeline enters VERIFYING (with a deadline), and the existing 5s scan job (PipelineInstanceScanJob) polls the StatefulSet rollout to decide SUCCEEDED or ERROR. The check is asynchronous and its state is persisted, so it survives a process restart during the verify window. It covers all three deploy paths: manual deploy, rollback, and post-build auto-deploy.

Health judgement

  • Fail fast: any pod container state.waiting.reason ∈ {ImagePullBackOff, ErrImagePull, CrashLoopBackOff} → ERROR (no need to wait for the timeout)
  • Success: StatefulSet rollout converged (observedGeneration caught up + updatedReplicas == readyReplicas == spec.replicas)
  • Timeout: verify_deadline elapsed without becoming ready → ERROR
  • Otherwise: stay VERIFYING, re-check next tick
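
The four rules above can be sketched as a single pure function. This is only an illustration under stated assumptions: `Verdict`, `RolloutSnapshot`, and `HealthJudge` are hypothetical names (the PR's real types are `DeploymentHealth` and the scan job), and the evaluation order shown here (fail fast, then success, then timeout) is assumed from the order of the list rather than confirmed by the diff.

```java
import java.time.Instant;
import java.util.List;
import java.util.Set;

// Hypothetical names; the PR's actual DeploymentHealth DTO may be shaped differently.
enum Verdict { SUCCEEDED, ERROR, VERIFYING }

record RolloutSnapshot(
        long generation,               // metadata.generation of the StatefulSet
        long observedGeneration,       // status.observedGeneration
        int specReplicas,              // spec.replicas
        int updatedReplicas,           // status.updatedReplicas
        int readyReplicas,             // status.readyReplicas
        List<String> waitingReasons) { // state.waiting.reason of each pod container
}

final class HealthJudge {
    private static final Set<String> FATAL_REASONS =
            Set.of("ImagePullBackOff", "ErrImagePull", "CrashLoopBackOff");

    private HealthJudge() {}

    static Verdict judge(RolloutSnapshot s, Instant verifyDeadline, Instant now) {
        // Fail fast: a fatal waiting reason ends verification without waiting for the timeout.
        if (s.waitingReasons().stream().anyMatch(FATAL_REASONS::contains)) {
            return Verdict.ERROR;
        }
        // Success: the controller has seen the latest spec and all replicas are updated and ready.
        boolean observed = s.observedGeneration() >= s.generation();
        boolean converged = s.updatedReplicas() == s.specReplicas()
                && s.readyReplicas() == s.specReplicas();
        if (observed && converged) {
            return Verdict.SUCCEEDED;
        }
        // Timeout: the deadline has passed without the rollout becoming ready.
        if (!now.isBefore(verifyDeadline)) {
            return Verdict.ERROR;
        }
        // Otherwise: stay in VERIFYING and re-check on the next scan tick.
        return Verdict.VERIFYING;
    }
}
```

Keeping the decision free of Kubernetes client calls means each branch can be unit-tested with a hand-built snapshot, which matches the four cases listed for PipelineVerificationScanTests.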

Key decisions

  1. The timeout is anchored by a new verify_deadline column on pipeline, added in Flyway migration V11 (BaseDataObject is NOT touched).
  2. VERIFYING is not stoppable (apply already done); the state machine only allows VERIFYING → {SUCCEEDED, ERROR}.
  3. Failure does not auto-rollback — it marks ERROR + sends a FAILED notification, leaving the bad version live for a human to decide.
  4. Toggle oops.pipeline.health.{enabled,timeout} (default on, 5m); enabled=false reverts to immediate SUCCEEDED (the DEPLOYING → SUCCEEDED edge is kept for this path).
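
Decisions 2 and 4 together describe a small transition table plus a toggle. A minimal sketch, assuming hypothetical names (`Status`, `TransitionSketch`, `afterApply`) and listing only the edges named in this description; the real PipelineStatus enum and state machine have more statuses and edges that are omitted here.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical subset of the pipeline statuses; only the ones discussed above.
enum Status { DEPLOYING, VERIFYING, SUCCEEDED, ERROR }

final class TransitionSketch {
    private static final Map<Status, Set<Status>> ALLOWED = new EnumMap<>(Status.class);

    static {
        // DEPLOYING -> SUCCEEDED is kept for the enabled=false path.
        ALLOWED.put(Status.DEPLOYING, EnumSet.of(Status.VERIFYING, Status.SUCCEEDED));
        // VERIFYING is not stoppable: it can only resolve to SUCCEEDED or ERROR.
        ALLOWED.put(Status.VERIFYING, EnumSet.of(Status.SUCCEEDED, Status.ERROR));
    }

    private TransitionSketch() {}

    static boolean canTransition(Status from, Status to) {
        return ALLOWED.getOrDefault(from, EnumSet.noneOf(Status.class)).contains(to);
    }

    // Status to enter once the apply has returned, depending on the health toggle.
    static Status afterApply(boolean healthEnabled) {
        return healthEnabled ? Status.VERIFYING : Status.SUCCEEDED;
    }
}
```

With this shape, `afterApply(false)` returns SUCCEEDED directly, which is the legacy behavior the toggle is meant to preserve.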

Backend

  • State machine / Pipeline / PipelineStatus; verify_deadline column + updateStatusAndDeadlineIfMatch.
  • ApplicationRuntimeGateway.getDeploymentHealth + DeploymentHealth DTO (reads StatefulSet rollout status + fatal pod waiting reasons).
  • PipelineHealthProperties + both application.yml.example samples; VERIFYING notification type.
  • Wired into PipelineService (deploy + rollback share a completeDeployPhase helper) and the scan job's IMMEDIATE branch + new scanVerifyingPipelines.
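
The name updateStatusAndDeadlineIfMatch suggests a conditional (compare-and-set) update: status and deadline change together, and only if the row is still in the expected status. The sketch below models that semantics with an in-memory map. It is an assumption about intent, not the PR's code: the real method lives in the JPA layer and its signature is not shown here, and `PipelineRow` and `InMemoryPipelineStore` are hypothetical.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical row shape: just the two fields the conditional update touches.
record PipelineRow(String status, Instant verifyDeadline) {}

final class InMemoryPipelineStore {
    private final Map<String, PipelineRow> rows = new ConcurrentHashMap<>();

    void put(String id, PipelineRow row) {
        rows.put(id, row);
    }

    PipelineRow get(String id) {
        return rows.get(id);
    }

    // Returns true only if the row existed and was still in the expected status.
    boolean updateStatusAndDeadlineIfMatch(String id, String expected, String next, Instant deadline) {
        boolean[] updated = {false};
        // computeIfPresent runs atomically per key on ConcurrentHashMap.
        rows.computeIfPresent(id, (key, row) -> {
            if (!row.status().equals(expected)) {
                return row; // someone else already moved the pipeline; leave it alone
            }
            updated[0] = true;
            return new PipelineRow(next, deadline);
        });
        return updated[0];
    }
}
```

A guard like this is what lets two overlapping scan ticks, or a scan tick racing a restart, avoid writing the same transition twice: the second caller sees `false` and can skip its notification.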

Frontend

  • VERIFYING added to the PipelineStatus union, treated as an in-progress (primary) badge, 4-locale i18n, help docs. Stop and rollback buttons intentionally not enabled for VERIFYING.

Tests

  • PipelineStateMachineTests (+3 VERIFYING edges), new PipelineVerificationScanTests (4: success / fail-fast / timeout / in-progress), new PipelineHealthVerificationTests (deploy → VERIFYING), PipelineRollbackTests (constructor updated, still green).
  • All unit tests pass via ./mvnw test. The single OopsApplicationTests.contextLoads error is the pre-existing no-DB smoke test, unrelated to this change.
  • pnpm build passes.

Deploys and rollbacks were previously marked SUCCEEDED right after the
K8s server-side apply returned, before the StatefulSet rollout actually
became ready -- a bad image (ImagePullBackOff) still showed success.

Introduce a VERIFYING status between DEPLOYING and SUCCEEDED: after the
artifact is applied the pipeline enters VERIFYING (with a deadline), and
the existing 5s scan job polls the rollout to decide SUCCEEDED or ERROR.
Async + persisted, so it survives a restart during the verify window.

- State machine: DEPLOYING -> VERIFYING -> {SUCCEEDED, ERROR}; VERIFYING
  is not stoppable. DEPLOYING -> SUCCEEDED kept for the disabled path.
- verify_deadline column on pipeline (V11) anchors the timeout.
- ApplicationRuntimeGateway.getDeploymentHealth + DeploymentHealth:
  fail fast on ImagePullBackOff/ErrImagePull/CrashLoopBackOff, succeed on
  converged rollout (observedGeneration + updated==ready==desired),
  else time out.
- oops.pipeline.health.{enabled,timeout} (default on, 5m);
  enabled=false reverts to immediate SUCCEEDED.
- Wired into PipelineService (deploy + rollback) and the scan job's
  IMMEDIATE branch + new scanVerifyingPipelines.
- VERIFYING notification type; frontend status badge + 4-locale i18n.
- Failure does not auto-rollback: mark ERROR + notify, leave it to a human.

Tests: state-machine edges, scan-job verify decisions, deploy->VERIFYING.
@wellCh4n wellCh4n assigned wellCh4n and microbluey and unassigned wellCh4n May 31, 2026
Owner

@wellCh4n wellCh4n left a comment

This seems to be what Kubernetes liveness probes are designed for:

https://kubernetes.io/docs/concepts/workloads/pods/probes/#liveness-probe

The container will be restarted automatically when the probe fails.

I’d prefer extending the existing health check mechanism to support this use case rather than introducing a separate implementation.

…fication

The existing health check only configured a liveness probe. Add a
readiness probe built from the same HealthCheck config: readiness gates
Service traffic and drives readyReplicas, which is what post-deploy
verification (VERIFYING) keys off to decide a rollout is actually healthy
rather than merely started. This makes the existing health-check config
the single source of truth, instead of introducing a parallel notion.
@microbluey
Contributor Author

Good call — I've pushed a change to build on the existing health-check mechanism rather than add a parallel one.

First, to clarify what VERIFYING is for, since I think we're after two different things:

  • A liveness probe only runs after the container starts, and on failure restarts the container. The most common bad-deploy failure is ImagePullBackOff / ErrImagePull (or a pod stuck Pending) — the container never starts, so liveness never runs and there's nothing to restart. Liveness can't catch those.
  • More fundamentally, this PR is about the pipeline status, not keeping a running container alive. Today deploy() server-side-applies and immediately marks the pipeline SUCCEEDED, before the rollout is ready. A probe failure happens inside Kubernetes and is never reported back, so the pipeline shows a false SUCCEEDED and the operator gets no notification. That's the gap VERIFYING closes.

On your point about reusing the existing mechanism — agreed, and done (2f1af1c):

  • The existing health check only configured a liveness probe. I've added a readiness probe from the same HealthCheck config (StatefulSetProcessor). Readiness is what gates Service traffic and drives readyReplicas, which is exactly what VERIFYING keys off — so the existing health-check config is now the single source of truth for "is this rollout healthy," and the pipeline's only job is to read that outcome back and notify on failure. (Adding readiness is a worthwhile fix on its own — running only a liveness probe with no readiness is a gap regardless of this feature.)

One honest caveat: when a user hasn't enabled the health check, readyReplicas only means "container started." In that case VERIFYING still adds value via the ImagePullBackOff / Pending-timeout detection, which a probe can't provide. So the two are complementary: the readiness probe sharpens "ready" when configured, and the rollout-status read is what reports it back to the pipeline either way.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds post-deploy health verification so pipelines move through VERIFYING after Kubernetes apply and only become SUCCEEDED once the StatefulSet rollout is healthy.

Changes:

  • Adds VERIFYING status across backend state machine, persistence, notifications, scan job, and frontend labels/badges.
  • Adds Kubernetes rollout health inspection and verification timeout configuration.
  • Adds tests for state transitions, rollback/deploy verification behavior, and scan-job outcomes.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/main/java/com/github/wellch4n/oops/application/service/PipelineService.java | Routes deploy/rollback completion through VERIFYING or legacy success. |
| src/main/java/com/github/wellch4n/oops/infrastructure/scheduler/PipelineInstanceScanJob.java | Polls verifying pipelines and marks success/error. |
| src/main/java/com/github/wellch4n/oops/infrastructure/kubernetes/KubernetesApplicationRuntimeGateway.java | Adds StatefulSet and pod health snapshot logic. |
| src/main/java/com/github/wellch4n/oops/infrastructure/kubernetes/task/processor/StatefulSetProcessor.java | Adds readiness probe generation. |
| src/main/java/com/github/wellch4n/oops/domain/** | Adds VERIFYING status and allowed transitions. |
| src/main/java/com/github/wellch4n/oops/application/** | Adds health DTO, gateway/repository methods, and notification type. |
| src/main/java/com/github/wellch4n/oops/infrastructure/persistence/jpa/** | Persists verifyDeadline and conditional status/deadline updates. |
| src/main/resources/db/migration/V11__add_pipeline_verify_deadline.sql | Adds pipeline.verify_deadline. |
| src/main/java/com/github/wellch4n/oops/infrastructure/config/PipelineHealthProperties.java | Adds oops.pipeline.health configuration. |
| config/application.yml.example, docker/application.yml.example | Documents health verification settings. |
| web/** | Adds VERIFYING API type, labels, badge treatment, and help docs. |
| src/test/java/** | Adds/updates tests for verification and rollback paths. |

Comment thread web/app/help/docs/pipelines/page.tsx Outdated
Co-authored-by: Codex <codex@openai.com>
@microbluey
Contributor Author

Thanks, addressed the latest review comments.

  • Replaced the help-doc status value FAILED with the actual API enum ERROR.
  • Added deadline handling in the VERIFYING error path, so repeated getDeploymentHealth failures cannot leave a pipeline stuck indefinitely.
  • Centralized active deployment statuses in DeploymentConcurrencyPolicy and included VERIFYING, so deploy/manual deploy/rollback all block while a previous rollout is still being verified.
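
The deadline handling in the error path can be pictured roughly as follows. This is a sketch under assumptions, not the code from the commit: `ScanOutcome` and `VerifyTick` are hypothetical, and the health query is reduced to a `BooleanSupplier` standing in for a call such as getDeploymentHealth. The point it illustrates is that a failing query is treated as "not ready yet", so the deadline check still runs and the pipeline cannot stay in VERIFYING forever.

```java
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Hypothetical outcome of one scan tick for a pipeline in VERIFYING.
enum ScanOutcome { SUCCEEDED, ERROR, STILL_VERIFYING }

final class VerifyTick {
    private VerifyTick() {}

    static ScanOutcome tick(BooleanSupplier rolloutReady, Instant verifyDeadline, Instant now) {
        boolean ready;
        try {
            ready = rolloutReady.getAsBoolean();
        } catch (RuntimeException e) {
            // A failed health query must not skip the deadline check below.
            ready = false;
        }
        if (ready) {
            return ScanOutcome.SUCCEEDED;
        }
        // Not ready, or the query failed: give up once the deadline has passed.
        return now.isBefore(verifyDeadline) ? ScanOutcome.STILL_VERIFYING : ScanOutcome.ERROR;
    }
}
```

Before the deadline a failing query simply leaves the pipeline in VERIFYING for the next tick; after it, the same failure resolves to ERROR.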

I also added coverage for the active-status set and the timeout-after-health-query-error path.

Verified the affected backend tests locally: DeploymentConcurrencyPolicyTests, PipelineVerificationScanTests, PipelineHealthVerificationTests, PipelineRollbackTests.

Result: Tests run: 14, Failures: 0, Errors: 0, Skipped: 0.


3 participants