feat(pipeline): add post-deploy health verification (VERIFYING status) by microbluey · Pull Request #46 · wellCh4n/oops

microbluey · 2026-05-31T02:55:25Z

What

Deploys and rollbacks previously marked SUCCEEDED right after the K8s server-side apply returned — before the StatefulSet rollout actually became ready. A bad image (ImagePullBackOff) or pods that never start still showed success. This is the platform-wide follow-up deliberately split out of #45 (image-only rollback).

Approach

Introduce a VERIFYING status between DEPLOYING and SUCCEEDED. After the artifact is applied the pipeline enters VERIFYING (with a deadline), and the existing 5s scan job (PipelineInstanceScanJob) polls the StatefulSet rollout to decide SUCCEEDED or ERROR. Async + persisted, so it survives a process restart during the verify window. Covers all three deploy paths: manual deploy, rollback, and post-build auto-deploy.

Health judgement

Fail fast: any pod container state.waiting.reason ∈ {ImagePullBackOff, ErrImagePull, CrashLoopBackOff} → ERROR (no need to wait for the timeout)
Success: StatefulSet rollout converged (observedGeneration caught up + updatedReplicas == readyReplicas == spec.replicas)
Timeout: verify_deadline elapsed without becoming ready → ERROR
Otherwise: stay VERIFYING, re-check next tick
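
The four rules above can be read as a single decision function evaluated on every scan tick. The following is a minimal sketch of that ordering, not the PR's actual code: `RolloutSnapshot`, `Outcome`, and `decide` are hypothetical names, and the snapshot fields only mirror the StatefulSet status values named in the list.

```java
import java.time.Instant;
import java.util.List;
import java.util.Set;

final class VerifyDecisionSketch {

    enum Outcome { SUCCEEDED, ERROR, VERIFYING }

    // Hypothetical snapshot of one StatefulSet rollout plus the waiting reasons
    // of its pod containers. Field names are illustrative, not the PR's DTO.
    record RolloutSnapshot(long generation, long observedGeneration,
                           int specReplicas, int updatedReplicas, int readyReplicas,
                           List<String> waitingReasons) {}

    static final Set<String> FATAL_WAITING_REASONS =
            Set.of("ImagePullBackOff", "ErrImagePull", "CrashLoopBackOff");

    static Outcome decide(RolloutSnapshot s, Instant verifyDeadline, Instant now) {
        // 1. Fail fast: a fatal waiting reason ends verification immediately.
        if (s.waitingReasons().stream().anyMatch(FATAL_WAITING_REASONS::contains)) {
            return Outcome.ERROR;
        }
        // 2. Success: the controller has observed the new spec and all replicas
        //    are both updated and ready.
        boolean converged = s.observedGeneration() >= s.generation()
                && s.updatedReplicas() == s.specReplicas()
                && s.readyReplicas() == s.specReplicas();
        if (converged) {
            return Outcome.SUCCEEDED;
        }
        // 3. Timeout: the deadline passed without convergence.
        if (!now.isBefore(verifyDeadline)) {
            return Outcome.ERROR;
        }
        // 4. Otherwise keep waiting and re-check on the next tick.
        return Outcome.VERIFYING;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        RolloutSnapshot rolling = new RolloutSnapshot(2, 2, 3, 3, 1, List.of());
        System.out.println(decide(rolling, now.plusSeconds(300), now));
    }
}
```

Checking the fatal reasons first is what lets a bad image fail within one tick instead of waiting out the full timeout.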

Key decisions

Timeout anchored by a new verify_deadline column on pipeline (does NOT touch BaseDataObject), Flyway V11.
VERIFYING is not stoppable (apply already done); the state machine only allows VERIFYING → {SUCCEEDED, ERROR}.
Failure does not auto-rollback — it marks ERROR + sends a FAILED notification, leaving the bad version live for a human to decide.
Toggle oops.pipeline.health.{enabled,timeout} (default on, 5m); enabled=false reverts to immediate SUCCEEDED (the DEPLOYING → SUCCEEDED edge is kept for this path).
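
To make the allowed edges concrete, here is a small sketch of the transitions this description mentions. It is illustrative only: the class and method names are assumptions, the real state machine has more statuses and edges than the four shown, and `afterApply` simply models the `enabled` toggle.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

final class VerifyingTransitionsSketch {

    // Only the statuses discussed in this PR description; the real enum is larger.
    enum Status { DEPLOYING, VERIFYING, SUCCEEDED, ERROR }

    private static final Map<Status, Set<Status>> ALLOWED = new EnumMap<>(Status.class);

    static {
        // DEPLOYING -> SUCCEEDED is kept for the enabled=false path.
        ALLOWED.put(Status.DEPLOYING, EnumSet.of(Status.VERIFYING, Status.SUCCEEDED));
        // VERIFYING can only resolve to a terminal status; it cannot be stopped.
        ALLOWED.put(Status.VERIFYING, EnumSet.of(Status.SUCCEEDED, Status.ERROR));
    }

    static boolean canTransition(Status from, Status to) {
        return ALLOWED.getOrDefault(from, EnumSet.noneOf(Status.class)).contains(to);
    }

    // Status the pipeline moves to once the server-side apply has returned.
    static Status afterApply(boolean healthEnabled) {
        return healthEnabled ? Status.VERIFYING : Status.SUCCEEDED;
    }

    public static void main(String[] args) {
        System.out.println(canTransition(Status.VERIFYING, Status.DEPLOYING)); // false
        System.out.println(afterApply(true));                                  // VERIFYING
    }
}
```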

Backend

State machine / Pipeline / PipelineStatus; verify_deadline column + updateStatusAndDeadlineIfMatch.
ApplicationRuntimeGateway.getDeploymentHealth + DeploymentHealth DTO (reads StatefulSet rollout status + fatal pod waiting reasons).
PipelineHealthProperties + both application.yml.example samples; VERIFYING notification type.
Wired into PipelineService (deploy + rollback share a completeDeployPhase helper) and the scan job's IMMEDIATE branch + new scanVerifyingPipelines.
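
The name `updateStatusAndDeadlineIfMatch` suggests a compare-and-set update: the status and deadline change only if the row is still in the expected status. The sketch below shows that idea with an in-memory map standing in for the `pipeline` table. The signature, the `Row` record, and the string statuses are assumptions for illustration; the PR's repository method may differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ConditionalUpdateSketch {

    // One pipeline row; verifyDeadline stays null until the pipeline enters VERIFYING.
    record Row(String status, Instant verifyDeadline) {}

    private final Map<String, Row> rows = new ConcurrentHashMap<>();

    void insert(String id, String status) {
        rows.put(id, new Row(status, null));
    }

    Row find(String id) {
        return rows.get(id);
    }

    // Roughly the in-memory analogue of:
    //   UPDATE pipeline SET status = ?, verify_deadline = ? WHERE id = ? AND status = ?
    // Returns true only when the row existed and was still in the expected status.
    boolean updateStatusAndDeadlineIfMatch(String id, String expected, String next, Instant deadline) {
        boolean[] updated = {false};
        rows.computeIfPresent(id, (key, row) -> {
            if (!row.status().equals(expected)) {
                return row; // someone else already moved the pipeline on
            }
            updated[0] = true;
            return new Row(next, deadline);
        });
        return updated[0];
    }

    public static void main(String[] args) {
        ConditionalUpdateSketch repo = new ConditionalUpdateSketch();
        repo.insert("p1", "DEPLOYING");
        Instant deadline = Instant.now().plus(Duration.ofMinutes(5)); // default timeout: 5m
        System.out.println(repo.updateStatusAndDeadlineIfMatch("p1", "DEPLOYING", "VERIFYING", deadline)); // true
        System.out.println(repo.updateStatusAndDeadlineIfMatch("p1", "DEPLOYING", "VERIFYING", deadline)); // false
    }
}
```

Because the deadline is written together with the status, a process that restarts mid-verification can pick the pipeline up again from the persisted row.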

Frontend

VERIFYING added to the PipelineStatus union, treated as an in-progress (primary) badge, 4-locale i18n, help docs. Stop and rollback buttons intentionally not enabled for VERIFYING.

Tests

PipelineStateMachineTests (+3 VERIFYING edges), new PipelineVerificationScanTests (4: success / fail-fast / timeout / in-progress), new PipelineHealthVerificationTests (deploy → VERIFYING), PipelineRollbackTests (constructor updated, still green).
All unit tests pass via ./mvnw test. The single OopsApplicationTests.contextLoads error is the pre-existing no-DB smoke test, unrelated to this change.
pnpm build passes.

Deploys and rollbacks previously marked SUCCEEDED right after the K8s server-side apply returned, before the StatefulSet rollout actually became ready -- a bad image (ImagePullBackOff) still showed success.

Introduce a VERIFYING status between DEPLOYING and SUCCEEDED: after the artifact is applied the pipeline enters VERIFYING (with a deadline), and the existing 5s scan job polls the rollout to decide SUCCEEDED or ERROR. Async + persisted, so it survives a restart during the verify window.

- State machine: DEPLOYING -> VERIFYING -> {SUCCEEDED, ERROR}; VERIFYING is not stoppable. DEPLOYING -> SUCCEEDED kept for the disabled path.
- verify_deadline column on pipeline (V11) anchors the timeout.
- ApplicationRuntimeGateway.getDeploymentHealth + DeploymentHealth: fail fast on ImagePullBackOff/ErrImagePull/CrashLoopBackOff, succeed on converged rollout (observedGeneration + updated==ready==desired), else time out.
- oops.pipeline.health.{enabled,timeout} (default on, 5m); enabled=false reverts to immediate SUCCEEDED.
- Wired into PipelineService (deploy + rollback) and the scan job's IMMEDIATE branch + new scanVerifyingPipelines.
- VERIFYING notification type; frontend status badge + 4-locale i18n.
- Failure does not auto-rollback: mark ERROR + notify, leave it to a human.

Tests: state-machine edges, scan-job verify decisions, deploy->VERIFYING.

wellCh4n

This seems to be what Kubernetes liveness probes are designed for:

https://kubernetes.io/docs/concepts/workloads/pods/probes/#liveness-probe

The container will be restarted automatically when the probe fails.

I’d prefer extending the existing health check mechanism to support this use case rather than introducing a separate implementation.

…fication

The existing health check only configured a liveness probe. Add a readiness probe built from the same HealthCheck config: readiness gates Service traffic and drives readyReplicas, which is what post-deploy verification (VERIFYING) keys off to decide a rollout is actually healthy rather than merely started. This makes the existing health-check config the single source of truth, instead of introducing a parallel notion.

microbluey · 2026-05-31T15:43:04Z

Good call — I've pushed a change to build on the existing health-check mechanism rather than add a parallel one.

First, to clarify what VERIFYING is for, since I think we're after two different things:

A liveness probe only runs after the container starts, and on failure restarts the container. The most common bad-deploy failure is ImagePullBackOff / ErrImagePull (or a pod stuck Pending) — the container never starts, so liveness never runs and there's nothing to restart. Liveness can't catch those.
More fundamentally, this PR is about the pipeline status, not keeping a running container alive. Today deploy() server-side-applies and immediately marks the pipeline SUCCEEDED, before the rollout is ready. A probe failure happens inside Kubernetes and is never reported back, so the pipeline shows a false SUCCEEDED and the operator gets no notification. That's the gap VERIFYING closes.

On your point about reusing the existing mechanism — agreed, and done (2f1af1c):

The existing health check only configured a liveness probe. I've added a readiness probe from the same HealthCheck config (StatefulSetProcessor). Readiness is what gates Service traffic and drives readyReplicas, which is exactly what VERIFYING keys off — so the existing health-check config is now the single source of truth for "is this rollout healthy," and the pipeline's only job is to read that outcome back and notify on failure. (Adding readiness is a worthwhile fix on its own — running only a liveness probe with no readiness is a gap regardless of this feature.)

One honest caveat: when a user hasn't enabled the health check, readyReplicas only means "container started." In that case VERIFYING still adds value via the ImagePullBackOff / Pending-timeout detection, which a probe can't provide. So the two are complementary: the readiness probe sharpens "ready" when configured, and the rollout-status read is what reports it back to the pipeline either way.
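
As a rough illustration of "one config, two probes", the sketch below derives both a liveness and a readiness probe from a single health-check setting. Everything here is hypothetical: the `HealthCheck`, `Probe`, and `ContainerProbes` records are plain stand-ins rather than the project's config class or a Kubernetes client type, and reusing identical settings for both probes is an assumption, not something the change in 2f1af1c is confirmed to do.

```java
import java.util.Optional;

final class ReadinessFromHealthCheckSketch {

    // Hypothetical shape of the user-facing health-check config.
    record HealthCheck(boolean enabled, String path, int port,
                       int initialDelaySeconds, int periodSeconds) {}

    // Minimal stand-in for an HTTP GET probe (path, port, timing).
    record Probe(String httpGetPath, int httpGetPort,
                 int initialDelaySeconds, int periodSeconds) {}

    record ContainerProbes(Optional<Probe> liveness, Optional<Probe> readiness) {}

    static ContainerProbes probesFrom(HealthCheck hc) {
        if (!hc.enabled()) {
            // No probes configured: a pod counts as ready once its containers start,
            // so readyReplicas only means "container started".
            return new ContainerProbes(Optional.empty(), Optional.empty());
        }
        Probe probe = new Probe(hc.path(), hc.port(),
                hc.initialDelaySeconds(), hc.periodSeconds());
        // Liveness restarts an unhealthy container; readiness gates Service
        // traffic and feeds readyReplicas, which verification reads back.
        return new ContainerProbes(Optional.of(probe), Optional.of(probe));
    }

    public static void main(String[] args) {
        HealthCheck hc = new HealthCheck(true, "/healthz", 8080, 10, 5);
        System.out.println(probesFrom(hc).readiness().isPresent()); // true
    }
}
```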

Copilot

Pull request overview

This PR adds post-deploy health verification so pipelines move through VERIFYING after Kubernetes apply and only become SUCCEEDED once the StatefulSet rollout is healthy.

Changes:

Adds VERIFYING status across backend state machine, persistence, notifications, scan job, and frontend labels/badges.
Adds Kubernetes rollout health inspection and verification timeout configuration.
Adds tests for state transitions, rollback/deploy verification behavior, and scan-job outcomes.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `src/main/java/com/github/wellch4n/oops/application/service/PipelineService.java` | Routes deploy/rollback completion through `VERIFYING` or legacy success. |
| `src/main/java/com/github/wellch4n/oops/infrastructure/scheduler/PipelineInstanceScanJob.java` | Polls verifying pipelines and marks success/error. |
| `src/main/java/com/github/wellch4n/oops/infrastructure/kubernetes/KubernetesApplicationRuntimeGateway.java` | Adds StatefulSet and pod health snapshot logic. |
| `src/main/java/com/github/wellch4n/oops/infrastructure/kubernetes/task/processor/StatefulSetProcessor.java` | Adds readiness probe generation. |
| `src/main/java/com/github/wellch4n/oops/domain/**` | Adds `VERIFYING` status and allowed transitions. |
| `src/main/java/com/github/wellch4n/oops/application/**` | Adds health DTO, gateway/repository methods, and notification type. |
| `src/main/java/com/github/wellch4n/oops/infrastructure/persistence/jpa/**` | Persists `verifyDeadline` and conditional status/deadline updates. |
| `src/main/resources/db/migration/V11__add_pipeline_verify_deadline.sql` | Adds `pipeline.verify_deadline`. |
| `src/main/java/com/github/wellch4n/oops/infrastructure/config/PipelineHealthProperties.java` | Adds `oops.pipeline.health` configuration. |
| `config/application.yml.example`, `docker/application.yml.example` | Documents health verification settings. |
| `web/**` | Adds `VERIFYING` API type, labels, badge treatment, and help docs. |
| `src/test/java/**` | Adds/updates tests for verification and rollback paths. |

Co-authored-by: Codex <codex@openai.com>

microbluey · 2026-06-01T05:13:18Z

Thanks, addressed the latest review comments.

Replaced the help-doc status value FAILED with the actual API enum ERROR.
Added deadline handling in the VERIFYING error path, so repeated getDeploymentHealth failures cannot leave a pipeline stuck indefinitely.
Centralized active deployment statuses in DeploymentConcurrencyPolicy and included VERIFYING, so deploy/manual deploy/rollback all block while a previous rollout is still being verified.

I also added coverage for the active-status set and the timeout-after-health-query-error path.
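
The deadline handling in the error path can be pictured as follows. This is a hedged sketch of the behaviour described above, not the scan job's code: `tick`, `Outcome`, and the `Supplier` standing in for the `getDeploymentHealth` call are all hypothetical.

```java
import java.time.Instant;
import java.util.function.Supplier;

final class HealthQueryErrorSketch {

    enum Outcome { SUCCEEDED, ERROR, VERIFYING }

    // One scan tick for a VERIFYING pipeline. healthDecision stands in for
    // "query rollout health and judge it"; it may throw if the query fails.
    static Outcome tick(Supplier<Outcome> healthDecision, Instant verifyDeadline, Instant now) {
        try {
            return healthDecision.get();
        } catch (RuntimeException e) {
            // A failed query is not proof of a bad rollout, so keep waiting,
            // but only until the deadline: repeated failures must not leave
            // the pipeline stuck in VERIFYING indefinitely.
            return now.isBefore(verifyDeadline) ? Outcome.VERIFYING : Outcome.ERROR;
        }
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Supplier<Outcome> failing = () -> {
            throw new IllegalStateException("cluster API unavailable");
        };
        System.out.println(tick(failing, now.plusSeconds(60), now)); // VERIFYING
        System.out.println(tick(failing, now.minusSeconds(1), now)); // ERROR
    }
}
```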

Verified the affected backend tests locally: DeploymentConcurrencyPolicyTests, PipelineVerificationScanTests, PipelineHealthVerificationTests, PipelineRollbackTests.

Result: Tests run: 14, Failures: 0, Errors: 0, Skipped: 0.

wellCh4n assigned wellCh4n and microbluey and unassigned wellCh4n May 31, 2026

wellCh4n requested changes May 31, 2026

microbluey requested a review from wellCh4n May 31, 2026 15:47

wellCh4n requested a review from Copilot June 1, 2026 03:22

Copilot started reviewing on behalf of wellCh4n June 1, 2026 03:23 View session

Copilot AI reviewed Jun 1, 2026

Comment thread web/app/help/docs/pipelines/page.tsx Outdated

Comment thread src/main/java/com/github/wellch4n/oops/infrastructure/scheduler/PipelineInstanceScanJob.java

Comment thread src/main/java/com/github/wellch4n/oops/application/service/PipelineService.java

fix(pipeline): address verifying review feedback

404ffc3

Co-authored-by: Codex <codex@openai.com>

microbluey wants to merge 3 commits into wellCh4n:main from microbluey:feat/post-deploy-health-verification
