feat(pipeline): add post-deploy health verification (VERIFYING status) #46
microbluey wants to merge 3 commits into
Deploys and rollbacks previously marked SUCCEEDED right after the K8s
server-side apply returned, before the StatefulSet rollout actually
became ready -- a bad image (ImagePullBackOff) still showed success.
Introduce a VERIFYING status between DEPLOYING and SUCCEEDED: after the
artifact is applied the pipeline enters VERIFYING (with a deadline), and
the existing 5s scan job polls the rollout to decide SUCCEEDED or ERROR.
Async + persisted, so it survives a restart during the verify window.
- State machine: DEPLOYING -> VERIFYING -> {SUCCEEDED, ERROR}; VERIFYING
is not stoppable. DEPLOYING -> SUCCEEDED kept for the disabled path.
- verify_deadline column on pipeline (V11) anchors the timeout.
- ApplicationRuntimeGateway.getDeploymentHealth + DeploymentHealth:
fail fast on ImagePullBackOff/ErrImagePull/CrashLoopBackOff, succeed on
converged rollout (observedGeneration + updated==ready==desired),
else time out.
- oops.pipeline.health.{enabled,timeout} (default on, 5m);
enabled=false reverts to immediate SUCCEEDED.
- Wired into PipelineService (deploy + rollback) and the scan job's
IMMEDIATE branch + new scanVerifyingPipelines.
- VERIFYING notification type; frontend status badge + 4-locale i18n.
- Failure does not auto-rollback: mark ERROR + notify, leave it to a human.
Tests: state-machine edges, scan-job verify decisions, deploy->VERIFYING.
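The transitions listed in this commit message can be sketched as a small lookup table. This is a minimal illustration, not the PR's actual code: `SketchStatus`, `SketchStateMachine`, `canTransition`, and the `STOPPED` status name are all hypothetical, and only the edges named above are modelled.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical names throughout: the real PipelineStatus enum in this PR
// likely has more states and a different API. STOPPED is an assumed name.
enum SketchStatus { DEPLOYING, VERIFYING, SUCCEEDED, ERROR, STOPPED }

final class SketchStateMachine {
    // Only the edges named in the commit message are modelled here.
    private static final Map<SketchStatus, Set<SketchStatus>> ALLOWED = Map.of(
        // DEPLOYING -> SUCCEEDED is kept for the health.enabled=false path.
        SketchStatus.DEPLOYING, Set.of(SketchStatus.VERIFYING, SketchStatus.SUCCEEDED),
        // VERIFYING can only resolve to SUCCEEDED or ERROR; it is not stoppable.
        SketchStatus.VERIFYING, Set.of(SketchStatus.SUCCEEDED, SketchStatus.ERROR));

    private SketchStateMachine() {}

    // Returns true only if the edge from -> to is explicitly listed above.
    static boolean canTransition(SketchStatus from, SketchStatus to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }
}
```

Under these assumptions, `canTransition(VERIFYING, STOPPED)` is rejected because the apply has already happened, while `canTransition(DEPLOYING, SUCCEEDED)` stays legal for the disabled path.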
wellCh4n
left a comment
This seems to be what Kubernetes liveness probes are designed for:
https://kubernetes.io/docs/concepts/workloads/pods/probes/#liveness-probe
The container will be restarted automatically when the probe fails.
I’d prefer extending the existing health check mechanism to support this use case rather than introducing a separate implementation.
…fication The existing health check only configured a liveness probe. Add a readiness probe built from the same HealthCheck config: readiness gates Service traffic and drives readyReplicas, which is what post-deploy verification (VERIFYING) keys off to decide a rollout is actually healthy rather than merely started. This makes the existing health-check config the single source of truth, instead of introducing a parallel notion.
Good call — I've pushed a change to build on the existing health-check mechanism rather than add a parallel one. First, to clarify what
On your point about reusing the existing mechanism — agreed, and done (
One honest caveat: when a user hasn't enabled the health check, |
Pull request overview
This PR adds post-deploy health verification so pipelines move through VERIFYING after Kubernetes apply and only become SUCCEEDED once the StatefulSet rollout is healthy.
Changes:
- Adds `VERIFYING` status across backend state machine, persistence, notifications, scan job, and frontend labels/badges.
- Adds Kubernetes rollout health inspection and verification timeout configuration.
- Adds tests for state transitions, rollback/deploy verification behavior, and scan-job outcomes.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/main/java/com/github/wellch4n/oops/application/service/PipelineService.java` | Routes deploy/rollback completion through VERIFYING or legacy success. |
| `src/main/java/com/github/wellch4n/oops/infrastructure/scheduler/PipelineInstanceScanJob.java` | Polls verifying pipelines and marks success/error. |
| `src/main/java/com/github/wellch4n/oops/infrastructure/kubernetes/KubernetesApplicationRuntimeGateway.java` | Adds StatefulSet and pod health snapshot logic. |
| `src/main/java/com/github/wellch4n/oops/infrastructure/kubernetes/task/processor/StatefulSetProcessor.java` | Adds readiness probe generation. |
| `src/main/java/com/github/wellch4n/oops/domain/**` | Adds VERIFYING status and allowed transitions. |
| `src/main/java/com/github/wellch4n/oops/application/**` | Adds health DTO, gateway/repository methods, and notification type. |
| `src/main/java/com/github/wellch4n/oops/infrastructure/persistence/jpa/**` | Persists verifyDeadline and conditional status/deadline updates. |
| `src/main/resources/db/migration/V11__add_pipeline_verify_deadline.sql` | Adds pipeline.verify_deadline. |
| `src/main/java/com/github/wellch4n/oops/infrastructure/config/PipelineHealthProperties.java` | Adds oops.pipeline.health configuration. |
| `config/application.yml.example`, `docker/application.yml.example` | Documents health verification settings. |
| `web/**` | Adds VERIFYING API type, labels, badge treatment, and help docs. |
| `src/test/java/**` | Adds/updates tests for verification and rollback paths. |
Co-authored-by: Codex <codex@openai.com>
Thanks, addressed the latest review comments.
I also added coverage for the active-status set and the timeout-after-health-query-error path. Verified the affected backend tests locally: Result: |
What
Deploys and rollbacks previously marked `SUCCEEDED` right after the K8s server-side apply returned, before the StatefulSet rollout actually became ready. A bad image (ImagePullBackOff) or pods that never start still showed success. This is the platform-wide follow-up deliberately split out of #45 (image-only rollback).
Approach
Introduce a `VERIFYING` status between `DEPLOYING` and `SUCCEEDED`. After the artifact is applied the pipeline enters `VERIFYING` (with a deadline), and the existing 5s scan job (`PipelineInstanceScanJob`) polls the StatefulSet rollout to decide `SUCCEEDED` or `ERROR`. Async + persisted, so it survives a process restart during the verify window. Covers all three deploy paths: manual deploy, rollback, and post-build auto-deploy.
Health judgement
- Fail fast: a pod's `state.waiting.reason` ∈ {ImagePullBackOff, ErrImagePull, CrashLoopBackOff} → `ERROR` (no need to wait for the timeout)
- Rollout converged (`observedGeneration` caught up + `updatedReplicas == readyReplicas == spec.replicas`) → `SUCCEEDED`
- `verify_deadline` elapsed without becoming ready → `ERROR`
- Otherwise → stay `VERIFYING`, re-check next tick
Key decisions
- `verify_deadline` column on `pipeline` (does NOT touch `BaseDataObject`), Flyway `V11`.
- `VERIFYING` is not stoppable (apply already done); the state machine only allows `VERIFYING → {SUCCEEDED, ERROR}`.
- `oops.pipeline.health.{enabled,timeout}` (default on, 5m); `enabled=false` reverts to immediate `SUCCEEDED` (the `DEPLOYING → SUCCEEDED` edge is kept for this path).
Backend
- `Pipeline` / `PipelineStatus`; `verify_deadline` column + `updateStatusAndDeadlineIfMatch`.
- `ApplicationRuntimeGateway.getDeploymentHealth` + `DeploymentHealth` DTO (reads StatefulSet rollout status + fatal pod waiting reasons).
- `PipelineHealthProperties` + both `application.yml.example` samples; `VERIFYING` notification type.
- Wired into `PipelineService` (deploy + rollback share a `completeDeployPhase` helper) and the scan job's IMMEDIATE branch + new `scanVerifyingPipelines`.
Frontend
`VERIFYING` added to the `PipelineStatus` union, treated as an in-progress (primary) badge, 4-locale i18n, help docs. Stop and rollback buttons intentionally not enabled for `VERIFYING`.
Tests
- `PipelineStateMachineTests` (+3 VERIFYING edges), new `PipelineVerificationScanTests` (4: success / fail-fast / timeout / in-progress), new `PipelineHealthVerificationTests` (deploy → VERIFYING), `PipelineRollbackTests` (constructor updated, still green).
- `./mvnw test`. The single `OopsApplicationTests.contextLoads` error is the pre-existing no-DB smoke test, unrelated to this change.
- `pnpm build` passes.
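For reviewers, the health judgement described in this PR might be sketched as a pure decision function evaluated on each scan tick. This is an illustration under stated assumptions, not the PR's implementation: `RolloutSnapshot`, `Verdict`, and `HealthJudge` are hypothetical names, the snapshot fields only mirror the StatefulSet values named in the description, and the check order (fail fast, then converged, then deadline) is inferred from the order in which the rules are listed.

```java
import java.time.Instant;
import java.util.Set;

// Hypothetical stand-in for the PR's DeploymentHealth DTO.
final class RolloutSnapshot {
    final long generation;          // metadata.generation
    final long observedGeneration;  // status.observedGeneration
    final int desiredReplicas;      // spec.replicas
    final int updatedReplicas;      // status.updatedReplicas
    final int readyReplicas;        // status.readyReplicas
    final Set<String> waitingReasons; // state.waiting.reason across the pods

    RolloutSnapshot(long generation, long observedGeneration, int desiredReplicas,
                    int updatedReplicas, int readyReplicas, Set<String> waitingReasons) {
        this.generation = generation;
        this.observedGeneration = observedGeneration;
        this.desiredReplicas = desiredReplicas;
        this.updatedReplicas = updatedReplicas;
        this.readyReplicas = readyReplicas;
        this.waitingReasons = waitingReasons;
    }
}

// Hypothetical outcome of one scan tick.
enum Verdict { SUCCEEDED, ERROR, STILL_VERIFYING }

final class HealthJudge {
    private static final Set<String> FATAL_REASONS =
        Set.of("ImagePullBackOff", "ErrImagePull", "CrashLoopBackOff");

    private HealthJudge() {}

    static Verdict judge(RolloutSnapshot s, Instant verifyDeadline, Instant now) {
        // 1. Fail fast: a fatal waiting reason ends verification immediately.
        for (String reason : s.waitingReasons) {
            if (FATAL_REASONS.contains(reason)) {
                return Verdict.ERROR;
            }
        }
        // 2. Converged: controller has seen the latest spec and every replica is updated and ready.
        boolean converged = s.observedGeneration >= s.generation
            && s.updatedReplicas == s.desiredReplicas
            && s.readyReplicas == s.desiredReplicas;
        if (converged) {
            return Verdict.SUCCEEDED;
        }
        // 3. Timeout: the deadline has passed without the rollout becoming ready.
        if (!now.isBefore(verifyDeadline)) {
            return Verdict.ERROR;
        }
        // 4. Otherwise keep the pipeline in VERIFYING and re-check on the next tick.
        return Verdict.STILL_VERIFYING;
    }
}
```

Keeping the decision free of Kubernetes and database calls is one way to make the four scan outcomes (success, fail-fast, timeout, in-progress) unit-testable without a cluster.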