ci(deploy-cp): run STONITH verify last, after the agent relaunch #135

Merged
posix4e merged 1 commit into main from ci/stonith-verify-last
Apr 18, 2026


@posix4e posix4e commented Apr 18, 2026

Summary

  • Moves the `Verify STONITH halted prior VM(s) in this env` step to be the last step in `deploy-cp.yml`, after the `Relaunch dd-local-{env}` cascade.
  • The relaunch itself triggers a second wave of STONITH (old agent re-registers → old CF tunnel is deleted → old CP's self-watchdog powers off). Running the verification after the relaunch captures that wave; the current position observes only the register-time kill.
  • User-facing deliverables (PR comment, relaunched local agent) now land before the slowest and flakiest check (24×5s loop plus a fallback `gcloud delete`).
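
For readers unfamiliar with the check being moved, here is a minimal Python sketch of a "poll 24 times at 5-second intervals, then fall back to a forced delete" loop. It is not the workflow's actual implementation: the function name `verify_stonith` and the callables `list_running_old_vms` and `force_delete` are hypothetical stand-ins for whatever the step really runs (the fallback in the workflow is a `gcloud delete`).

```python
import time
from typing import Callable, List


def verify_stonith(
    list_running_old_vms: Callable[[], List[str]],  # hypothetical: names of prior VMs still running
    force_delete: Callable[[str], None],            # hypothetical: stands in for the fallback delete
    attempts: int = 24,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until no prior VM is running; force-delete any that outlive the loop.

    Returns the names of the VMs that needed the fallback (empty if STONITH
    halted everything within the polling window).
    """
    remaining: List[str] = []
    for attempt in range(attempts):
        remaining = list_running_old_vms()
        if not remaining:
            return []  # every prior VM has halted
        if attempt < attempts - 1:
            sleep(interval_s)  # wait before the next poll
    # The polling window is exhausted: fall back to deleting the stragglers.
    for vm in remaining:
        force_delete(vm)
    return remaining
```

With the defaults, the worst case waits 23 × 5 s between polls before the fallback runs, which is why the step is better placed after the user-facing deliverables.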

No behavior change on the happy path.

Scope boundary

If the relaunch fails, STONITH verification no longer runs (previously it did). This is acceptable: old VMs already self-terminate via the CF-tunnel-delete trigger at register time, and a relaunch failure needs operator attention anyway.

Test plan

  • YAML parses locally (`python3 -c 'import yaml; yaml.safe_load(open(...))'`) — ✅
  • Next deploy-preview run on this PR shows STONITH verification step logs appearing after the Relaunch step logs
  • Overall deploy-preview run still passes end-to-end
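
The ordering claim could also be checked statically, before waiting for a deploy-preview run. The sketch below is an assumption-laden illustration, not part of this PR: it takes the already-parsed `steps` list of a job and matches step names by the prefixes quoted in the summary. The helper name and the example step names other than those two are hypothetical.

```python
from typing import Any, Dict, List

# Prefixes taken from the step names quoted in the PR summary.
VERIFY_PREFIX = "Verify STONITH"
RELAUNCH_PREFIX = "Relaunch dd-local-"


def verify_is_last_after_relaunch(steps: List[Dict[str, Any]]) -> bool:
    """True if exactly one verify step exists, it is the final step,
    and every relaunch step comes before it."""
    names = [str(step.get("name", "")) for step in steps]
    relaunch = [i for i, n in enumerate(names) if n.startswith(RELAUNCH_PREFIX)]
    verify = [i for i, n in enumerate(names) if n.startswith(VERIFY_PREFIX)]
    if len(verify) != 1 or not relaunch:
        return False
    return verify[0] == len(names) - 1 and max(relaunch) < verify[0]


# Hypothetical step list shaped like a parsed workflow job.
example_steps = [
    {"name": "Deploy CP"},                      # hypothetical
    {"name": "Comment on PR"},                  # hypothetical
    {"name": "Relaunch dd-local-{env}"},
    {"name": "Verify STONITH halted prior VM(s) in this env"},
]
print(verify_is_last_after_relaunch(example_steps))
```

In practice the `steps` list would come from the `yaml.safe_load` result for `deploy-cp.yml`; the job id to index into is not given in this PR, so it is left out here.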

🤖 Generated with Claude Code

The relaunch cascade itself triggers a second wave of STONITH activity:
the tdx2 agent re-registers with the fresh CP, its old CF tunnel is
deleted, and the old CP's self-watchdog then powers off. Verifying after
the relaunch captures that wave — the previous position verified only
the kill that happens when the new CP first registers its own tunnel.

Also reorders user-facing deliverables (PR comment, local agent back
online) ahead of the slowest + flakiest check (24×5s loop with a
fallback `gcloud delete`), so those land first.

No behavior change on the happy path. If relaunch fails, the STONITH
verification no longer runs — but operator attention is already needed
in that case, and the old VMs self-terminate via the CF-tunnel-delete
trigger at register time either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

DD preview ready

URL: https://pr-135.devopsdefender.com

Browser login: paste the output of `gh auth token` at https://pr-135.devopsdefender.com/auth/pat

CLI / curl: `curl -H "Authorization: Bearer $(gh auth token)" https://pr-135.devopsdefender.com/`

Register endpoint for a local agent: wss://pr-135.devopsdefender.com/register

@posix4e posix4e merged commit 2b1ab20 into main Apr 18, 2026
4 checks passed
@posix4e posix4e deleted the ci/stonith-verify-last branch April 18, 2026 21:57