ci: fold every deploy path into Release, drop scripts/ #131
Closed
posix4e wants to merge 1 commit into
Release now owns the full lifecycle: build → deploy-preview (PR) OR
deploy-production (main / dispatch) → relaunch-agent (blocking) →
verify agent re-registered with CP. A release is "done" only when the
local dd-local-{kind} VM is back online talking to the freshly-deployed
CP — that's the signal that tells us a PR is safe to merge or a merge
actually shipped.
Deleted:
.github/workflows/production-deploy.yml — folded into release.yml
as a deploy-production job with same deploy-cp.yml body.
.github/workflows/local-agents.yml — manual-dispatch path gone;
push a commit to trigger a relaunch via the cascade.
Deleted scripts/:
scripts/gcp-deploy.sh — inlined into deploy-cp.yml.
scripts/dd-relaunch.sh → apps/_infra/dd-relaunch.sh (host-side).
scripts/local-agents.sh → apps/_infra/local-agents.sh (host-side).
scripts/workloads.sh — dead after inline; only gcp-deploy
sourced it and local-agents.sh built
workloads via inline jq anyway.
scripts/redeploy-workload.sh — unused helper, removed.
deploy-cp.yml's Relaunch step drops `continue-on-error: true`; the
relaunch-agent composite gains a "Verify agent registered with CP"
step that polls /api/agents for a freshly-registered dd-local-{kind}
entry with a 5-min budget.
Concurrency on release.yml becomes expression-driven: PR pushes cancel
in-progress runs; main / tag / dispatch queue so an in-flight prod
deploy finishes cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
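A minimal sketch of what the "Verify agent registered with CP" poll could look like, assuming `/api/agents` returns a JSON array of objects with a `name` field. The response shape, the `FETCH_CMD` hook, `CP_URL`, and the function names are all hypothetical and not taken from the composite action; the check that the entry is freshly registered (rather than stale) is left out because the PR does not say how freshness is determined.

```shell
#!/usr/bin/env bash
# Sketch: block until an agent whose name starts with a given prefix shows up
# in the CP's agent list, or give up once the time budget is spent.
# ASSUMPTIONS (hypothetical): JSON array of objects with a "name" field,
# the FETCH_CMD hook, and the CP_URL variable.

# Default fetcher: plain HTTP GET against the control plane.
fetch_agents_http() {
  curl -fsS "${CP_URL:?set CP_URL}/api/agents"
}

# FETCH_CMD names the command that prints the agent list on stdout.
# Override it to exercise the loop without a network.
FETCH_CMD=${FETCH_CMD:-fetch_agents_http}

# wait_for_agent PREFIX BUDGET_SECONDS INTERVAL_SECONDS
wait_for_agent() {
  local prefix=$1 budget=$2 interval=$3
  local deadline=$(( SECONDS + budget ))
  while :; do
    # grep keeps the sketch dependency-free; a real step would likely use jq.
    if "$FETCH_CMD" 2>/dev/null |
        grep -q "\"name\"[[:space:]]*:[[:space:]]*\"${prefix}"; then
      echo "registered: ${prefix}"
      return 0
    fi
    if (( SECONDS >= deadline )); then
      echo "timed out after ${budget}s waiting for ${prefix}" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

With the 5-min budget described above, a call might look like `CP_URL=https://pr-131.devopsdefender.com wait_for_agent dd-local-preview 300 10`; the 10-second interval is an arbitrary choice.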
DD preview ready
URL: https://pr-131.devopsdefender.com
Browser login: paste
CLI / curl:
Register endpoint for a local agent:
posix4e
added a commit
that referenced
this pull request
Apr 18, 2026
Audit via gcloud revealed dd-pr-121-1776434711 running in staging a day after PR #121 merged (branch not deleted → pr-teardown.yml never fired; cleanup.yml only reaps TERMINATED, not RUNNING). Closes that gap and tidies what audit found along the way.

Changes:
- Delete .github/workflows/retire-staging.yml. One-shot to retire the legacy app-staging.{domain} env; has already been run, no dd_env=staging VMs exist. Header itself said "delete after use".
- cleanup.yml: drop reap-staging job (filters dd_env=staging, dead).
- cleanup.yml: drop "Production Deploy" from workflow_run trigger (that workflow is being removed in #131 / follow-ups). Release alone now covers both preview and prod post-deploy hooks.
- cleanup.yml: add reap-merged-pr-previews. For each pr-N env with a RUNNING VM, resolve the PR's state via gh; if MERGED/CLOSED, tear down VM + CF tunnel + DNS CNAME like pr-teardown.yml would have. Scheduled sweep + workflow_run + dispatch.

Deferred (blocked on easyenclave): image-family split preview→staging / prod→stable. easyenclave has no easyenclave-stable GCP family and no qcow2 on v0.1.14 today, so there's nothing to point prod at. Ship the split once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
posix4e
added a commit
that referenced
this pull request
Apr 18, 2026
Consolidates PRs #127, #131, #132 into one commit. Net -381 lines, one "Release" workflow drives the whole fleet lifecycle.

Workload spec: apps/<name>/workload.{json,json.tmpl} becomes the single source of truth for every EE workload (cloudflared, dd-agent, dd-management, ollama, openclaw, nv, mount-models, podman-static, podman-bootstrap). Boot-time (config.iso / ee-config metadata) and runtime (/deploy) both bake from the same file.

Workflow topology: release.yml is the one entry point.

  pull_request      → build → deploy-preview → dd-local-preview relaunch
  push main         → build → deploy-production → dd-local-prod relaunch
  push v*           → build only (versioned artifact, no deploy)
  workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths call, so preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies health + /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades agent relaunch. The cascade is blocking: a release goes green only when the matching dd-local-{kind} VM re-registers with the freshly-deployed CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite action that SSHes into tdx2, runs dd-relaunch.sh, and polls /api/agents for the freshly-registered entry (5-min budget).

Deleted:
- .github/workflows/production-deploy.yml (folded into release.yml as deploy-production job).
- .github/workflows/local-agents.yml (manual-dispatch path gone; push a commit to trigger a relaunch).
- .github/workflows/retire-staging.yml (one-shot, already run, no dd_env=staging VMs exist).
- scripts/ entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh deleted as unused after the refactor.

Cleanup audit: gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a day after PR #121 merged (branch not deleted, so pr-teardown.yml never fired and cleanup.yml only reaps TERMINATED). Added a reap-merged-pr-previews job to cleanup.yml that resolves each RUNNING pr-N VM's PR state via gh and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead), trimmed workflow_run trigger (Production Deploy is gone).

Deferred (blocked on easyenclave): easyenclave image family split (preview → easyenclave-staging, prod → easyenclave-stable). easyenclave has no stable GCP family and no qcow2 on v0.1.14 today, so nothing to point prod at. Plus tdx2 auto-refresh of the base qcow2 from easyenclave releases. Follow-up PR once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
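The reap-merged-pr-previews decision described above can be sketched roughly as follows. This is an illustration under assumptions, not the job's actual body: the `name env` input lines, the `pr-<N>` env format, and the `PR_STATE_CMD` / `TEARDOWN_CMD` hooks are hypothetical. Only `gh pr view --json state --jq .state` is a real CLI call; the teardown here is a placeholder that prints instead of deleting anything.

```shell
#!/usr/bin/env bash
# Sketch: decide which RUNNING preview VMs to reap based on their PR's state.
# ASSUMPTIONS (hypothetical): input is "vm_name env" per line, preview envs
# are named pr-<N>, and the two *_CMD hooks exist only for this sketch.

# Prints the PR state (OPEN, MERGED or CLOSED) using the GitHub CLI.
pr_state_gh() { gh pr view "$1" --json state --jq .state; }

# Placeholder teardown. The real job also removes the CF tunnel and DNS CNAME.
teardown_echo() { echo "would tear down $1 ($2)"; }

PR_STATE_CMD=${PR_STATE_CMD:-pr_state_gh}
TEARDOWN_CMD=${TEARDOWN_CMD:-teardown_echo}

# Reads "vm_name env" pairs on stdin, one RUNNING VM per line.
reap_merged_previews() {
  local name env number state
  while read -r name env; do
    # Skip anything that is not a pr-<N> preview env.
    [[ $env =~ ^pr-([0-9]+)$ ]] || continue
    number=${BASH_REMATCH[1]}
    # </dev/null keeps the lookup from eating the loop's stdin.
    state=$("$PR_STATE_CMD" "$number" </dev/null) || continue
    case $state in
      MERGED|CLOSED) "$TEARDOWN_CMD" "$name" "$env" ;;
      *) echo "keeping $name (PR #$number is $state)" ;;
    esac
  done
}
```

The input could come from something like `gcloud compute instances list --filter='status=RUNNING' --format='value(name,labels.dd_env)'`, assuming preview VMs carry a `dd_env=pr-N` label; the PR only confirms that label name for the retired staging env.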
Member
Author
Folded into #132 (single-commit consolidation).
posix4e
added a commit
that referenced
this pull request
Apr 18, 2026
…ollama+openclaw

Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet path runs through one workflow, every agent VM boots the full container stack, and audit-surfaced cleanup gaps are closed.

## Workload spec: single source of truth

apps/<name>/workload.{json,json.tmpl} is the one place a workload is defined. Boot-time (config.iso) and runtime (/deploy) bake from the same file. New apps/README.md documents the schema, the lifecycle matrix (CP / preview agent / prod agent), ordering via `until` polling, and a "deploying your own" walkthrough. Main README points at it.

The bake helper (envsubst + jq strip-empty-env-entries) is inlined in two places, .github/workflows/deploy-cp.yml (CI, CP workloads) and apps/_infra/local-agents.sh (tdx2, agent workloads), and is restricted to the uppercase ${VAR} refs each template declares so shell locals ($i, $((…))) in cmd strings aren't eaten.

## Workflow topology: one entry point

release.yml drives everything:

  pull_request      → build → deploy-preview → dd-local-preview relaunch
  push main         → build → deploy-production → dd-local-prod relaunch
  push v*           → build only (versioned artifact, no deploy)
  workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths call. Preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades the agent relaunch, blocking on the agent re-registering with the CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite that SSHes into tdx2, runs apps/_infra/dd-relaunch.sh, polls /api/agents for the freshly-registered entry (5-min budget).

Retired workflows:

  production-deploy.yml → folded into release.yml as deploy-production
  local-agents.yml      → no remaining purpose (cascade handles relaunch)
  retire-staging.yml    → one-shot, already run

scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused.

## Container stack on agent VMs

dd-local-{preview,prod} now boot the full chain:

  nv (prod only) → mount-models → podman-static → podman-bootstrap
    → ollama (prod.json with GPU devices / preview.json CPU only)
    → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview)
    → cloudflared → dd-agent

podman-bootstrap installs the wrapper as `podman` (not `dd-podman`), so bare `podman ps` from a guest shell reaches the right storage root instead of failing with `mkdir /var/lib/containers: read-only file system`. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat.

## Cleanup audit

gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121 merged (branch not deleted → pr-teardown.yml never fired; cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews: for each RUNNING pr-N VM, gh-resolves the PR state and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead: no dd_env=staging VMs exist), trimmed the workflow_run trigger (Production Deploy is gone).

## Deferred (blocked on easyenclave)

Preview → easyenclave-staging / prod → easyenclave-stable image-family split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave publishes stable images with qcow2 + a stable GCP image family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
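For readers unfamiliar with the restricted-envsubst trick behind the bake helper, here is a rough sketch of the idea. It is not the inlined helper itself: the function name `bake`, the template shape (an `env` array of `KEY=VALUE` strings), and the jq filter are assumptions made for illustration. The one thing taken as given is that `envsubst` accepts a SHELL-FORMAT argument listing the only variables it may substitute.

```shell
#!/usr/bin/env bash
# Sketch of a "bake" step: substitute only the uppercase ${VAR} references a
# template declares, then drop env entries that ended up empty.
# ASSUMPTIONS (hypothetical): the function name, and a template whose "env"
# field is an array of "KEY=VALUE" strings.

# bake TEMPLATE_FILE: prints the baked JSON on stdout.
bake() {
  local tmpl=$1 vars
  # Collect the distinct uppercase ${VAR} references present in the template.
  vars=$(grep -oE '\$\{[A-Z_][A-Z0-9_]*\}' "$tmpl" | sort -u | tr '\n' ' ')
  # Passing that list as SHELL-FORMAT makes envsubst leave everything else
  # (shell locals such as $i inside cmd strings) untouched.
  envsubst "$vars" < "$tmpl" |
    # Entries ending in "=" had an unset or empty variable; strip them.
    jq 'if has("env") then .env |= map(select(test("=$") | not)) else . end'
}
```

For example, a template containing `"cmd": "for i in 1 2; do echo $i; done"` and `"env": ["TOKEN=${DD_TOKEN}"]` keeps `$i` verbatim, while `${DD_TOKEN}` is filled from the environment or, if unset, its entry is dropped. `DD_TOKEN` is a made-up variable name.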
posix4e
added a commit
that referenced
this pull request
Apr 18, 2026
…ollama+openclaw (#132)
## Summary

One workflow, `release.yml`, owns the full lifecycle now:
- `build` → `deploy-preview` (`deploy-cp.yml`, preview inputs) → blocking relaunch of `dd-local-preview`
- `build` → `deploy-production` (`deploy-cp.yml`, prod inputs) → blocking relaunch of `dd-local-prod`
- `deploy-production` with rollback-pick `release_tag` input
- `build` only (versioned artifact, no deploy)

A release goes green only when the matching `dd-local-{kind}` VM re-registers with the freshly-deployed CP within 5 minutes. That's the user's gate: before I merge, I want to see the local agent deployment worked.

Deleted:
- `production-deploy.yml`: folded into `release.yml` as a `deploy-production` job.
- `local-agents.yml`: manual-dispatch path gone; push a commit to trigger a relaunch via the cascade.
- `scripts/`: `gcp-deploy.sh` inlined into `deploy-cp.yml` (the `bake`/`join` helpers become a ~10-line inline envsubst + jq pipeline). `dd-relaunch.sh` and `local-agents.sh` moved to `apps/_infra/` (host-side scripts, consumed by the tdx2 host via git checkout). `workloads.sh` + `redeploy-workload.sh` deleted as unused.

`.github/actions/relaunch-agent/action.yml` gains a blocking `Verify agent registered with CP` step (polls `/api/agents` for 5 min); `deploy-cp.yml`'s Relaunch step drops `continue-on-error: true`.

Net: -235 lines, three files gone, one signal that actually matches intent.

## Test plan

- `build` → `deploy-preview` → dd-local-preview relaunches and re-registers → PR goes green.
- `build` → `deploy-production` → dd-local-prod relaunches and re-registers.
- `workflow_dispatch` with `release_tag: v0.2.0` → `deploy-production` deploys that tag; dd-local-prod re-registers.
- `ls scripts/` fails; `ls apps/_infra/` shows `dd-relaunch.sh` + `local-agents.sh`.

Stacked under #127; retarget to main after #127 merges.
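The last test-plan item is easy to script. A possible sketch, assuming it runs from any checkout of the branch; the function name `check_layout` is made up for this example:

```shell
#!/usr/bin/env bash
# Sketch: succeed only if scripts/ is gone and both host-side scripts exist
# under apps/_infra/. The function name is hypothetical.

# check_layout REPO_ROOT
check_layout() {
  local root=$1 f
  if [ -e "$root/scripts" ]; then
    echo "scripts/ still exists" >&2
    return 1
  fi
  for f in dd-relaunch.sh local-agents.sh; do
    if [ ! -f "$root/apps/_infra/$f" ]; then
      echo "missing apps/_infra/$f" >&2
      return 1
    fi
  done
  echo "layout ok"
}
```

Run it as `check_layout .` from the repository root.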
🤖 Generated with Claude Code