ci: unify workloads + collapse deploy paths into Release + cleanup audit #132
Merged
ffd015b to eaf40d3
posix4e added a commit that referenced this pull request Apr 18, 2026
Consolidates PRs #127, #131, #132 into one commit. Net -381 lines; one "Release" workflow drives the whole fleet lifecycle.

Workload spec: apps/<name>/workload.{json,json.tmpl} becomes the single source of truth for every EE workload (cloudflared, dd-agent, dd-management, ollama, openclaw, nv, mount-models, podman-static, podman-bootstrap). Boot-time (config.iso / ee-config metadata) and runtime (/deploy) both bake from the same file.

Workflow topology: release.yml is the one entry point.

  pull_request      → build → deploy-preview → dd-local-preview relaunch
  push main         → build → deploy-production → dd-local-prod relaunch
  push v*           → build only (versioned artifact, no deploy)
  workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths call, so preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies health + /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades agent relaunch. The cascade is blocking: a release goes green only when the matching dd-local-{kind} VM re-registers with the freshly-deployed CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite action that SSHes into tdx2, runs dd-relaunch.sh, and polls /api/agents for the freshly-registered entry (5-min budget).

Deleted:

- .github/workflows/production-deploy.yml (folded into release.yml as deploy-production job).
- .github/workflows/local-agents.yml (manual-dispatch path gone; push a commit to trigger a relaunch).
- .github/workflows/retire-staging.yml (one-shot, already run, no dd_env=staging VMs exist).
- scripts/ entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh deleted as unused after the refactor.

Cleanup audit: gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a day after PR #121 merged. The branch was not deleted, so pr-teardown.yml never fired, and cleanup.yml only reaps TERMINATED. Added a reap-merged-pr-previews job to cleanup.yml that resolves each RUNNING pr-N VM's PR state via gh and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead), trimmed workflow_run trigger (Production Deploy is gone).

Deferred (blocked on easyenclave): easyenclave image family split (preview → easyenclave-staging, prod → easyenclave-stable), plus tdx2 auto-refresh of the base qcow2 from easyenclave releases. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today, so nothing to point prod at. Follow-up PR once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
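The reap rule from the cleanup audit can be sketched as a pure decision function. This is a minimal illustration, not the job in cleanup.yml: the VM name pattern is inferred from the single example `dd-pr-121-1776434711`, and `Vm`, `pr_number`, and `vms_to_reap` are hypothetical names. The PR states are the ones `gh pr view --json state` reports (OPEN, CLOSED, MERGED).

```python
import re
from dataclasses import dataclass

# Assumed naming scheme, inferred from "dd-pr-121-1776434711":
# dd-pr-<PR number>-<numeric suffix>.
PR_VM_NAME = re.compile(r"^dd-pr-(\d+)-\d+$")

# PR states for which a preview VM should no longer exist.
REAPABLE_STATES = {"MERGED", "CLOSED"}


@dataclass(frozen=True)
class Vm:
    name: str
    status: str  # instance status, e.g. "RUNNING" or "TERMINATED"


def pr_number(vm_name):
    """Return the PR number encoded in a preview VM name, or None."""
    match = PR_VM_NAME.match(vm_name)
    return int(match.group(1)) if match else None


def vms_to_reap(vms, pr_state):
    """Select RUNNING pr-N VMs whose PR is merged or closed.

    pr_state is a callable mapping a PR number to its state string;
    in CI it would wrap a gh lookup (hypothetical wiring).
    """
    doomed = []
    for vm in vms:
        number = pr_number(vm.name)
        if vm.status != "RUNNING" or number is None:
            continue  # not a live preview VM; other jobs handle it
        if pr_state(number) in REAPABLE_STATES:
            doomed.append(vm.name)
    return doomed
```

Keeping the decision separate from the teardown (VM + CF tunnel + DNS CNAME) makes the "leaves open PRs alone" property easy to check without touching GCP.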
This was referenced Apr 18, 2026
DD preview ready
URL: https://pr-132.devopsdefender.com
Browser login: paste
CLI / curl:
Register endpoint for a local agent:
5 tasks
…ollama+openclaw

Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet path runs through one workflow, every agent VM boots the full container stack, and audit-surfaced cleanup gaps are closed.

## Workload spec: single source of truth

apps/<name>/workload.{json,json.tmpl} is the one place a workload is defined. Boot-time (config.iso) and runtime (/deploy) bake from the same file. New apps/README.md documents the schema, the lifecycle matrix (CP / preview agent / prod agent), ordering via `until` polling, and a "deploying your own" walkthrough. Main README points at it.

The bake helper (envsubst + jq strip-empty-env-entries) is inlined in two places: .github/workflows/deploy-cp.yml (CI, CP workloads) and apps/_infra/local-agents.sh (tdx2, agent workloads). It is restricted to the uppercase ${VAR} refs each template declares, so shell locals ($i, $((…))) in cmd strings aren't eaten.

## Workflow topology: one entry point

release.yml drives everything:

  pull_request      → build → deploy-preview → dd-local-preview relaunch
  push main         → build → deploy-production → dd-local-prod relaunch
  push v*           → build only (versioned artifact, no deploy)
  workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths call. Preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades the agent relaunch, blocking on the agent re-registering with the CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite that SSHes into tdx2, runs apps/_infra/dd-relaunch.sh, and polls /api/agents for the freshly-registered entry (5-min budget).

Retired workflows:

  production-deploy.yml → folded into release.yml as deploy-production
  local-agents.yml      → no remaining purpose (cascade handles relaunch)
  retire-staging.yml    → one-shot, already run

scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused.

## Container stack on agent VMs

dd-local-{preview,prod} now boot the full chain:

  nv (prod only)
    → mount-models
    → podman-static
    → podman-bootstrap
    → ollama (prod.json with GPU devices / preview.json CPU only)
    → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview)
    → cloudflared
    → dd-agent

podman-bootstrap installs the wrapper as `podman` (not `dd-podman`), so bare `podman ps` from a guest shell reaches the right storage root instead of failing with `mkdir /var/lib/containers: read-only file system`. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat.

## Cleanup audit

gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121 merged (branch not deleted → pr-teardown.yml never fired; cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews: for each RUNNING pr-N VM, gh-resolves the PR state and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead: no dd_env=staging VMs exist), trimmed the workflow_run trigger (Production Deploy is gone).

## Deferred (blocked on easyenclave)

Preview → easyenclave-staging / prod → easyenclave-stable image-family split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave publishes stable images with qcow2 + a stable GCP image family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
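The bake step described in the commit message above (substitute only uppercase `${VAR}` refs, then strip empty env entries) can be illustrated in Python. This is a sketch of the idea rather than the inlined envsubst + jq helper. The workload schema is an assumption: it treats `env` as a list of `"KEY=value"` strings, and `bake`, `VAR_REF`, and the sample variable names are hypothetical.

```python
import json
import re

# Matches only ${UPPERCASE_NAME}; shell locals such as $i or $((i+1))
# never match, so cmd strings pass through untouched.
VAR_REF = re.compile(r"\$\{([A-Z][A-Z0-9_]*)\}")


def bake(template, env):
    """Render a workload template and drop env entries left empty.

    Assumes (not confirmed by the source) that the spec carries an
    optional "env" list of "KEY=value" strings.
    """
    # Unset variables render as the empty string, like envsubst does.
    rendered = VAR_REF.sub(lambda m: env.get(m.group(1), ""), template)
    spec = json.loads(rendered)
    if "env" in spec:
        # Keep only entries that still have a value after "=".
        spec["env"] = [e for e in spec["env"] if e.partition("=")[2] != ""]
    return spec


template = """
{
  "name": "ollama",
  "cmd": ["sh", "-c", "for i in 1 2 3; do echo $i $((i+1)); done"],
  "env": ["MODEL=${MODEL}", "EXTRA_FLAG=${EXTRA_FLAG}"]
}
"""
spec = bake(template, {"MODEL": "qwen2.5:7b"})
print(spec["env"])
```

Like plain text substitution in general, this does not JSON-escape the substituted values, so a value containing a double quote or backslash would produce invalid JSON.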
eaf40d3 to 19f52b7
One consolidated PR replacing #127, #131, #132, #133. One commit. Everything below in one merge.
## Scope

### Workload spec: single source of truth

`apps/<name>/workload.{json,json.tmpl}` is the one place a workload is defined. Boot-time (config.iso) and runtime (/deploy) bake from the same file. New `apps/README.md` documents the schema, lifecycle matrix (CP / preview agent / prod agent), ordering via `until` polling, and a "deploying your own" walkthrough. Main README points at it.

### Workflow topology: one entry point

`release.yml` drives everything:

- `pull_request` → build → `deploy-preview` → `dd-local-preview` relaunch
- `push main` → build → `deploy-production` → `dd-local-prod` relaunch
- `push v*` → build only
- `workflow_dispatch` → build → `deploy-production` (rollback via `release_tag` input)

`.github/workflows/deploy-cp.yml` is the reusable workflow both paths call. Preview CI exercises the exact code prod uses. It provisions the CP, verifies `/cp/attest` MRTD + dashboard + STONITH, comments on PR, and cascades the agent relaunch, blocking on the agent re-registering with the CP.

Retired: `production-deploy.yml` (folded in), `local-agents.yml` (unused), `retire-staging.yml` (one-shot, done). `scripts/` deleted entirely (`gcp-deploy.sh` inlined into deploy-cp.yml; `dd-relaunch.sh` + `local-agents.sh` → `apps/_infra/`).

### Container stack on agent VMs

`dd-local-{preview,prod}` now boot the full chain: `nv` (prod only) → `mount-models` → `podman-static` → `podman-bootstrap` → `ollama` (GPU prod / CPU preview) → `openclaw` (qwen2.5:7b prod, qwen2.5:0.5b preview) → `cloudflared` → `dd-agent`. `podman-bootstrap` installs the wrapper as `podman` (not `dd-podman`), so bare `podman ps` in a guest shell works.

### Cleanup audit

`gcloud` survey found `dd-pr-121-1776434711` RUNNING in staging a day after PR #121 merged. Added `reap-merged-pr-previews` that resolves each RUNNING pr-N VM's PR state and tears down VM + CF tunnel + DNS when MERGED/CLOSED. Dropped `reap-staging` (dead), trimmed `workflow_run` trigger.

### Deferred

Preview → `easyenclave-staging` / prod → `easyenclave-stable` image-family split + tdx2 auto-refresh. Blocked on easyenclave publishing stable qcow2 assets.

## Test plan

- … `/api/agents` for dd-local-preview).
- `https://pr-132.devopsdefender.com/` returns 200 with Bearer PAT.
- `podman ps` in a dd-local-{kind} guest returns the ollama container (no read-only-fs error).
- `deploy-production` cascade provisions prod CP + relaunches dd-local-prod with qwen2.5:7b on H100.
- … `dd-pr-121`; leaves open PRs alone.

Replaces #127, #131, #132, #133.
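The blocking cascade (poll `/api/agents` until the relaunched agent re-registers, within the 5-minute budget the commit message mentions) can be sketched as a bounded polling loop. This is an illustration under assumptions, not the relaunch-agent action itself: `wait_for_agent`, `fetch_agents`, `is_fresh`, and the 10-second interval are hypothetical, and the shape of an agent entry is not specified by this PR.

```python
import time


def wait_for_agent(fetch_agents, is_fresh, budget_s=300.0, interval_s=10.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until an agent entry satisfies is_fresh, or the budget runs out.

    fetch_agents: callable returning the current list of agent entries
                  (in CI it would wrap a GET of /api/agents; hypothetical).
    is_fresh:     predicate picking the freshly-registered entry.
    clock/sleep:  injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + budget_s
    while True:
        for agent in fetch_agents():
            if is_fresh(agent):
                return agent
        # Stop if the next poll would land past the deadline.
        if clock() + interval_s > deadline:
            raise TimeoutError(f"agent did not register within {budget_s}s")
        sleep(interval_s)
```

Raising on timeout rather than returning `None` fits the stated intent that a release goes green only when the agent has actually re-registered: an unhandled exception fails the job.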
🤖 Generated with Claude Code