ci: fold every deploy path into Release, drop scripts/ by posix4e · Pull Request #131 · devopsdefender/dd

posix4e · 2026-04-18T17:47:05Z

Summary

One workflow, release.yml, owns the full lifecycle now:

PR → build → deploy-preview (deploy-cp.yml, preview inputs) → blocking relaunch of dd-local-preview
push main → build → deploy-production (deploy-cp.yml, prod inputs) → blocking relaunch of dd-local-prod
workflow_dispatch → deploy-production, with a release_tag input for picking a rollback target
push v* → build only (versioned artifact, no deploy)
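As a rough illustration of the trigger table above, the routing can be modeled as a small pure function. This is only a sketch: `plan_jobs` and the job labels are hypothetical names, not code from release.yml, and the dispatch path follows the list above (the later consolidated commit messages list a build step before deploy-production).

```python
from typing import List


def plan_jobs(event: str, ref: str = "") -> List[str]:
    """Return the ordered stages a Release run would go through.

    `event` is the GitHub event name and `ref` the full git ref.
    The stage labels are illustrative, not the real job ids.
    """
    if event == "pull_request":
        return ["build", "deploy-preview", "relaunch dd-local-preview"]
    if event == "push" and ref == "refs/heads/main":
        return ["build", "deploy-production", "relaunch dd-local-prod"]
    if event == "push" and ref.startswith("refs/tags/v"):
        # Versioned artifact only, no deploy.
        return ["build"]
    if event == "workflow_dispatch":
        # Rollback path: deploys the tag chosen via the release_tag input.
        return ["deploy-production", "relaunch dd-local-prod"]
    return []


if __name__ == "__main__":
    for event, ref in [
        ("pull_request", "refs/pull/131/merge"),
        ("push", "refs/heads/main"),
        ("push", "refs/tags/v0.2.0"),
        ("workflow_dispatch", "refs/heads/main"),
    ]:
        print(event, ref, "->", " → ".join(plan_jobs(event, ref)))
```

Every path that deploys ends in a relaunch stage, which is what makes the agent re-registration the final gate.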

A release goes green only when the matching dd-local-{kind} VM re-registers with the freshly-deployed CP within 5 minutes. That is the merge gate: before merging, I want to see that the local agent deployment worked.

Deleted:

production-deploy.yml — folded into release.yml as a deploy-production job.
local-agents.yml — manual-dispatch path gone; push a commit to trigger a relaunch via the cascade.
scripts/ — gcp-deploy.sh inlined into deploy-cp.yml (the bake/join helpers become a ~10-line inline envsubst + jq pipeline). dd-relaunch.sh and local-agents.sh moved to apps/_infra/ (host-side scripts, consumed by the tdx2 host via git checkout). workloads.sh + redeploy-workload.sh deleted as unused.
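For readers unfamiliar with the bake step, here is a minimal Python model of what an "envsubst + jq" pipeline of this kind does: substitute a restricted set of uppercase ${VAR} references, then drop env entries whose value ended up empty. The real step is inline shell in deploy-cp.yml; the function name `bake`, the template contents, and the assumption that `env` is a list of `KEY=value` strings are all hypothetical.

```python
import json
import re
from typing import Dict, List


def bake(template: str, values: Dict[str, str], allowed: List[str]) -> dict:
    """Render a workload template and strip empty env entries.

    Only ${NAME} references whose NAME is in `allowed` are replaced,
    so shell locals such as $i or ${HOME} inside cmd strings survive.
    Values are inserted verbatim (no JSON escaping), as envsubst would.
    """
    def substitute(match):
        name = match.group(1)
        if name not in allowed:
            return match.group(0)  # leave undeclared references untouched
        return values.get(name, "")

    rendered = re.sub(r"\$\{([A-Z_][A-Z0-9_]*)\}", substitute, template)
    spec = json.loads(rendered)
    if "env" in spec:
        # Assumed schema: env is a list of "KEY=value" strings.
        spec["env"] = [e for e in spec["env"] if e.partition("=")[2] != ""]
    return spec


# Hypothetical template; the real workload schema lives in apps/.
TEMPLATE = """
{
  "cmd": ["sh", "-c", "for i in 1 2 3; do echo $i ${HOME}; done"],
  "env": ["DD_TOKEN=${DD_TOKEN}", "DD_EXTRA=${DD_EXTRA}"]
}
"""

if __name__ == "__main__":
    spec = bake(TEMPLATE, {"DD_TOKEN": "abc123", "DD_EXTRA": ""}, ["DD_TOKEN", "DD_EXTRA"])
    print(json.dumps(spec, indent=2))
```

Restricting substitution to a declared variable list is the important design point: an unrestricted pass would blank out every `$`-reference in the command strings.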

.github/actions/relaunch-agent/action.yml gains a blocking Verify agent registered with CP step (polls /api/agents for 5 min); deploy-cp.yml's Relaunch step drops continue-on-error: true.
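The blocking verify step amounts to a poll-until-deadline loop. The sketch below models it in Python with the HTTP call injected as a function, so it runs without a live CP. The response shape (`name`, `registered_at`) and the 10-second interval are assumptions for illustration; only the /api/agents endpoint and the 5-minute budget come from this PR.

```python
import time
from typing import Callable, Iterable


def wait_for_agent(
    fetch_agents: Callable[[], Iterable[dict]],
    name: str,
    deployed_at: float,
    budget_s: float = 300.0,   # 5-minute budget from the PR
    interval_s: float = 10.0,  # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `name` shows up with a registration newer than the deploy.

    `fetch_agents` stands in for a GET on /api/agents; entries are assumed
    to carry `name` and `registered_at` fields (hypothetical schema).
    Returns False once the budget is exhausted, which fails the step.
    """
    deadline = clock() + budget_s
    while True:
        for agent in fetch_agents():
            fresh = agent.get("registered_at", 0) >= deployed_at
            if agent.get("name") == name and fresh:
                return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated CP: the agent appears on the third poll.
    polls = {"n": 0}

    def fake_fetch():
        polls["n"] += 1
        if polls["n"] < 3:
            return []
        return [{"name": "dd-local-preview", "registered_at": 100.0}]

    now = {"t": 0.0}
    ok = wait_for_agent(
        fake_fetch, "dd-local-preview", deployed_at=50.0,
        clock=lambda: now["t"],
        sleep=lambda s: now.__setitem__("t", now["t"] + s),
    )
    print("registered:", ok, "after", polls["n"], "polls")
```

Comparing against the deploy time matters: a stale entry left over from the previous CP must not satisfy the check.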

Net: -235 lines, three files gone, one signal that actually matches intent.

Test plan

Throwaway PR → Release runs build → deploy-preview → dd-local-preview relaunches and re-registers → PR goes green.
Merge to main → Release runs build → deploy-production → dd-local-prod relaunches and re-registers.
workflow_dispatch with release_tag: v0.2.0 → deploy-production deploys that tag; dd-local-prod re-registers.
Two rapid PR pushes → old Release run cancels, new one runs (PR concurrency).
Two rapid main pushes → both runs queue, second waits for first (main concurrency never cancels).
ls scripts/ fails; ls apps/_infra/ shows dd-relaunch.sh + local-agents.sh.
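The two concurrency cases in the test plan come down to one decision per run: which group it joins and whether it cancels a run already in that group. A small Python model of that decision, assuming a per-ref group name (the `release-` prefix is hypothetical); in workflow YAML the equivalent would presumably be an expression on `cancel-in-progress` such as `${{ github.event_name == 'pull_request' }}`.

```python
from typing import Tuple


def concurrency_for(event: str, ref: str) -> Tuple[str, bool]:
    """Return (group, cancel_in_progress) for a Release run.

    PR runs cancel an in-progress run in the same group; main, tag,
    and dispatch runs queue so an in-flight prod deploy can finish.
    The group naming scheme is an assumption for illustration.
    """
    group = f"release-{ref}"
    cancel_in_progress = event == "pull_request"
    return group, cancel_in_progress


if __name__ == "__main__":
    print(concurrency_for("pull_request", "refs/pull/131/merge"))
    print(concurrency_for("push", "refs/heads/main"))
    print(concurrency_for("workflow_dispatch", "refs/heads/main"))
```

Because dispatch and main pushes share the same ref in this model, a rollback dispatch would queue behind an in-flight main deploy rather than race it.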

Stacked under #127 — retarget to main after #127 merges.

🤖 Generated with Claude Code

Release now owns the full lifecycle: build → deploy-preview (PR) OR deploy-production (main / dispatch) → relaunch-agent (blocking) → verify agent re-registered with CP. A release is "done" only when the local dd-local-{kind} VM is back online talking to the freshly-deployed CP; that's the signal that tells us a PR is safe to merge or a merge actually shipped.

Deleted:

.github/workflows/production-deploy.yml: folded into release.yml as a deploy-production job with same deploy-cp.yml body.
.github/workflows/local-agents.yml: manual-dispatch path gone; push a commit to trigger a relaunch via the cascade.

Deleted scripts/:

scripts/gcp-deploy.sh: inlined into deploy-cp.yml.
scripts/dd-relaunch.sh → apps/_infra/dd-relaunch.sh (host-side).
scripts/local-agents.sh → apps/_infra/local-agents.sh (host-side).
scripts/workloads.sh: dead after inline; only gcp-deploy sourced it and local-agents.sh built workloads via inline jq anyway.
scripts/redeploy-workload.sh: unused helper, removed.

deploy-cp.yml's Relaunch step drops `continue-on-error: true`; the relaunch-agent composite gains a "Verify agent registered with CP" step that polls /api/agents for a freshly-registered dd-local-{kind} entry with a 5-min budget.

Concurrency on release.yml becomes expression-driven: PR pushes cancel in-progress runs; main / tag / dispatch queue so an in-flight prod deploy finishes cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-18T17:49:50Z

DD preview ready

URL: https://pr-131.devopsdefender.com

Browser login: paste gh auth token output at https://pr-131.devopsdefender.com/auth/pat

CLI / curl: curl -H "Authorization: Bearer $(gh auth token)" https://pr-131.devopsdefender.com/

Register endpoint for a local agent: wss://pr-131.devopsdefender.com/register

Audit via gcloud revealed dd-pr-121-1776434711 running in staging a day after PR #121 merged (branch not deleted → pr-teardown.yml never fired; cleanup.yml only reaps TERMINATED, not RUNNING). Closes that gap and tidies what audit found along the way.

Changes:

- Delete .github/workflows/retire-staging.yml. One-shot to retire the legacy app-staging.{domain} env; has already been run, no dd_env=staging VMs exist. Header itself said "delete after use".
- cleanup.yml: drop reap-staging job (filters dd_env=staging, dead).
- cleanup.yml: drop "Production Deploy" from workflow_run trigger (that workflow is being removed in #131 / follow-ups). Release alone now covers both preview and prod post-deploy hooks.
- cleanup.yml: add reap-merged-pr-previews. For each pr-N env with a RUNNING VM, resolve the PR's state via gh; if MERGED/CLOSED, tear down VM + CF tunnel + DNS CNAME like pr-teardown.yml would have. Scheduled sweep + workflow_run + dispatch.

Deferred (blocked on easyenclave): image-family split preview→staging / prod→stable. easyenclave has no easyenclave-stable GCP family and no qcow2 on v0.1.14 today, so there's nothing to point prod at. Ship the split once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
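The reap-merged-pr-previews rule described in this commit can be sketched as a filter over the VM list. This is a Python model rather than the workflow's actual shell: the VM dict shape and the `pr_state` callback (standing in for a gh lookup of the PR state) are hypothetical, while the pr-N env naming, the RUNNING status, and the MERGED/CLOSED states come from the text.

```python
import re
from typing import Callable, Dict, List


def previews_to_reap(
    vms: List[Dict[str, str]],
    pr_state: Callable[[int], str],
) -> List[str]:
    """Return names of RUNNING preview VMs whose PR is MERGED or CLOSED.

    Each VM is assumed to look like {"name", "env", "status"}, with
    `env` holding the dd_env value (for example "pr-121").
    """
    doomed = []
    for vm in vms:
        match = re.fullmatch(r"pr-(\d+)", vm.get("env", ""))
        if match is None or vm.get("status") != "RUNNING":
            continue  # not a preview env, or not running
        if pr_state(int(match.group(1))) in {"MERGED", "CLOSED"}:
            doomed.append(vm["name"])
    return doomed


if __name__ == "__main__":
    vms = [
        {"name": "dd-pr-121-1776434711", "env": "pr-121", "status": "RUNNING"},
        {"name": "dd-pr-131-example", "env": "pr-131", "status": "RUNNING"},
        {"name": "dd-prod-example", "env": "production", "status": "RUNNING"},
    ]
    states = {121: "MERGED", 131: "OPEN"}
    print(previews_to_reap(vms, lambda n: states[n]))
```

Each returned name would then get the same teardown as pr-teardown.yml: VM, CF tunnel, and DNS CNAME.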

Consolidates PRs #127, #131, #132 into one commit. Net -381 lines, one "Release" workflow drives the whole fleet lifecycle.

Workload spec: apps/<name>/workload.{json,json.tmpl} becomes the single source of truth for every EE workload (cloudflared, dd-agent, dd-management, ollama, openclaw, nv, mount-models, podman-static, podman-bootstrap). Boot-time (config.iso / ee-config metadata) and runtime (/deploy) both bake from the same file.

Workflow topology: release.yml is the one entry point.

pull_request → build → deploy-preview → dd-local-preview relaunch
push main → build → deploy-production → dd-local-prod relaunch
push v* → build only (versioned artifact, no deploy)
workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths call, so preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies health + /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades agent relaunch. The cascade is blocking: a release goes green only when the matching dd-local-{kind} VM re-registers with the freshly-deployed CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite action that SSHes into tdx2, runs dd-relaunch.sh, and polls /api/agents for the freshly-registered entry (5-min budget).

Deleted:

.github/workflows/production-deploy.yml (folded into release.yml as deploy-production job).
.github/workflows/local-agents.yml (manual-dispatch path gone; push a commit to trigger a relaunch).
.github/workflows/retire-staging.yml (one-shot, already run, no dd_env=staging VMs exist).
scripts/ entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh deleted as unused after the refactor.

Cleanup audit: gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a day after PR #121 merged; the branch was not deleted, so pr-teardown.yml never fired and cleanup.yml only reaps TERMINATED. Added a reap-merged-pr-previews job to cleanup.yml that resolves each RUNNING pr-N VM's PR state via gh and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead), trimmed workflow_run trigger (Production Deploy is gone).

Deferred (blocked on easyenclave): easyenclave image family split (preview → easyenclave-staging, prod → easyenclave-stable). easyenclave has no stable GCP family and no qcow2 on v0.1.14 today, so nothing to point prod at. Plus tdx2 auto-refresh of the base qcow2 from easyenclave releases. Follow-up PR once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

posix4e · 2026-04-18T18:13:07Z

Folded into #132 (single-commit consolidation).

…ollama+openclaw

Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet path runs through one workflow, every agent VM boots the full container stack, and audit-surfaced cleanup gaps are closed.

## Workload spec: single source of truth

apps/<name>/workload.{json,json.tmpl} is the one place a workload is defined. Boot-time (config.iso) and runtime (/deploy) bake from the same file. New apps/README.md documents the schema, the lifecycle matrix (CP / preview agent / prod agent), ordering via `until` polling, and a "deploying your own" walkthrough. Main README points at it.

The bake helper (envsubst + jq strip-empty-env-entries) is inlined in two places, .github/workflows/deploy-cp.yml (CI, CP workloads) and apps/_infra/local-agents.sh (tdx2, agent workloads), restricted to the uppercase ${VAR} refs each template declares so shell locals ($i, $((…))) in cmd strings aren't eaten.

## Workflow topology: one entry point

release.yml drives everything:

pull_request → build → deploy-preview → dd-local-preview relaunch
push main → build → deploy-production → dd-local-prod relaunch
push v* → build only (versioned artifact, no deploy)
workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths call. Preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades the agent relaunch, blocking on the agent re-registering with the CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite that SSHes into tdx2, runs apps/_infra/dd-relaunch.sh, polls /api/agents for the freshly-registered entry (5-min budget).

Retired workflows:

production-deploy.yml → folded into release.yml as deploy-production
local-agents.yml → no remaining purpose (cascade handles relaunch)
retire-staging.yml → one-shot, already run

scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused.

## Container stack on agent VMs

dd-local-{preview,prod} now boot the full chain:

nv (prod only) → mount-models → podman-static → podman-bootstrap → ollama (prod.json with GPU devices / preview.json CPU only) → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview) → cloudflared → dd-agent

podman-bootstrap installs the wrapper as `podman` (not `dd-podman`), so bare `podman ps` from a guest shell reaches the right storage root instead of failing with `mkdir /var/lib/containers: read-only file system`. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat.

## Cleanup audit

gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121 merged (branch not deleted → pr-teardown.yml never fired; cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews: for each RUNNING pr-N VM, gh-resolves the PR state and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead: no dd_env=staging VMs exist), trimmed the workflow_run trigger (Production Deploy is gone).

## Deferred (blocked on easyenclave)

Preview → easyenclave-staging / prod → easyenclave-stable image-family split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave publishes stable images with qcow2 + a stable GCP image family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


posix4e temporarily deployed to staging April 18, 2026 17:48 — with GitHub Actions Inactive

posix4e mentioned this pull request Apr 18, 2026

ci: unify workloads + collapse deploy paths into Release + cleanup audit #132

Merged

5 tasks

posix4e closed this Apr 18, 2026

posix4e deleted the feat/fold-into-release branch April 18, 2026 21:57

posix4e wants to merge 1 commit into feat/unify-workloads from feat/fold-into-release
