
ci: unify workloads + collapse deploy paths into Release + cleanup audit#132

Merged
posix4e merged 1 commit into main from chore/cleanup-audit on Apr 18, 2026

@posix4e posix4e commented Apr 18, 2026

One consolidated PR replacing #127, #131, #132, #133. One commit. Everything below in one merge.

Scope

Workload spec — single source of truth

apps/<name>/workload.{json,json.tmpl} is the one place a workload is defined. Boot-time (config.iso) and runtime (/deploy) bake from the same file. New apps/README.md documents the schema, lifecycle matrix (CP / preview agent / prod agent), ordering via until polling, and a "deploying your own" walkthrough. Main README points at it.

Workflow topology — one entry point

release.yml drives everything:

  • pull_request → build → deploy-preview → dd-local-preview relaunch
  • push main → build → deploy-production → dd-local-prod relaunch
  • push v* → build only
  • workflow_dispatch → build → deploy-production (rollback via release_tag input)
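
The trigger list above can be read as a small dispatch function. The following is only a sketch of that mapping, assuming GitHub's usual ref names (`refs/heads/main`, `refs/tags/v*`). The function name `release_plan` and the space-separated stage strings are illustrative and do not come from release.yml.

```shell
# Hypothetical helper: map a GitHub event (plus ref, for pushes) to the
# stages the description lists for release.yml. Not part of the repo.
release_plan() {
  event=$1
  ref=${2-}
  case $event in
    pull_request)
      echo "build deploy-preview relaunch:dd-local-preview" ;;
    workflow_dispatch)
      # Rollback path: deploy-production driven by the release_tag input.
      echo "build deploy-production" ;;
    push)
      case $ref in
        refs/heads/main) echo "build deploy-production relaunch:dd-local-prod" ;;
        refs/tags/v*)    echo "build" ;;
        *)               return 1 ;;
      esac ;;
    *)
      return 1 ;;
  esac
}

# Example calls (the tag name is made up for illustration).
release_plan pull_request
release_plan push refs/tags/v1.2.3
```

Any event or ref outside the four listed cases returns non-zero here, which mirrors the idea that release.yml is the only entry point and nothing else triggers a deploy.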

.github/workflows/deploy-cp.yml is the reusable workflow both paths call. Preview CI exercises the exact code prod uses. It provisions the CP, verifies /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades the agent relaunch — blocking on the agent re-registering with the CP.

Retired: production-deploy.yml (folded in), local-agents.yml (unused), retire-staging.yml (one-shot, done). scripts/ deleted entirely (gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh → apps/_infra/).

Container stack on agent VMs

dd-local-{preview,prod} now boot the full chain: nv (prod only) → mount-models → podman-static → podman-bootstrap → ollama (GPU prod / CPU preview) → openclaw (qwen2.5:7b prod, qwen2.5:0.5b preview) → cloudflared → dd-agent. podman-bootstrap installs the wrapper as podman (not dd-podman), so bare podman ps in a guest shell works.

Cleanup audit

gcloud survey found dd-pr-121-1776434711 RUNNING in staging a day after PR #121 merged. Added reap-merged-pr-previews that resolves each RUNNING pr-N VM's PR state and tears down VM + CF tunnel + DNS when MERGED/CLOSED. Dropped reap-staging (dead), trimmed workflow_run trigger.

Deferred

Preview → easyenclave-staging / prod → easyenclave-stable image-family split + tdx2 auto-refresh. Blocked on easyenclave publishing stable qcow2 assets.

Test plan

  • Release cascade on this PR runs build → deploy-preview → relaunch-agent composite (SSH, dd-relaunch.sh, poll /api/agents for dd-local-preview).
  • Dashboard at https://pr-132.devopsdefender.com/ returns 200 with Bearer PAT.
  • podman ps in a dd-local-{kind} guest returns the ollama container (no read-only-fs error).
  • After merge, deploy-production cascade provisions prod CP + relaunches dd-local-prod with qwen2.5:7b on H100.
  • Dispatch Cleanup → reaps dd-pr-121; leaves open PRs alone.

Replaces #127, #131, #132, #133.

🤖 Generated with Claude Code

@posix4e posix4e force-pushed the chore/cleanup-audit branch from ffd015b to eaf40d3 April 18, 2026 18:12
posix4e added a commit that referenced this pull request Apr 18, 2026
Consolidates PRs #127, #131, #132 into one commit. Net -381 lines,
one "Release" workflow drives the whole fleet lifecycle.

Workload spec:
  apps/<name>/workload.{json,json.tmpl} becomes the single source of
  truth for every EE workload (cloudflared, dd-agent, dd-management,
  ollama, openclaw, nv, mount-models, podman-static, podman-bootstrap).
  Boot-time (config.iso / ee-config metadata) and runtime (/deploy)
  both bake from the same file.

Workflow topology:
  release.yml is the one entry point.
    pull_request      → build → deploy-preview      → dd-local-preview relaunch
    push main         → build → deploy-production   → dd-local-prod    relaunch
    push v*           → build only (versioned artifact, no deploy)
    workflow_dispatch → build → deploy-production   (rollback: release_tag input)
  .github/workflows/deploy-cp.yml is the reusable workflow both paths
  call, so preview CI exercises the exact code prod uses. It provisions
  the GCP CP VM, verifies health + /cp/attest MRTD + dashboard + STONITH,
  comments on PR, and cascades agent relaunch. The cascade is blocking:
  a release goes green only when the matching dd-local-{kind} VM
  re-registers with the freshly-deployed CP (proves "everything works").
  .github/actions/relaunch-agent/ is the composite action that SSHes
  into tdx2, runs dd-relaunch.sh, and polls /api/agents for the
  freshly-registered entry (5-min budget).

Deleted:
  .github/workflows/production-deploy.yml (folded into release.yml
    as deploy-production job).
  .github/workflows/local-agents.yml (manual-dispatch path gone;
    push a commit to trigger a relaunch).
  .github/workflows/retire-staging.yml (one-shot, already run,
    no dd_env=staging VMs exist).
  scripts/ entirely. gcp-deploy.sh inlined into deploy-cp.yml;
  dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side);
  ollama-deploy.sh, redeploy-workload.sh, workloads.sh deleted as
  unused after the refactor.

Cleanup audit:
  gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a
  day after PR #121 merged — branch not deleted, so pr-teardown.yml
  never fired and cleanup.yml only reaps TERMINATED. Added a
  reap-merged-pr-previews job to cleanup.yml that resolves each
  RUNNING pr-N VM's PR state via gh and tears down VM + CF tunnel
  + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead),
  trimmed workflow_run trigger (Production Deploy is gone).

Deferred (blocked on easyenclave):
  easyenclave image family split — preview → easyenclave-staging,
  prod → easyenclave-stable. easyenclave has no stable GCP family
  and no qcow2 on v0.1.14 today, so nothing to point prod at. Plus
  tdx2 auto-refresh of the base qcow2 from easyenclave releases.
  Follow-up PR once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@posix4e posix4e changed the title ci: audit cleanup surfaces — drop dead code, add merged-PR reaper ci: unify workloads + collapse deploy paths into Release + cleanup audit Apr 18, 2026
@posix4e posix4e changed the base branch from feat/fold-into-release to main April 18, 2026 18:12
@github-actions

DD preview ready

URL: https://pr-132.devopsdefender.com

Browser login: paste gh auth token output at https://pr-132.devopsdefender.com/auth/pat

CLI / curl: curl -H "Authorization: Bearer $(gh auth token)" https://pr-132.devopsdefender.com/

Register endpoint for a local agent: wss://pr-132.devopsdefender.com/register

…ollama+openclaw

Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet
path runs through one workflow, every agent VM boots the full
container stack, and audit-surfaced cleanup gaps are closed.

## Workload spec — single source of truth

apps/<name>/workload.{json,json.tmpl} is the one place a workload is
defined. Boot-time (config.iso) and runtime (/deploy) bake from the
same file. New apps/README.md documents the schema, the lifecycle
matrix (CP / preview agent / prod agent), ordering via `until` polling,
and a "deploying your own" walkthrough. Main README points at it.

The bake helper (envsubst + jq strip-empty-env-entries) is inlined in
two places — .github/workflows/deploy-cp.yml (CI, CP workloads) and
apps/_infra/local-agents.sh (tdx2, agent workloads) — restricted to
the uppercase ${VAR} refs each template declares so shell locals
($i, $((…))) in cmd strings aren't eaten.
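
Why the restriction matters can be shown with a small stand-in. The real helper uses envsubst (which substitutes only the variables named in its optional argument) followed by jq. The sketch below swaps in a plain-sh function so it runs without either tool, and it leaves out the jq step because the workload schema is not shown here. The function name `render_restricted`, the `CP_URL` variable, and the sample cmd string are all hypothetical.

```shell
# render_restricted TEMPLATE NAME...
# Replaces each literal ${NAME} in TEMPLATE with the value of shell variable
# NAME, for the uppercase names given. Everything else ($i, $((...)), other
# ${VAR} refs) is left untouched. Stand-in for a restricted envsubst call.
render_restricted() {
  rr_text=$1
  shift
  for rr_name in "$@"; do
    # Accept only uppercase identifiers, which also keeps the eval safe.
    case $rr_name in
      ''|[!A-Z_]*|*[!A-Z0-9_]*) continue ;;
    esac
    eval "rr_val=\${$rr_name-}"
    rr_needle="\${$rr_name}"
    rr_out=
    while :; do
      case $rr_text in
        *"$rr_needle"*)
          # Copy the text before the first match, then the value.
          rr_out=$rr_out${rr_text%%"$rr_needle"*}$rr_val
          rr_text=${rr_text#*"$rr_needle"}
          ;;
        *) break ;;
      esac
    done
    rr_text=$rr_out$rr_text
  done
  printf '%s\n' "$rr_text"
}

# Hypothetical template value and cmd string.
CP_URL=https://cp.example.invalid
render_restricted 'for i in 1 2 3; do curl ${CP_URL}/health && break; sleep $((i * 2)); done' CP_URL
```

An unrestricted substitution would blank out `$i` and mangle `$((i * 2))` because neither is set at bake time. Passing only the declared names keeps the cmd string intact for the guest shell to evaluate later.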

## Workflow topology — one entry point

release.yml drives everything:
  pull_request     → build → deploy-preview → dd-local-preview relaunch
  push main        → build → deploy-production → dd-local-prod relaunch
  push v*          → build only (versioned artifact, no deploy)
  workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths
call. Preview CI exercises the exact code prod uses. It provisions
the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH,
comments on PR, and cascades the agent relaunch — blocking on the
agent re-registering with the CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite action that SSHes
into tdx2, runs apps/_infra/dd-relaunch.sh, and polls /api/agents for
the freshly-registered entry (5-min budget).
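
The blocking poll with a time budget could look roughly like the sketch below. This is not the composite action's code: `poll_until`, `agent_registered`, `CP_URL`, `TOKEN`, the 10-second interval, and the grep on the agent name are all assumptions, and the real /api/agents response shape is not shown in this PR.

```shell
# poll_until BUDGET_S INTERVAL_S CMD [ARGS...]
# Re-runs CMD until it succeeds; gives up once the summed sleep intervals
# would exceed BUDGET_S. Time spent inside CMD itself is not counted.
poll_until() {
  budget=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -gt "$budget" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical check: succeeds when the agent name shows up in the listing.
agent_registered() {
  curl -fsS -H "Authorization: Bearer $TOKEN" "$CP_URL/api/agents" \
    | grep -q "$1"
}

# Usage (5-minute budget, assumed 10-second interval):
#   poll_until 300 10 agent_registered dd-local-preview
```

Returning non-zero when the budget runs out is what would let the release job fail instead of going green with an unregistered agent.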

Retired workflows:
  production-deploy.yml → folded into release.yml as deploy-production
  local-agents.yml      → no remaining purpose (cascade handles relaunch)
  retire-staging.yml    → one-shot, already run

scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml;
dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side);
ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused.

## Container stack on agent VMs

dd-local-{preview,prod} now boot the full chain:
  nv (prod only) → mount-models → podman-static → podman-bootstrap
    → ollama (prod.json with GPU devices / preview.json CPU only)
    → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview)
    → cloudflared → dd-agent

podman-bootstrap installs the wrapper as `podman` (not `dd-podman`),
so bare `podman ps` from a guest shell reaches the right storage
root instead of failing with `mkdir /var/lib/containers: read-only
file system`. Raw binary moves to .podman-raw; dd-podman becomes a
symlink for back-compat.
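
The rename described above might be implemented along these lines. This is a sketch under assumptions: `--root` is a real podman global option, but whether the actual wrapper uses it, the function name `install_podman_wrapper`, and the directory arguments are guesses for illustration.

```shell
# install_podman_wrapper BINDIR STORAGE_ROOT
# Moves the real binary aside and installs a wrapper under the plain name.
install_podman_wrapper() {
  bindir=$1
  storage_root=$2
  mv "$bindir/podman" "$bindir/.podman-raw"
  cat > "$bindir/podman" <<EOF
#!/bin/sh
# Generated wrapper: always pass the writable storage root.
exec "$bindir/.podman-raw" --root "$storage_root" "\$@"
EOF
  chmod +x "$bindir/podman"
  # The old name keeps working for existing callers.
  ln -sf podman "$bindir/dd-podman"
}
```

With the wrapper owning the `podman` name, an interactive `podman ps` and any script still calling `dd-podman` both go through the same storage settings.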

## Cleanup audit

gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121
merged (branch not deleted → pr-teardown.yml never fired;
cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews:
for each RUNNING pr-N VM, gh-resolves the PR state and tears down
VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging
(dead — no dd_env=staging VMs exist), trimmed the workflow_run
trigger (Production Deploy is gone).
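
The reaper's decision logic can be sketched separately from the cloud calls. The sketch assumes preview VMs are named `dd-pr-<N>-<suffix>`, inferred from the single example `dd-pr-121-1776434711`. The function names and the second sample VM are made up, and the demo feeds canned states instead of calling gh or gcloud. In the real job the state is resolved via gh, for instance with something like `gh pr view "$pr" --json state --jq .state`, though the exact invocation is an assumption.

```shell
# Extract the PR number from a preview VM name like dd-pr-121-1776434711.
pr_number_from_vm() {
  case $1 in
    dd-pr-*-*) ;;
    *) return 1 ;;
  esac
  num=${1#dd-pr-}
  num=${num%%-*}
  case $num in
    ''|*[!0-9]*) return 1 ;;
  esac
  echo "$num"
}

# A RUNNING preview VM is reaped only when its PR is no longer open.
should_reap() {
  case $1 in
    MERGED|CLOSED) return 0 ;;
    *) return 1 ;;
  esac
}

# Demo with canned "vm state" pairs (second VM name is invented).
while read -r vm state; do
  if pr=$(pr_number_from_vm "$vm") && should_reap "$state"; then
    echo "reap $vm (PR #$pr is $state)"
  else
    echo "keep $vm"
  fi
done <<'EOF'
dd-pr-121-1776434711 MERGED
dd-pr-999-0000000000 OPEN
EOF
```

Keying the decision on PR state rather than on branch deletion is what closes the gap described above, where an undeleted branch meant pr-teardown.yml never fired.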

## Deferred (blocked on easyenclave)

Preview → easyenclave-staging / prod → easyenclave-stable image-family
split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable
GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave
publishes stable images with qcow2 + a stable GCP image family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@posix4e posix4e force-pushed the chore/cleanup-audit branch from eaf40d3 to 19f52b7 April 18, 2026 20:31
@posix4e posix4e merged commit 92d96fc into main Apr 18, 2026
4 checks passed
@posix4e posix4e deleted the chore/cleanup-audit branch April 18, 2026 20:38