feat(apps): wire ollama+openclaw into dd-local-{preview,prod} by posix4e · Pull Request #133 · devopsdefender/dd

posix4e · 2026-04-18T20:25:11Z

Summary

Agent VMs boot the full container stack now. Every PR push exercises podman + ollama + openclaw end-to-end; every main merge reproduces the same chain on dd-local-prod with the H100.

Preview (dd-local-preview, CPU): qwen2.5:0.5b (~400 MB).
Prod (dd-local-prod, GPU): qwen2.5:7b (~4.4 GB) on the H100 NVL.

Changes

apps/_infra/local-agents.sh: replace inline jq -c -n workload literals with a bake() helper that reads from apps/<name>/workload.{json,json.tmpl}. Agent workload set grows from {nv, mount-models, cloudflared, dd-agent} → {nv (prod only), mount-models, podman-static, podman-bootstrap, ollama.{prod,preview}, openclaw, cloudflared, dd-agent}.
apps/podman-bootstrap/workload.json: install the wrapper as podman (not dd-podman) so bare podman ps from PATH works instead of failing with mkdir /var/lib/containers: read-only file system. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat with openclaw's spec.
bake() in both deploy-cp.yml and local-agents.sh now passes envsubst a restricted var list (uppercase ${VAR} references only). Lowercase $i / $((…)) inside openclaw's until loop no longer get eaten.
New apps/README.md: canonical reference for the workload spec, lifecycle matrix (CP / preview agent / prod agent), ordering pattern (until polling), and a "deploying your own" walkthrough. Main README points at it.

Stacked on #132.

Test plan

This PR's Release cascade: deploy-preview brings up pr-N.devopsdefender.com, dd-local-preview relaunches with the full workload set, re-registers. PR goes green.
After the merge chain lands on main, the deploy-production cascade runs: dd-local-prod relaunches with the GPU workload set and qwen2.5:7b.
podman ps inside a dd-local-{kind} guest returns the running ollama container (no more read-only-fs error).
Openclaw health (observational, not a release gate): curl -H "Authorization: Bearer $PAT" https://<agent>/exec -d '{"cmd":["podman","exec","ollama","curl","-fsS","http://127.0.0.1:18789/healthz"],"timeout_secs":15}' returns exit 0 after first-boot npm install settles.
Model inference smoke (prod, manual): curl … /exec -d '{"cmd":["podman","exec","ollama","openclaw","agent","--message","ping","--thinking","low"],"timeout_secs":120}' returns exit 0 with a non-empty reply.

Follow-ups (not in this PR)

gh-pages walkthrough reproducing apps/README.md in marketing voice + "this is what pr-N.devopsdefender.com is running right now" link.
Image-family split preview→staging / prod→stable, tdx2 base-image auto-refresh from easyenclave releases. Blocked on easyenclave publishing stable qcow2 assets.

🤖 Generated with Claude Code

Consolidates PRs #127, #131, #132 into one commit. Net -381 lines, one "Release" workflow drives the whole fleet lifecycle.

Workload spec: apps/<name>/workload.{json,json.tmpl} becomes the single source of truth for every EE workload (cloudflared, dd-agent, dd-management, ollama, openclaw, nv, mount-models, podman-static, podman-bootstrap). Boot-time (config.iso / ee-config metadata) and runtime (/deploy) both bake from the same file.

Workflow topology: release.yml is the one entry point.

  pull_request      → build → deploy-preview → dd-local-preview relaunch
  push main         → build → deploy-production → dd-local-prod relaunch
  push v*           → build only (versioned artifact, no deploy)
  workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths call, so preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies health + /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades agent relaunch. The cascade is blocking: a release goes green only when the matching dd-local-{kind} VM re-registers with the freshly-deployed CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite action that SSHes into tdx2, runs dd-relaunch.sh, and polls /api/agents for the freshly-registered entry (5-min budget).

Deleted:
- .github/workflows/production-deploy.yml (folded into release.yml as deploy-production job).
- .github/workflows/local-agents.yml (manual-dispatch path gone; push a commit to trigger a relaunch).
- .github/workflows/retire-staging.yml (one-shot, already run, no dd_env=staging VMs exist).
- scripts/ entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh deleted as unused after the refactor.

Cleanup audit: gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a day after PR #121 merged. The branch was not deleted, so pr-teardown.yml never fired, and cleanup.yml only reaps TERMINATED. Added a reap-merged-pr-previews job to cleanup.yml that resolves each RUNNING pr-N VM's PR state via gh and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead), trimmed workflow_run trigger (Production Deploy is gone).

Deferred (blocked on easyenclave): easyenclave image family split (preview → easyenclave-staging, prod → easyenclave-stable). easyenclave has no stable GCP family and no qcow2 on v0.1.14 today, so nothing to point prod at. Plus tdx2 auto-refresh of the base qcow2 from easyenclave releases. Follow-up PR once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
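
The reap rule described in the cleanup audit can be sketched roughly as below. This is an illustration under assumptions, not the job in cleanup.yml: the function names are hypothetical, the VM naming pattern is inferred from the one example name above, and the actual teardown (VM + CF tunnel + DNS CNAME) is left out. The only external call is `gh pr view … --json state`, which reports `OPEN`, `CLOSED` or `MERGED`.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; teardown is not shown.
set -eu

# Extract N from a preview VM name such as dd-pr-121-1776434711.
pr_number_from_vm() {
  case $1 in
    dd-pr-[0-9]*-*) ;;
    *) return 1 ;;
  esac
  pn_rest=${1#dd-pr-}
  pn_num=${pn_rest%%-*}
  case $pn_num in
    *[!0-9]*) return 1 ;;   # reject anything that is not purely digits
  esac
  printf '%s\n' "$pn_num"
}

# Look up the PR state with the GitHub CLI.
pr_state() {
  gh pr view "$1" --json state --jq '.state'
}

# Decide what to do with one RUNNING VM: reap, keep, or skip (not a PR VM).
reap_decision() {
  rd_pr=$(pr_number_from_vm "$1") || { echo skip; return 0; }
  rd_state=$(pr_state "$rd_pr")
  case $rd_state in
    MERGED|CLOSED) echo reap ;;
    *)             echo keep ;;
  esac
}
```

A caller would feed it the names of RUNNING instances (for example from a gcloud listing) and tear down only those for which `reap_decision` prints `reap`.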

…README.md

Agent VMs boot the full container stack now: every PR push and every main merge exercises podman + ollama + openclaw end-to-end. Preview runs CPU inference with qwen2.5:0.5b; prod runs GPU inference with qwen2.5:7b on the H100.

Changes:

- apps/_infra/local-agents.sh: replace the inline `jq -c -n` workload literals with a `bake()` helper that reads from apps/<name>/workload.{json,json.tmpl}. The workload set grows from {nv, mount-models, cloudflared, dd-agent} to {nv (prod only), mount-models, podman-static, podman-bootstrap, ollama.{prod,preview}.json, openclaw, cloudflared, dd-agent}.
- apps/podman-bootstrap: install the wrapper as `podman` (not `dd-podman`) so bare `podman ps` from PATH reaches the right storage root instead of erroring with `mkdir /var/lib/containers: read-only file system`. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat.
- deploy-cp.yml + local-agents.sh bake() now pass envsubst a restricted var list: only the uppercase `${VAR}` references the template actually declares. Lowercase shell locals ($i, $((…))) inside openclaw's `until` loop are no longer eaten.
- apps/README.md: canonical reference for the workload spec, lifecycle matrix (CP / preview agent / prod agent), and a "deploying your own" walkthrough. Main README.md points at it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
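
To make the podman-bootstrap layout concrete, here is a self-contained sketch of the wrapper / raw binary / symlink arrangement. Everything about it is an assumption for illustration: the directories are temporary, the "raw" binary is a stub that only echoes its arguments, and pinning storage with podman's global `--root` / `--runroot` flags is one plausible way for a wrapper to avoid /var/lib/containers, not necessarily what the workload does.

```shell
#!/bin/sh
# Sketch only: paths, the stub binary and the flag-based approach are
# hypothetical; the real podman-bootstrap workload may differ.
set -eu

bin_dir=$(mktemp -d)        # stands in for the guest's bin directory on PATH
trap 'rm -rf "$bin_dir"' EXIT
store=$bin_dir/storage      # stands in for a writable storage location

# Stub standing in for the real podman binary: it just echoes its arguments.
cat > "$bin_dir/.podman-raw" <<'EOF'
#!/bin/sh
echo "raw-podman $*"
EOF

# The wrapper takes the plain name "podman" and pins the storage locations.
cat > "$bin_dir/podman" <<EOF
#!/bin/sh
exec "$bin_dir/.podman-raw" --root "$store/root" --runroot "$store/run" "\$@"
EOF

chmod +x "$bin_dir/.podman-raw" "$bin_dir/podman"

# Back-compat: the old name keeps working as a symlink to the wrapper.
ln -s podman "$bin_dir/dd-podman"

PATH=$bin_dir:$PATH
podman ps       # goes through the wrapper
dd-podman ps    # same wrapper, reached via the symlink
```

Both invocations end up in the same wrapper, which is why callers that still use the `dd-podman` name (such as openclaw's spec) keep working.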

github-actions · 2026-04-18T20:27:49Z

DD preview ready

URL: https://pr-133.devopsdefender.com

Browser login: paste gh auth token output at https://pr-133.devopsdefender.com/auth/pat

CLI / curl: curl -H "Authorization: Bearer $(gh auth token)" https://pr-133.devopsdefender.com/

Register endpoint for a local agent: wss://pr-133.devopsdefender.com/register

…ollama+openclaw

Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet path runs through one workflow, every agent VM boots the full container stack, and audit-surfaced cleanup gaps are closed.

## Workload spec: single source of truth

apps/<name>/workload.{json,json.tmpl} is the one place a workload is defined. Boot-time (config.iso) and runtime (/deploy) bake from the same file. New apps/README.md documents the schema, the lifecycle matrix (CP / preview agent / prod agent), ordering via `until` polling, and a "deploying your own" walkthrough. Main README points at it.

The bake helper (envsubst + jq strip-empty-env-entries) is inlined in two places: .github/workflows/deploy-cp.yml (CI, CP workloads) and apps/_infra/local-agents.sh (tdx2, agent workloads). Both are restricted to the uppercase ${VAR} refs each template declares so shell locals ($i, $((…))) in cmd strings aren't eaten.

## Workflow topology: one entry point

release.yml drives everything:

  pull_request      → build → deploy-preview → dd-local-preview relaunch
  push main         → build → deploy-production → dd-local-prod relaunch
  push v*           → build only (versioned artifact, no deploy)
  workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths call. Preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades the agent relaunch, blocking on the agent re-registering with the CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite that SSHes into tdx2, runs apps/_infra/dd-relaunch.sh, polls /api/agents for the freshly-registered entry (5-min budget).

Retired workflows:

  production-deploy.yml → folded into release.yml as deploy-production
  local-agents.yml      → no remaining purpose (cascade handles relaunch)
  retire-staging.yml    → one-shot, already run

scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused.

## Container stack on agent VMs

dd-local-{preview,prod} now boot the full chain:

  nv (prod only) → mount-models → podman-static → podman-bootstrap
    → ollama (prod.json with GPU devices / preview.json CPU only)
    → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview)
    → cloudflared → dd-agent

podman-bootstrap installs the wrapper as `podman` (not `dd-podman`), so bare `podman ps` from a guest shell reaches the right storage root instead of failing with `mkdir /var/lib/containers: read-only file system`. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat.

## Cleanup audit

gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121 merged (branch not deleted → pr-teardown.yml never fired; cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews: for each RUNNING pr-N VM, gh-resolves the PR state and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead: no dd_env=staging VMs exist), trimmed the workflow_run trigger (Production Deploy is gone).

## Deferred (blocked on easyenclave)

Preview → easyenclave-staging / prod → easyenclave-stable image-family split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave publishes stable images with qcow2 + a stable GCP image family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
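
The blocking cascade (poll /api/agents until the relaunched agent shows up, within a 5-minute budget) follows a common poll-with-deadline shape. A minimal sketch of that shape, assuming nothing about the real composite action: `poll_until` is a hypothetical helper, and the check command is passed in by the caller.

```shell
#!/bin/sh
# Sketch only: poll_until is a hypothetical helper, not the code in
# .github/actions/relaunch-agent/.
set -eu

# poll_until BUDGET_SECS INTERVAL_SECS CMD [ARGS...]
# Re-runs CMD until it succeeds or the time budget is spent.
poll_until() {
  pu_budget=$1
  pu_interval=$2
  shift 2
  pu_deadline=$(( $(date +%s) + pu_budget ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$pu_deadline" ]; then
      echo "timed out after ${pu_budget}s: $*" >&2
      return 1
    fi
    sleep "$pu_interval"
  done
}
```

A caller would wrap its own check, for example `poll_until 300 5 agent_registered "$name"`, where `agent_registered` is an assumed function that queries /api/agents and succeeds once the fresh entry is present. A non-zero return is what would keep the release from going green.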

posix4e · 2026-04-18T20:31:38Z

Folded into #132 (single-commit consolidation).


posix4e and others added 2 commits April 18, 2026 18:12

posix4e temporarily deployed to staging April 18, 2026 20:26 — with GitHub Actions Inactive

posix4e force-pushed the chore/cleanup-audit branch from eaf40d3 to 19f52b7 Compare April 18, 2026 20:31

posix4e closed this Apr 18, 2026

posix4e mentioned this pull request Apr 18, 2026

ci: unify workloads + collapse deploy paths into Release + cleanup audit #132

Merged

5 tasks

posix4e deleted the feat/wire-containers branch April 18, 2026 21:57

posix4e wants to merge 2 commits into chore/cleanup-audit from feat/wire-containers
