feat(apps): wire ollama+openclaw into dd-local-{preview,prod}#133
Closed
posix4e wants to merge 2 commits into
Closed
feat(apps): wire ollama+openclaw into dd-local-{preview,prod}#133posix4e wants to merge 2 commits into
posix4e wants to merge 2 commits into
Conversation
Consolidates PRs #127, #131, #132 into one commit. Net -381 lines, one "Release" workflow drives the whole fleet lifecycle. Workload spec: apps/<name>/workload.{json,json.tmpl} becomes the single source of truth for every EE workload (cloudflared, dd-agent, dd-management, ollama, openclaw, nv, mount-models, podman-static, podman-bootstrap). Boot-time (config.iso / ee-config metadata) and runtime (/deploy) both bake from the same file. Workflow topology: release.yml is the one entry point. pull_request → build → deploy-preview → dd-local-preview relaunch push main → build → deploy-production → dd-local-prod relaunch push v* → build only (versioned artifact, no deploy) workflow_dispatch → build → deploy-production (rollback: release_tag input) .github/workflows/deploy-cp.yml is the reusable workflow both paths call, so preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies health + /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades agent relaunch. The cascade is blocking: a release goes green only when the matching dd-local-{kind} VM re-registers with the freshly-deployed CP (proves "everything works"). .github/actions/relaunch-agent/ is the composite action that SSHes into tdx2, runs dd-relaunch.sh, and polls /api/agents for the freshly-registered entry (5-min budget). Deleted: .github/workflows/production-deploy.yml (folded into release.yml as deploy-production job). .github/workflows/local-agents.yml (manual-dispatch path gone; push a commit to trigger a relaunch). .github/workflows/retire-staging.yml (one-shot, already run, no dd_env=staging VMs exist). scripts/ entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh deleted as unused after the refactor. Cleanup audit: gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a day after PR #121 merged — branch not deleted, so pr-teardown.yml never fired and cleanup.yml only reaps TERMINATED. Added a reap-merged-pr-previews job to cleanup.yml that resolves each RUNNING pr-N VM's PR state via gh and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead), trimmed workflow_run trigger (Production Deploy is gone). Deferred (blocked on easyenclave): easyenclave image family split — preview → easyenclave-staging, prod → easyenclave-stable. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today, so nothing to point prod at. Plus tdx2 auto-refresh of the base qcow2 from easyenclave releases. Follow-up PR once easyenclave publishes stable images. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…README.md
Agent VMs boot the full container stack now — every PR push and every
main merge exercises podman + ollama + openclaw end-to-end. Preview
runs CPU inference with qwen2.5:0.5b; prod runs GPU inference with
qwen2.5:7b on the H100.
Changes:
- apps/_infra/local-agents.sh: replace the inline `jq -c -n` workload
literals with a `bake()` helper that reads from apps/<name>/workload
.{json,json.tmpl}. The workload set grows from
{nv, mount-models, cloudflared, dd-agent} to
{nv (prod only), mount-models, podman-static, podman-bootstrap,
ollama.{prod,preview}.json, openclaw, cloudflared, dd-agent}.
- apps/podman-bootstrap: install the wrapper as `podman` (not
`dd-podman`) so bare `podman ps` from PATH reaches the right
storage root instead of erroring with `mkdir /var/lib/containers:
read-only file system`. Raw binary moves to .podman-raw; dd-podman
becomes a symlink for back-compat.
- deploy-cp.yml + local-agents.sh bake() now pass envsubst a
restricted var list — only the uppercase `${VAR}` references the
template actually declares. Lowercase shell locals ($i, $((…)))
inside openclaw's `until` loop are no longer eaten.
- apps/README.md: canonical reference for the workload spec, lifecycle
matrix (CP / preview agent / prod agent), and a "deploying your own"
walkthrough. Main README.md points at it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DD preview readyURL: https://pr-133.devopsdefender.com Browser login: paste CLI / curl: Register endpoint for a local agent: |
eaf40d3 to
19f52b7
Compare
posix4e
added a commit
that referenced
this pull request
Apr 18, 2026
…ollama+openclaw Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet path runs through one workflow, every agent VM boots the full container stack, and audit-surfaced cleanup gaps are closed. ## Workload spec — single source of truth apps/<name>/workload.{json,json.tmpl} is the one place a workload is defined. Boot-time (config.iso) and runtime (/deploy) bake from the same file. New apps/README.md documents the schema, the lifecycle matrix (CP / preview agent / prod agent), ordering via `until` polling, and a "deploying your own" walkthrough. Main README points at it. The bake helper (envsubst + jq strip-empty-env-entries) is inlined in two places — .github/workflows/deploy-cp.yml (CI, CP workloads) and apps/_infra/local-agents.sh (tdx2, agent workloads) — restricted to the uppercase ${VAR} refs each template declares so shell locals ($i, $((…))) in cmd strings aren't eaten. ## Workflow topology — one entry point release.yml drives everything: pull_request → build → deploy-preview → dd-local-preview relaunch push main → build → deploy-production → dd-local-prod relaunch push v* → build only (versioned artifact, no deploy) workflow_dispatch → build → deploy-production (rollback: release_tag input) .github/workflows/deploy-cp.yml is the reusable workflow both paths call. Preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades the agent relaunch — blocking on the agent re-registering with the CP (proves "everything works"). .github/actions/relaunch-agent/ is the composite that SSHes into tdx2, runs apps/_infra/dd-relaunch.sh, polls /api/agents for the freshly-registered entry (5-min budget). Retired workflows: production-deploy.yml → folded into release.yml as deploy-production local-agents.yml → no remaining purpose (cascade handles relaunch) retire-staging.yml → one-shot, already run scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused. ## Container stack on agent VMs dd-local-{preview,prod} now boot the full chain: nv (prod only) → mount-models → podman-static → podman-bootstrap → ollama (prod.json with GPU devices / preview.json CPU only) → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview) → cloudflared → dd-agent podman-bootstrap installs the wrapper as `podman` (not `dd-podman`), so bare `podman ps` from a guest shell reaches the right storage root instead of failing with `mkdir /var/lib/containers: read-only file system`. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat. ## Cleanup audit gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121 merged (branch not deleted → pr-teardown.yml never fired; cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews: for each RUNNING pr-N VM, gh-resolves the PR state and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead — no dd_env=staging VMs exist), trimmed the workflow_run trigger (Production Deploy is gone). ## Deferred (blocked on easyenclave) Preview → easyenclave-staging / prod → easyenclave-stable image-family split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave publishes stable images with qcow2 + a stable GCP image family. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member
Author
|
Folded into #132 (single-commit consolidation). |
5 tasks
posix4e
added a commit
that referenced
this pull request
Apr 18, 2026
…ollama+openclaw (#132) Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet path runs through one workflow, every agent VM boots the full container stack, and audit-surfaced cleanup gaps are closed. ## Workload spec — single source of truth apps/<name>/workload.{json,json.tmpl} is the one place a workload is defined. Boot-time (config.iso) and runtime (/deploy) bake from the same file. New apps/README.md documents the schema, the lifecycle matrix (CP / preview agent / prod agent), ordering via `until` polling, and a "deploying your own" walkthrough. Main README points at it. The bake helper (envsubst + jq strip-empty-env-entries) is inlined in two places — .github/workflows/deploy-cp.yml (CI, CP workloads) and apps/_infra/local-agents.sh (tdx2, agent workloads) — restricted to the uppercase ${VAR} refs each template declares so shell locals ($i, $((…))) in cmd strings aren't eaten. ## Workflow topology — one entry point release.yml drives everything: pull_request → build → deploy-preview → dd-local-preview relaunch push main → build → deploy-production → dd-local-prod relaunch push v* → build only (versioned artifact, no deploy) workflow_dispatch → build → deploy-production (rollback: release_tag input) .github/workflows/deploy-cp.yml is the reusable workflow both paths call. Preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades the agent relaunch — blocking on the agent re-registering with the CP (proves "everything works"). .github/actions/relaunch-agent/ is the composite that SSHes into tdx2, runs apps/_infra/dd-relaunch.sh, polls /api/agents for the freshly-registered entry (5-min budget). Retired workflows: production-deploy.yml → folded into release.yml as deploy-production local-agents.yml → no remaining purpose (cascade handles relaunch) retire-staging.yml → one-shot, already run scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused. ## Container stack on agent VMs dd-local-{preview,prod} now boot the full chain: nv (prod only) → mount-models → podman-static → podman-bootstrap → ollama (prod.json with GPU devices / preview.json CPU only) → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview) → cloudflared → dd-agent podman-bootstrap installs the wrapper as `podman` (not `dd-podman`), so bare `podman ps` from a guest shell reaches the right storage root instead of failing with `mkdir /var/lib/containers: read-only file system`. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat. ## Cleanup audit gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121 merged (branch not deleted → pr-teardown.yml never fired; cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews: for each RUNNING pr-N VM, gh-resolves the PR state and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead — no dd_env=staging VMs exist), trimmed the workflow_run trigger (Production Deploy is gone). ## Deferred (blocked on easyenclave) Preview → easyenclave-staging / prod → easyenclave-stable image-family split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave publishes stable images with qcow2 + a stable GCP image family. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Agent VMs boot the full container stack now. Every PR push exercises podman + ollama + openclaw end-to-end; every main merge reproduces the same chain on dd-local-prod with the H100.
dd-local-preview, CPU): qwen2.5:0.5b (~400 MB).dd-local-prod, GPU): qwen2.5:7b (~4.4 GB) on the H100 NVL.Changes
apps/_infra/local-agents.sh: replace inlinejq -c -nworkload literals with abake()helper that reads fromapps/<name>/workload.{json,json.tmpl}. Agent workload set grows from{nv, mount-models, cloudflared, dd-agent}→{nv (prod only), mount-models, podman-static, podman-bootstrap, ollama.{prod,preview}, openclaw, cloudflared, dd-agent}.apps/podman-bootstrap/workload.json: install the wrapper aspodman(notdd-podman) so barepodman psfrom PATH works instead of failing withmkdir /var/lib/containers: read-only file system. Raw binary moves to.podman-raw;dd-podmanbecomes a symlink for back-compat with openclaw's spec.bake()in bothdeploy-cp.ymlandlocal-agents.shnow passes envsubst a restricted var list (uppercase${VAR}references only). Lowercase$i/$((…))inside openclaw'suntilloop no longer get eaten.apps/README.md: canonical reference for the workload spec, lifecycle matrix (CP / preview agent / prod agent), ordering pattern (untilpolling), and a "deploying your own" walkthrough. Main README points at it.Stacked on #132.
Test plan
deploy-previewbrings uppr-N.devopsdefender.com,dd-local-previewrelaunches with the full workload set, re-registers. PR goes green.deploy-productioncascade runs: dd-local-prod relaunches with GPU workload + qwen2.5:7b.podman psinside a dd-local-{kind} guest returns the running ollama container (no more read-only-fs error).curl -H "Authorization: Bearer $PAT" https://<agent>/exec -d '{"cmd":["podman","exec","ollama","curl","-fsS","http://127.0.0.1:18789/healthz"],"timeout_secs":15}'returns exit 0 after first-boot npm install settles.curl … /exec -d '{"cmd":["podman","exec","ollama","openclaw","agent","--message","ping","--thinking","low"],"timeout_secs":120}'returns exit 0 with a non-empty reply.Follow-ups (not in this PR)
apps/README.mdin marketing voice + "this is what pr-N.devopsdefender.com is running right now" link.🤖 Generated with Claude Code