feat(apps): wire ollama+openclaw into dd-local-{preview,prod}#133

Closed
posix4e wants to merge 2 commits into chore/cleanup-audit from feat/wire-containers

@posix4e posix4e commented Apr 18, 2026

Summary

Agent VMs boot the full container stack now. Every PR push exercises podman + ollama + openclaw end-to-end; every main merge reproduces the same chain on dd-local-prod with the H100.

  • Preview (dd-local-preview, CPU): qwen2.5:0.5b (~400 MB).
  • Prod (dd-local-prod, GPU): qwen2.5:7b (~4.4 GB) on the H100 NVL.

Changes

  • apps/_infra/local-agents.sh: replace inline jq -c -n workload literals with a bake() helper that reads from apps/<name>/workload.{json,json.tmpl}. Agent workload set grows from {nv, mount-models, cloudflared, dd-agent} to {nv (prod only), mount-models, podman-static, podman-bootstrap, ollama.{prod,preview}, openclaw, cloudflared, dd-agent}.
  • apps/podman-bootstrap/workload.json: install the wrapper as podman (not dd-podman) so bare podman ps from PATH works instead of failing with mkdir /var/lib/containers: read-only file system. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat with openclaw's spec.
  • bake() in both deploy-cp.yml and local-agents.sh now passes envsubst a restricted variable list (only the uppercase ${VAR} references the template declares). Lowercase shell locals such as $i and $((…)) inside openclaw's until loop are left intact instead of being substituted away.
  • New apps/README.md: canonical reference for the workload spec, lifecycle matrix (CP / preview agent / prod agent), ordering pattern (until polling), and a "deploying your own" walkthrough. Main README points at it.
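
The restricted substitution in bake() can be illustrated with a small sketch. This is not the helper from the PR (that one is shell and calls envsubst, which accepts a list of variable references as its argument); it is a Python stand-in for the same idea. The `bake_template` name, the `MODEL` variable, and the sample command string are hypothetical.

```python
import re

# Matches only brace-delimited, uppercase references such as ${MODEL}.
# Bare $i and arithmetic $((...)) never match this pattern.
_REF = re.compile(r"\$\{([A-Z_][A-Z0-9_]*)\}")


def bake_template(text: str, allowed: set[str], env: dict[str, str]) -> str:
    """Substitute ${NAME} only when NAME is in `allowed`; leave the rest as-is."""

    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name in allowed:
            # An allowed but unset variable becomes the empty string.
            return env.get(name, "")
        return match.group(0)

    return _REF.sub(repl, text)


# Hypothetical cmd string with a shell loop counter and one template variable.
template = 'until [ $i -ge 30 ]; do i=$((i+1)); done; ollama pull ${MODEL}'
baked = bake_template(template, {"MODEL"}, {"MODEL": "qwen2.5:0.5b"})
print(baked)
# until [ $i -ge 30 ]; do i=$((i+1)); done; ollama pull qwen2.5:0.5b
```

Only the declared variable is replaced; the loop counter and the arithmetic expansion reach the guest shell unchanged.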

Stacked on #132.

Test plan

  • This PR's Release cascade: deploy-preview brings up pr-N.devopsdefender.com, dd-local-preview relaunches with the full workload set, re-registers. PR goes green.
  • After merge chain lands on main, deploy-production cascade runs: dd-local-prod relaunches with GPU workload + qwen2.5:7b.
  • podman ps inside a dd-local-{kind} guest returns the running ollama container (no more read-only-fs error).
  • Openclaw health (observational, not a release gate): curl -H "Authorization: Bearer $PAT" https://<agent>/exec -d '{"cmd":["podman","exec","ollama","curl","-fsS","http://127.0.0.1:18789/healthz"],"timeout_secs":15}' returns exit 0 after first-boot npm install settles.
  • Model inference smoke (prod, manual): curl … /exec -d '{"cmd":["podman","exec","ollama","openclaw","agent","--message","ping","--thinking","low"],"timeout_secs":120}' returns exit 0 with a non-empty reply.
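
For scripting the two /exec checks above, a minimal sketch of building the request follows. The body shape (`cmd`, `timeout_secs`) and the bearer header come from the curl commands in the test plan; the `build_exec_request` name and the placeholder host are assumptions, and the response format is not described in this PR, so the sketch stops short of sending anything.

```python
import json
import urllib.request


def build_exec_request(
    agent_url: str, token: str, cmd: list[str], timeout_secs: int
) -> urllib.request.Request:
    """Build (but do not send) a POST to <agent>/exec mirroring the curl call."""
    body = json.dumps({"cmd": cmd, "timeout_secs": timeout_secs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{agent_url.rstrip('/')}/exec",
        data=body,
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


# Placeholder host and token; the health command is the one from the test plan.
req = build_exec_request(
    "https://agent.example.invalid",
    "PAT",
    ["podman", "exec", "ollama", "curl", "-fsS", "http://127.0.0.1:18789/healthz"],
    15,
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would issue the call; how the exit code is reported in the response is left to the agent's actual API.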

Follow-ups (not in this PR)

  • gh-pages walkthrough reproducing apps/README.md in marketing voice + "this is what pr-N.devopsdefender.com is running right now" link.
  • Image-family split preview→staging / prod→stable, tdx2 base-image auto-refresh from easyenclave releases. Blocked on easyenclave publishing stable qcow2 assets.

🤖 Generated with Claude Code

posix4e and others added 2 commits April 18, 2026 18:12
Consolidates PRs #127, #131, #132 into one commit. Net -381 lines,
one "Release" workflow drives the whole fleet lifecycle.

Workload spec:
  apps/<name>/workload.{json,json.tmpl} becomes the single source of
  truth for every EE workload (cloudflared, dd-agent, dd-management,
  ollama, openclaw, nv, mount-models, podman-static, podman-bootstrap).
  Boot-time (config.iso / ee-config metadata) and runtime (/deploy)
  both bake from the same file.

Workflow topology:
  release.yml is the one entry point.
    pull_request      → build → deploy-preview      → dd-local-preview relaunch
    push main         → build → deploy-production   → dd-local-prod    relaunch
    push v*           → build only (versioned artifact, no deploy)
    workflow_dispatch → build → deploy-production   (rollback: release_tag input)
  .github/workflows/deploy-cp.yml is the reusable workflow both paths
  call, so preview CI exercises the exact code prod uses. It provisions
  the GCP CP VM, verifies health + /cp/attest MRTD + dashboard + STONITH,
  comments on PR, and cascades agent relaunch. The cascade is blocking:
  a release goes green only when the matching dd-local-{kind} VM
  re-registers with the freshly-deployed CP (proves "everything works").
  .github/actions/relaunch-agent/ is the composite action that SSHes
  into tdx2, runs dd-relaunch.sh, and polls /api/agents for the
  freshly-registered entry (5-min budget).
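
The blocking poll with a 5-minute budget can be sketched as below. This is an illustration of the pattern, not the relaunch-agent action itself: the `wait_for_agent` name, the 5-second interval, and the `name` field in the agent entries are assumptions, since the /api/agents response shape is not shown here. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Iterable


def wait_for_agent(
    fetch_agents: Callable[[], Iterable[dict]],
    is_fresh: Callable[[dict], bool],
    budget_secs: float = 300.0,   # the 5-min budget from the text
    interval_secs: float = 5.0,   # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until a fresh agent entry appears or the budget runs out."""
    deadline = clock() + budget_secs
    while True:
        if any(is_fresh(agent) for agent in fetch_agents()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_secs)


# Simulated run: the agent shows up on the third poll.
responses = iter([[], [], [{"name": "dd-local-preview"}]])
now = [0.0]
ok = wait_for_agent(
    fetch_agents=lambda: next(responses),
    is_fresh=lambda agent: agent.get("name") == "dd-local-preview",
    clock=lambda: now[0],
    sleep=lambda secs: now.__setitem__(0, now[0] + secs),
)
print(ok)  # True
```

A `False` return is what would turn the release red in this sketch, matching the "cascade is blocking" behaviour described above.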

Deleted:
  .github/workflows/production-deploy.yml (folded into release.yml
    as deploy-production job).
  .github/workflows/local-agents.yml (manual-dispatch path gone;
    push a commit to trigger a relaunch).
  .github/workflows/retire-staging.yml (one-shot, already run,
    no dd_env=staging VMs exist).
  scripts/ entirely. gcp-deploy.sh inlined into deploy-cp.yml;
  dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side);
  ollama-deploy.sh, redeploy-workload.sh, workloads.sh deleted as
  unused after the refactor.

Cleanup audit:
  gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a
  day after PR #121 merged — branch not deleted, so pr-teardown.yml
  never fired and cleanup.yml only reaps TERMINATED. Added a
  reap-merged-pr-previews job to cleanup.yml that resolves each
  RUNNING pr-N VM's PR state via gh and tears down VM + CF tunnel
  + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead),
  trimmed workflow_run trigger (Production Deploy is gone).
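
The reap decision can be sketched as a pure function. The VM name pattern is inferred from the single example above (dd-pr-121-1776434711), and `vms_to_reap` plus the `pr_state` callback are hypothetical; in the real job the state lookup is done via gh, and the teardown of VM, CF tunnel, and DNS CNAME is not modelled here.

```python
import re
from typing import Callable, Iterable

# Assumed naming scheme: dd-pr-<PR number>-<numeric suffix>.
_PR_VM = re.compile(r"^dd-pr-(\d+)-\d+$")


def vms_to_reap(
    running_vms: Iterable[str], pr_state: Callable[[int], str]
) -> list[str]:
    """Return RUNNING preview VMs whose PR is already MERGED or CLOSED."""
    reap = []
    for name in running_vms:
        match = _PR_VM.match(name)
        if match is None:
            continue  # not a PR preview VM, never touched
        if pr_state(int(match.group(1))) in {"MERGED", "CLOSED"}:
            reap.append(name)
    return reap


# Stand-in for the gh lookup; PR numbers other than 121 are made up.
states = {121: "MERGED", 140: "OPEN", 141: "CLOSED"}
running = ["dd-pr-121-1776434711", "dd-pr-140-1776500000",
           "dd-pr-141-1776500001", "dd-local-prod"]
print(vms_to_reap(running, lambda n: states[n]))
# ['dd-pr-121-1776434711', 'dd-pr-141-1776500001']
```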

Deferred (blocked on easyenclave):
  easyenclave image family split — preview → easyenclave-staging,
  prod → easyenclave-stable. easyenclave has no stable GCP family
  and no qcow2 on v0.1.14 today, so nothing to point prod at. Plus
  tdx2 auto-refresh of the base qcow2 from easyenclave releases.
  Follow-up PR once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…README.md

Agent VMs boot the full container stack now — every PR push and every
main merge exercises podman + ollama + openclaw end-to-end. Preview
runs CPU inference with qwen2.5:0.5b; prod runs GPU inference with
qwen2.5:7b on the H100.

Changes:
  - apps/_infra/local-agents.sh: replace the inline `jq -c -n` workload
    literals with a `bake()` helper that reads from
    apps/<name>/workload.{json,json.tmpl}. The workload set grows from
    {nv, mount-models, cloudflared, dd-agent} to
    {nv (prod only), mount-models, podman-static, podman-bootstrap,
     ollama.{prod,preview}.json, openclaw, cloudflared, dd-agent}.

  - apps/podman-bootstrap: install the wrapper as `podman` (not
    `dd-podman`) so bare `podman ps` from PATH reaches the right
    storage root instead of erroring with `mkdir /var/lib/containers:
    read-only file system`. Raw binary moves to .podman-raw; dd-podman
    becomes a symlink for back-compat.

  - deploy-cp.yml + local-agents.sh bake() now pass envsubst a
    restricted var list — only the uppercase `${VAR}` references the
    template actually declares. Lowercase shell locals ($i, $((…)))
    inside openclaw's `until` loop are no longer eaten.

  - apps/README.md: canonical reference for the workload spec, lifecycle
    matrix (CP / preview agent / prod agent), and a "deploying your own"
    walkthrough. Main README.md points at it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

DD preview ready

URL: https://pr-133.devopsdefender.com

Browser login: paste gh auth token output at https://pr-133.devopsdefender.com/auth/pat

CLI / curl: curl -H "Authorization: Bearer $(gh auth token)" https://pr-133.devopsdefender.com/

Register endpoint for a local agent: wss://pr-133.devopsdefender.com/register

@posix4e posix4e force-pushed the chore/cleanup-audit branch from eaf40d3 to 19f52b7 April 18, 2026 20:31
posix4e added a commit that referenced this pull request Apr 18, 2026
…ollama+openclaw

Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet
path runs through one workflow, every agent VM boots the full
container stack, and audit-surfaced cleanup gaps are closed.

## Workload spec — single source of truth

apps/<name>/workload.{json,json.tmpl} is the one place a workload is
defined. Boot-time (config.iso) and runtime (/deploy) bake from the
same file. New apps/README.md documents the schema, the lifecycle
matrix (CP / preview agent / prod agent), ordering via `until` polling,
and a "deploying your own" walkthrough. Main README points at it.

The bake helper (envsubst + jq strip-empty-env-entries) is inlined in
two places — .github/workflows/deploy-cp.yml (CI, CP workloads) and
apps/_infra/local-agents.sh (tdx2, agent workloads) — restricted to
the uppercase ${VAR} refs each template declares so shell locals
($i, $((…))) in cmd strings aren't eaten.
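
The second half of the bake helper, stripping empty env entries, might look like the sketch below. The workload schema is not shown in this PR, so the `env` key holding `KEY=VALUE` strings is purely an assumption, as are the `strip_empty_env` name and the sample values; the real helper does this step in jq.

```python
import json


def strip_empty_env(workload: dict) -> dict:
    """Drop env entries of the form 'KEY=' (empty value) from a baked workload.

    Assumes env is a list of 'KEY=VALUE' strings; other fields pass through.
    """
    cleaned = dict(workload)
    kept = []
    for entry in workload.get("env", []):
        _key, sep, value = entry.partition("=")
        if sep and value == "":
            continue  # the template variable expanded to nothing
        kept.append(entry)
    cleaned["env"] = kept
    return cleaned


# Hypothetical baked output where one optional variable was unset.
baked = json.loads(
    '{"cmd": ["ollama", "serve"], "env": ["MODEL=qwen2.5:0.5b", "EXTRA_FLAGS="]}'
)
print(json.dumps(strip_empty_env(baked)))
# {"cmd": ["ollama", "serve"], "env": ["MODEL=qwen2.5:0.5b"]}
```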

## Workflow topology — one entry point

release.yml drives everything:
  pull_request      → build → deploy-preview    → dd-local-preview relaunch
  push main         → build → deploy-production → dd-local-prod relaunch
  push v*           → build only (versioned artifact, no deploy)
  workflow_dispatch → build → deploy-production (rollback: release_tag input)
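
Read as a lookup, the trigger table above amounts to the following sketch. The `release_plan` function and the step labels are illustrative only; the real routing lives in release.yml, and the `refs/heads/main` and `refs/tags/v…` ref strings are the standard Git ref forms assumed for "push main" and "push v*".

```python
def release_plan(event: str, ref: str = "") -> list[str]:
    """Return the ordered steps a trigger would run, per the table above."""
    if event == "pull_request":
        return ["build", "deploy-preview", "dd-local-preview relaunch"]
    if event == "push" and ref == "refs/heads/main":
        return ["build", "deploy-production", "dd-local-prod relaunch"]
    if event == "push" and ref.startswith("refs/tags/v"):
        return ["build"]  # versioned artifact only, no deploy
    if event == "workflow_dispatch":
        return ["build", "deploy-production"]  # rollback via release_tag input
    return []


print(release_plan("pull_request"))
print(release_plan("push", "refs/tags/v1.2.3"))
```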

.github/workflows/deploy-cp.yml is the reusable workflow both paths
call. Preview CI exercises the exact code prod uses. It provisions
the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH,
comments on PR, and cascades the agent relaunch — blocking on the
agent re-registering with the CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite action that SSHes into
tdx2, runs apps/_infra/dd-relaunch.sh, and polls /api/agents for the
freshly-registered entry (5-min budget).

Retired workflows:
  production-deploy.yml → folded into release.yml as deploy-production
  local-agents.yml      → no remaining purpose (cascade handles relaunch)
  retire-staging.yml    → one-shot, already run

scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml;
dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side);
ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused.

## Container stack on agent VMs

dd-local-{preview,prod} now boot the full chain:
  nv (prod only) → mount-models → podman-static → podman-bootstrap
    → ollama (prod.json with GPU devices / preview.json CPU only)
    → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview)
    → cloudflared → dd-agent
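
The per-kind chain can be written out as a short sketch. The workload names, the prod-only `nv` step, and the model tags are taken from the text above; the `agent_workloads` function and the `MODELS` table are hypothetical helpers, not code from local-agents.sh.

```python
# Model each agent kind runs, per the chain description.
MODELS = {"preview": "qwen2.5:0.5b", "prod": "qwen2.5:7b"}


def agent_workloads(kind: str) -> list[str]:
    """Return the ordered workload chain for dd-local-<kind>."""
    if kind not in MODELS:
        raise ValueError(f"unknown agent kind: {kind!r}")
    chain = ["nv"] if kind == "prod" else []  # GPU driver step is prod only
    chain += [
        "mount-models",
        "podman-static",
        "podman-bootstrap",
        f"ollama.{kind}",  # prod.json with GPU devices, preview.json CPU only
        "openclaw",
        "cloudflared",
        "dd-agent",
    ]
    return chain


for kind in ("preview", "prod"):
    print(kind, MODELS[kind], agent_workloads(kind))
```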

podman-bootstrap installs the wrapper as `podman` (not `dd-podman`),
so bare `podman ps` from a guest shell reaches the right storage
root instead of failing with `mkdir /var/lib/containers: read-only
file system`. Raw binary moves to .podman-raw; dd-podman becomes a
symlink for back-compat.

## Cleanup audit

gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121
merged (branch not deleted → pr-teardown.yml never fired;
cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews:
for each RUNNING pr-N VM, gh-resolves the PR state and tears down
VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging
(dead — no dd_env=staging VMs exist), trimmed the workflow_run
trigger (Production Deploy is gone).

## Deferred (blocked on easyenclave)

Preview → easyenclave-staging / prod → easyenclave-stable image-family
split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable
GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave
publishes stable images with qcow2 + a stable GCP image family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

posix4e commented Apr 18, 2026

Folded into #132 (single-commit consolidation).

@posix4e posix4e closed this Apr 18, 2026
posix4e added a commit that referenced this pull request Apr 18, 2026
…ollama+openclaw (#132)

@posix4e posix4e deleted the feat/wire-containers branch April 18, 2026 21:57