feat(workloads): unify boot + runtime workloads under apps/<name>/workload.json #127

Closed
posix4e wants to merge 5 commits into main from feat/unify-workloads

Conversation

@posix4e (Member) commented Apr 17, 2026

Summary

Three scripts were emitting the same JSON schema (EE's DeployRequest) in three different jq styles at three different lifecycle points:

| script | workloads | delivery |
| --- | --- | --- |
| scripts/local-agents.sh | nv, mount-models, cloudflared, dd-agent | config.iso EE_BOOT_WORKLOADS |
| scripts/gcp-deploy.sh | cloudflared, dd-management | GCE metadata EE_BOOT_WORKLOADS |
| scripts/ollama-deploy.sh | podman-static, ollama, openclaw | POST /deploy over HTTPS |

One source of truth per app; all consumers go through the same helper.

Changes

  • New apps/<name>/workload.json (plain) or .json.tmpl (with ${VAR} placeholders) — one file per app.
  • New scripts/workloads.sh with bake + join helpers. Loads a file, envsubsts placeholders, strips env entries with empty values (replaces the old if $gh_client_id == "" then [] else … jq conditional), emits a JSON array.
  • scripts/local-agents.sh + scripts/gcp-deploy.sh now call join with a list of app files.
  • Ollama + openclaw become boot workloads. Self-sequencing via until loops in each cmd (podman bin exists → ollama responds → openclaw starts). No runner-side polling anymore.
  • .github/workflows/local-agents.yml loses the deploy ollama (HTTPS) step — step 1 (SSH + virsh relaunch) is the whole workflow now.
  • scripts/ollama-deploy.sh → deleted. Replaced by scripts/redeploy-workload.sh, a 40-line tool that POSTs one baked workload to a live agent for iteration.
  • dd-podman wrapper script generated at bootstrap time — all podman CLI flags (--conmon, --root, --runroot, --runtime, --cgroup-manager) baked in. apps/ollama/workload.{prod,preview}.json and apps/openclaw/workload.json.tmpl just call dd-podman run / dd-podman exec.
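
The bake and join helpers described in the list above could look roughly like the sketch below. It assumes each workload file holds one JSON object whose optional `env` field is an array of `KEY=VALUE` strings; that field shape, the function bodies, and the file names in the usage line are illustrative assumptions, not the contents of scripts/workloads.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of bake + join; requires envsubst (gettext) and jq.
set -euo pipefail

# bake FILE: print one workload object with placeholders filled in.
bake() {
  local file=$1
  if [[ $file == *.tmpl ]]; then
    envsubst < "$file"      # substitute ${VAR} refs from the environment
  else
    cat "$file"             # plain .json needs no substitution
  fi |
    # Assumed shape: .env is an array of "KEY=VALUE" strings.
    # Drop entries whose value is empty, e.g. "GH_CLIENT_ID=".
    jq -c 'if has("env") then .env |= map(select(test("^[^=]*=$") | not)) else . end'
}

# join FILE...: bake every file and emit a single JSON array.
join() {
  local f
  for f in "$@"; do bake "$f"; done | jq -cs '.'
}
```

A caller would then build the boot payload with something like `EE_BOOT_WORKLOADS=$(join apps/cloudflared/workload.json apps/dd-agent/workload.json.tmpl)`. Dropping empty entries after substitution is what lets one template serve environments where an optional variable is unset.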

Why this matters

  • Adding a new app (openclaw today, whatever tomorrow) now means one new apps/<name>/workload.json plus one line in whichever lifecycle script starts it. No more three jq blocks in three styles.
  • PR previews can run ollama+openclaw by having gcp-deploy.sh include those workloads — closes the "why didn't we see openclaw on pr-N" gap.
  • External users can check their own apps/myapp/workload.json into their repo and reference it from the restored deploy-workload composite action (the webpage example works verbatim).

Test plan

  • Merge triggers Local Agents push-run → SSH+relaunch → VM boots with all 8 workloads (nv, mount-models, podman-static, podman-bootstrap, ollama, openclaw, cloudflared, dd-agent) in config.iso.
  • virsh console output shows nv: loaded, mount-models: ok, podman-bootstrap: ok, ollama serve, and openclaw gateway, all from boot workloads. No HTTPS ollama-deploy step.
  • /agent/<id> on app.devopsdefender.com lists all 8 apps under Workloads.
  • scripts/redeploy-workload.sh re-posts a single app without VM restart (smoke test with a trivial spec change on openclaw).
  • Preview PRs (via gh workflow run local-agents.yml -f kind=preview) still pick the CPU-only ollama spec from apps/ollama/workload.preview.json.

Hazards

  • Boot ordering depends on poll-waits. EE spawns workloads concurrently; dep chains rely on until loops in cmd. Wastes a few seconds of CPU at boot; safe because each check is idempotent.
  • MODEL env must be set when baking apps/openclaw/workload.json.tmpl. scripts/local-agents.sh sets it from with_gpu.
  • Cascading changes to base domain XML + dd-relaunch.sh are untouched — dd-relaunch.sh still pulls scripts from origin/main on the user's host before rebuilding.
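
The poll-wait ordering named in the first hazard can be sketched as a cmd script like the one below. The wrapper path and the `start_openclaw` body are hypothetical; only the `until` pattern comes from this PR. Port 11434 and `/api/tags` are ollama's defaults.

```shell
#!/usr/bin/env bash
# Hypothetical cmd for a workload that depends on two earlier ones.
set -euo pipefail

# wait_for DESCRIPTION CMD...: re-run CMD until it succeeds.
# The check must be idempotent, since it may run many times.
wait_for() {
  local what=$1; shift
  until "$@" >/dev/null 2>&1; do
    echo "waiting for $what" >&2
    sleep "${POLL_INTERVAL:-1}"
  done
}

start_openclaw() {
  wait_for "podman wrapper" test -x /opt/dd/bin/dd-podman          # assumed path
  wait_for "ollama API" curl -fsS http://127.0.0.1:11434/api/tags
  echo "dependencies ready"   # the real start command would go here
}
```

Because EE spawns workloads concurrently, each dependent workload carries its own waits and no external orchestrator has to know the order. The loop has no timeout, so a dependency that never comes up leaves the workload waiting indefinitely.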

🤖 Generated with Claude Code

@github-actions

DD preview ready

URL: https://pr-127.devopsdefender.com

Browser login: paste gh auth token output at https://pr-127.devopsdefender.com/auth/pat

CLI / curl: curl -H "Authorization: Bearer $(gh auth token)" https://pr-127.devopsdefender.com/

Register endpoint for a local agent: wss://pr-127.devopsdefender.com/register

@posix4e posix4e force-pushed the feat/unify-workloads branch from f8ae520 to 0607aba Compare April 18, 2026 00:32
@posix4e posix4e force-pushed the feat/unify-workloads branch from 0607aba to 2621974 Compare April 18, 2026 00:50
@posix4e posix4e force-pushed the feat/unify-workloads branch from 2621974 to ae6cdd4 Compare April 18, 2026 01:19
@posix4e posix4e force-pushed the feat/unify-workloads branch from ae6cdd4 to f158c2c Compare April 18, 2026 01:26
@posix4e posix4e force-pushed the feat/unify-workloads branch from f158c2c to 60d089f Compare April 18, 2026 02:22
@posix4e posix4e force-pushed the feat/unify-workloads branch from 60d089f to eb00df2 Compare April 18, 2026 10:34
@posix4e posix4e force-pushed the feat/unify-workloads branch from eb00df2 to 04eca6e Compare April 18, 2026 10:37
@posix4e posix4e force-pushed the feat/unify-workloads branch from 04eca6e to 9053b00 Compare April 18, 2026 10:41
@posix4e posix4e force-pushed the feat/unify-workloads branch from 9053b00 to ea495ca Compare April 18, 2026 10:49
@posix4e posix4e force-pushed the feat/unify-workloads branch from ea495ca to c3e79f1 Compare April 18, 2026 11:56
@posix4e posix4e force-pushed the feat/unify-workloads branch from c3e79f1 to b025e6d Compare April 18, 2026 12:09
@posix4e posix4e force-pushed the feat/unify-workloads branch from b025e6d to 9f10674 Compare April 18, 2026 12:31
posix4e added a commit that referenced this pull request Apr 18, 2026
Three months of iteration have solidified how DD actually deploys
and measures things. Update the landing page so it matches:

* **How It Works** steps now cover: Declare a workload JSON spec,
  Deploy via the composite action or bake as a boot workload,
  Attest & run (TDX + Intel ITA verify at /register; live fleet
  metrics). Less "push a container image"; more "drop a JSON".

* **Deploy example** switches to inline `deploy-spec-inline` YAML
  heredoc (file form stays commented out above). Follows the
  ergonomic default we land in the composite action. Adds a line
  naming the on-disk convention (`apps/<name>/workload.json`) so
  users know where to put long specs.

* **Features** cards:
  - Fleet Management → Fleet Metrics — mention per-disk capacity
    and per-NIC rx/tx (PR #126) plus in-browser terminal.
  - API-Driven Deploys → Workloads as JSON — leads with the apps/
    tree as the single source of truth.
  - New: Signed Releases — Sigstore-backed GitHub attestations on
    every published binary (PR #125); `gh attestation verify`
    proves provenance.
  - Cloudflare Tunnels — drop "dd-register" in favour of "the CP".

* **Architecture diagram** re-organised:
  - easyenclave spawns workloads from apps/*/workload.json.
  - Podman shows up as its own workload step, with ollama + openclaw
    running as containers on top, matching the current reality
    (PR #127's boot-workload chain).
  - Subtitle becomes "1 binary, 2 modes, workloads as code".

* **Code-block `.c` class** added to style.css — renders commented
  lines dim/italic so the file-vs-inline split in the example reads
  cleanly.

* Powered-by EasyEnclave callout + footer links unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@posix4e posix4e force-pushed the feat/unify-workloads branch from 9f10674 to b012f36 Compare April 18, 2026 13:21
…kload.json

Every DD workload is now described by a single JSON file under
apps/. Three scripts previously emitted jq-interpolated workload
specs in three different styles at three lifecycle points:

  local-agents.sh   — nv, mount-models, cloudflared, dd-agent  (config.iso)
  gcp-deploy.sh     — cloudflared, dd-management               (GCE metadata)
  ollama-deploy.sh  — podman-static, ollama, openclaw          (POST /deploy)

All three JSON schemas were identical (EE's DeployRequest), just
assembled differently. Unify:

* New `apps/` tree:
    apps/nv/workload.json                 (prod-only nvidia insmod)
    apps/mount-models/workload.json       (mount /dev/vdc → /var/lib/…)
    apps/cloudflared/workload.json        (fetch-only)
    apps/dd-agent/workload.json.tmpl      (DD_CP_URL, DD_PAT, … substituted)
    apps/dd-management/workload.json.tmpl (CP-side)
    apps/podman-static/workload.json      (fetch-only)
    apps/podman-bootstrap/workload.json   (stages binaries + writes dd-podman wrapper)
    apps/ollama/workload.prod.json        (with --device=/dev/nvidia*)
    apps/ollama/workload.preview.json     (CPU-only)
    apps/openclaw/workload.json.tmpl      (MODEL substituted)

* New `scripts/workloads.sh` helper with `bake` + `join` functions —
  loads a .json or .json.tmpl file, envsubsts ${VAR} placeholders,
  strips env entries with empty values (replaces the old `if
  $gh_client_id == ""` jq conditional), and prints a JSON array.

* `scripts/local-agents.sh` and `scripts/gcp-deploy.sh` now call
  `join` with the list of workload files for their VM flavour.

* Ollama + openclaw become BOOT workloads baked into config.iso.
  Self-sequencing is encoded in each cmd script via `until` loops
  that wait for the previous dep (podman bin exists → ollama
  container responds → openclaw gateway starts). No runner-side
  polling anymore.

* `.github/workflows/local-agents.yml` loses the `deploy ollama
  (HTTPS)` step. Step 1 (SSH + virsh relaunch) is all there is.

* `scripts/ollama-deploy.sh` replaced by `scripts/redeploy-workload.sh`
  — a 40-line tool that POSTs ONE baked workload to a live agent's
  /deploy, for iterating on an apps/<x>/workload.json without
  recreating the VM. Use:
    DD_PAT=$(gh auth token) ./scripts/redeploy-workload.sh \
      https://app.devopsdefender.com dd-local-prod \
      apps/openclaw/workload.json.tmpl

One new workload now means one new apps/<name>/workload.json plus
a line in whichever script enumerates its lifecycle — no more
three copies in three styles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dd-local-preview doesn't currently bake ollama/openclaw into its boot
workloads (local-agents.sh only wires nv/mount-models/cloudflared/dd-agent),
so the /healthz probe will always time out. Treat verify as signal-only
and let SSH+relaunch gate the PR — full e2e integration of the apps/
tree on the preview VM is a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
posix4e and others added 2 commits April 18, 2026 13:18
…130)

Preview and production share scripts/gcp-deploy.sh but each had its own
job body in release.yml and production-deploy.yml — three copies of the
same health-wait, STONITH, dashboard verify, drifting apart (preview
already ran /cp/attest MRTD verify; prod didn't).

Extract the common body into .github/workflows/deploy-cp.yml as a
reusable workflow. release.yml deploy-preview and production-deploy.yml
deploy both call it with env-specific inputs. Prod now runs the stronger
MRTD attestation check preview already had, and every PR push exercises
the exact code prod uses.

Move the SSH+relaunch of dd-local-{kind} into a composite action
.github/actions/relaunch-agent/ so deploy-cp.yml can cascade it
directly. local-agents.yml shrinks to a workflow_dispatch-only entry
point for operator-driven one-shots.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Release's deploy-preview and Production Deploy's deploy call
deploy-cp.yml via `uses:` at the job level. A reusable workflow can
only keep or reduce the permissions its caller passes down, never
elevate them. So when the callee's job declares `permissions:`, the
calling job must grant the same set (or more) in its own job-level
`permissions:` block. The previous commit missed that, causing
Release to startup-fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gcp-deploy.sh references DD_DOMAIN under set -u (for dd-management
workload's DD_CF_DOMAIN env var). The pre-refactor release.yml /
production-deploy.yml had DD_DOMAIN in a workflow-level env block;
dropping that block left the variable unbound, failing the deploy
step with "DD_DOMAIN: unbound variable".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
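
The unbound-variable failure in the commit above can be caught earlier with a guard at the top of the script. The sketch below is illustrative: `require_env` and its message are assumptions, not code from gcp-deploy.sh.

```shell
#!/usr/bin/env bash
set -u   # referencing an unset variable is now a fatal error

# require_env NAME...: fail early, naming every missing variable at once.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name:-} reads the variable called $name without tripping set -u.
    if [[ -z ${!name:-} ]]; then
      echo "missing required env var: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `require_env DD_DOMAIN || exit 1` before any other work reports the missing variable by name at startup, instead of failing partway through the deploy step.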
posix4e added a commit that referenced this pull request Apr 18, 2026
Consolidates PRs #127, #131, #132 into one commit. Net -381 lines,
one "Release" workflow drives the whole fleet lifecycle.

Workload spec:
  apps/<name>/workload.{json,json.tmpl} becomes the single source of
  truth for every EE workload (cloudflared, dd-agent, dd-management,
  ollama, openclaw, nv, mount-models, podman-static, podman-bootstrap).
  Boot-time (config.iso / ee-config metadata) and runtime (/deploy)
  both bake from the same file.

Workflow topology:
  release.yml is the one entry point.
    pull_request      → build → deploy-preview      → dd-local-preview relaunch
    push main         → build → deploy-production   → dd-local-prod    relaunch
    push v*           → build only (versioned artifact, no deploy)
    workflow_dispatch → build → deploy-production   (rollback: release_tag input)
  .github/workflows/deploy-cp.yml is the reusable workflow both paths
  call, so preview CI exercises the exact code prod uses. It provisions
  the GCP CP VM, verifies health + /cp/attest MRTD + dashboard + STONITH,
  comments on PR, and cascades agent relaunch. The cascade is blocking:
  a release goes green only when the matching dd-local-{kind} VM
  re-registers with the freshly-deployed CP (proves "everything works").
  .github/actions/relaunch-agent/ is the composite action that SSHes
  into tdx2, runs dd-relaunch.sh, and polls /api/agents for the
  freshly-registered entry (5-min budget).

Deleted:
  .github/workflows/production-deploy.yml (folded into release.yml
    as deploy-production job).
  .github/workflows/local-agents.yml (manual-dispatch path gone;
    push a commit to trigger a relaunch).
  .github/workflows/retire-staging.yml (one-shot, already run,
    no dd_env=staging VMs exist).
  scripts/ entirely. gcp-deploy.sh inlined into deploy-cp.yml;
  dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side);
  ollama-deploy.sh, redeploy-workload.sh, workloads.sh deleted as
  unused after the refactor.

Cleanup audit:
  gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a
  day after PR #121 merged — branch not deleted, so pr-teardown.yml
  never fired and cleanup.yml only reaps TERMINATED. Added a
  reap-merged-pr-previews job to cleanup.yml that resolves each
  RUNNING pr-N VM's PR state via gh and tears down VM + CF tunnel
  + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead),
  trimmed workflow_run trigger (Production Deploy is gone).

Deferred (blocked on easyenclave):
  easyenclave image family split — preview → easyenclave-staging,
  prod → easyenclave-stable. easyenclave has no stable GCP family
  and no qcow2 on v0.1.14 today, so nothing to point prod at. Plus
  tdx2 auto-refresh of the base qcow2 from easyenclave releases.
  Follow-up PR once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@posix4e (Member, Author) commented Apr 18, 2026

Folded into #132 (single-commit consolidation).

@posix4e posix4e closed this Apr 18, 2026
posix4e added a commit that referenced this pull request Apr 18, 2026
…ollama+openclaw

Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet
path runs through one workflow, every agent VM boots the full
container stack, and audit-surfaced cleanup gaps are closed.

## Workload spec — single source of truth

apps/<name>/workload.{json,json.tmpl} is the one place a workload is
defined. Boot-time (config.iso) and runtime (/deploy) bake from the
same file. New apps/README.md documents the schema, the lifecycle
matrix (CP / preview agent / prod agent), ordering via `until` polling,
and a "deploying your own" walkthrough. Main README points at it.

The bake helper (envsubst + jq strip-empty-env-entries) is inlined in
two places — .github/workflows/deploy-cp.yml (CI, CP workloads) and
apps/_infra/local-agents.sh (tdx2, agent workloads) — restricted to
the uppercase ${VAR} refs each template declares so shell locals
($i, $((…))) in cmd strings aren't eaten.
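
Restricting substitution to the uppercase refs can be sketched with the SHELL-FORMAT argument that envsubst accepts. The function names below are hypothetical, and the inlined helper in this commit may derive the variable list differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; requires envsubst (gettext) and grep.
set -euo pipefail

# template_vars FILE: list the distinct uppercase ${VAR} refs in FILE,
# formatted as the SHELL-FORMAT argument envsubst accepts.
template_vars() {
  grep -o '\${[A-Z_][A-Z0-9_]*}' "$1" | sort -u | tr '\n' ' ' || true
}

# bake_restricted FILE: substitute only those refs. $i, $((...)) and
# lowercase ${var} in cmd strings pass through untouched.
bake_restricted() {
  envsubst "$(template_vars "$1")" < "$1"
}
```

Without the SHELL-FORMAT argument, envsubst replaces every `$name` it sees, including loop counters inside a cmd string, with whatever the environment holds, usually the empty string.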

## Workflow topology — one entry point

release.yml drives everything:
  pull_request     → build → deploy-preview → dd-local-preview relaunch
  push main        → build → deploy-production → dd-local-prod relaunch
  push v*          → build only (versioned artifact, no deploy)
  workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths
call. Preview CI exercises the exact code prod uses. It provisions
the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH,
comments on PR, and cascades the agent relaunch — blocking on the
agent re-registering with the CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite that SSHes into
tdx2, runs apps/_infra/dd-relaunch.sh, polls /api/agents for the
freshly-registered entry (5-min budget).
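
The bounded poll in the composite action could be sketched as below. The `/api/agents` response shape (a JSON array with a `vm_name` field) and both function names are assumptions for illustration. A real check would also need to confirm the entry is fresh, for example by comparing a registration timestamp.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the response shape and jq filter are assumptions.
set -euo pipefail

# poll_until BUDGET_SECS INTERVAL_SECS CMD...: retry CMD until it succeeds
# or the budget runs out. Returns 1 on timeout.
poll_until() {
  local budget=$1 interval=$2; shift 2
  local deadline=$(( SECONDS + budget ))
  until "$@"; do
    (( SECONDS < deadline )) || return 1
    sleep "$interval"
  done
}

# agent_registered CP_URL VM_NAME: true if /api/agents lists VM_NAME.
agent_registered() {
  curl -fsS -H "Authorization: Bearer $DD_PAT" "$1/api/agents" |
    jq -e --arg n "$2" 'any(.[]; .vm_name == $n)' >/dev/null
}
```

With these, the 5-minute budget becomes `poll_until 300 5 agent_registered https://app.devopsdefender.com dd-local-prod`, and a non-zero exit fails the release job.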

Retired workflows:
  production-deploy.yml → folded into release.yml as deploy-production
  local-agents.yml      → no remaining purpose (cascade handles relaunch)
  retire-staging.yml    → one-shot, already run

scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml;
dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side);
ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused.

## Container stack on agent VMs

dd-local-{preview,prod} now boot the full chain:
  nv (prod only) → mount-models → podman-static → podman-bootstrap
    → ollama (prod.json with GPU devices / preview.json CPU only)
    → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview)
    → cloudflared → dd-agent

podman-bootstrap installs the wrapper as `podman` (not `dd-podman`),
so bare `podman ps` from a guest shell reaches the right storage
root instead of failing with `mkdir /var/lib/containers: read-only
file system`. Raw binary moves to .podman-raw; dd-podman becomes a
symlink for back-compat.
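
A wrapper generated this way might look like the sketch below. The directory arguments, the `cgroupfs` value, and the subset of flags shown are assumptions; the real bootstrap also bakes in `--conmon` and `--runtime`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of generating the wrapper; all paths are assumptions.
set -euo pipefail

# install_podman_wrapper BIN_DIR STATE_DIR
# Expects the raw podman binary at BIN_DIR/.podman-raw.
install_podman_wrapper() {
  local bin_dir=$1 state_dir=$2
  cat > "$bin_dir/podman" <<EOF
#!/bin/sh
# Generated wrapper: pins storage to a writable location so callers
# never fall back to the read-only /var/lib/containers default.
exec "$bin_dir/.podman-raw" \\
  --root "$state_dir/root" \\
  --runroot "$state_dir/runroot" \\
  --cgroup-manager cgroupfs \\
  "\$@"
EOF
  chmod +x "$bin_dir/podman"
  ln -sf podman "$bin_dir/dd-podman"   # back-compat name
}
```

Since the wrapper takes the `podman` name, workload specs and interactive shells reach the same storage root without repeating the flags.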

## Cleanup audit

gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121
merged (branch not deleted → pr-teardown.yml never fired;
cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews:
for each RUNNING pr-N VM, gh-resolves the PR state and tears down
VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging
(dead — no dd_env=staging VMs exist), trimmed the workflow_run
trigger (Production Deploy is gone).
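
The reap decision can be sketched as follows. The VM name pattern is inferred from the one instance named in this PR (dd-pr-121-1776434711), the gcloud filter is an assumption, and the teardown itself is left as a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the reap decision; the VM name filter is an assumption.
set -euo pipefail

# pr_number VM_NAME: extract N from names like dd-pr-121-1776434711.
pr_number() {
  [[ $1 =~ ^dd-pr-([0-9]+)(-|$) ]] && echo "${BASH_REMATCH[1]}"
}

# should_reap STATE: true once the PR is no longer open.
should_reap() {
  [[ $1 == MERGED || $1 == CLOSED ]]
}

reap_merged_previews() {
  local vm n state
  gcloud compute instances list \
      --filter='status=RUNNING AND name~^dd-pr-' --format='value(name)' |
  while read -r vm; do
    n=$(pr_number "$vm") || continue
    state=$(gh pr view "$n" --json state --jq .state)
    if should_reap "$state"; then
      echo "would tear down $vm (PR #$n is $state)"
      # VM, CF tunnel and DNS CNAME deletion would go here.
    fi
  done
}
```

Keying the decision on PR state rather than branch deletion covers the gap that left the PR #121 preview running after merge.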

## Deferred (blocked on easyenclave)

Preview → easyenclave-staging / prod → easyenclave-stable image-family
split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable
GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave
publishes stable images with qcow2 + a stable GCP image family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@posix4e posix4e deleted the feat/unify-workloads branch April 18, 2026 21:57