feat(workloads): unify boot + runtime workloads under apps/<name>/workload.json#127
Closed
posix4e wants to merge 5 commits into
Closed
feat(workloads): unify boot + runtime workloads under apps/<name>/workload.json#127posix4e wants to merge 5 commits into
posix4e wants to merge 5 commits into
Conversation
DD preview readyURL: https://pr-127.devopsdefender.com Browser login: paste CLI / curl: Register endpoint for a local agent: |
f8ae520 to
0607aba
Compare
0607aba to
2621974
Compare
2621974 to
ae6cdd4
Compare
ae6cdd4 to
f158c2c
Compare
f158c2c to
60d089f
Compare
60d089f to
eb00df2
Compare
eb00df2 to
04eca6e
Compare
04eca6e to
9053b00
Compare
9053b00 to
ea495ca
Compare
ea495ca to
c3e79f1
Compare
c3e79f1 to
b025e6d
Compare
b025e6d to
9f10674
Compare
posix4e
added a commit
that referenced
this pull request
Apr 18, 2026
Three months of iteration have solidified how DD actually deploys
and measures things. Update the landing page so it matches:
* **How It Works** steps now cover: Declare a workload JSON spec,
Deploy via the composite action or bake as a boot workload,
Attest & run (TDX + Intel ITA verify at /register; live fleet
metrics). Less "push a container image"; more "drop a JSON".
* **Deploy example** switches to inline `deploy-spec-inline` YAML
heredoc (file form stays commented out above). Follows the
ergonomic default we land in the composite action. Adds a line
naming the on-disk convention (`apps/<name>/workload.json`) so
users know where to put long specs.
* **Features** cards:
- Fleet Management → Fleet Metrics — mention per-disk capacity
and per-NIC rx/tx (PR #126) plus in-browser terminal.
- API-Driven Deploys → Workloads as JSON — leads with the apps/
tree as the single source of truth.
- New: Signed Releases — Sigstore-backed GitHub attestations on
every published binary (PR #125); `gh attestation verify`
proves provenance.
- Cloudflare Tunnels — drop "dd-register" in favour of "the CP".
* **Architecture diagram** re-organised:
- easyenclave spawns workloads from apps/*/workload.json.
- Podman shows up as its own workload step, with ollama + openclaw
running as a container on top — matches the current reality
(PR #127's boot-workload chain).
- Subtitle becomes "1 binary, 2 modes, workloads as code".
* **Code-block `.c` class** added to style.css — renders commented
lines dim/italic so the file-vs-inline split in the example reads
cleanly.
* Powered-by EasyEnclave callout + footer links unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 task
9f10674 to
b012f36
Compare
9cb74ba to
9c47f44
Compare
9c47f44 to
4c34421
Compare
4c34421 to
36b6297
Compare
36b6297 to
67fd912
Compare
67fd912 to
1fddeee
Compare
…kload.json
Every DD workload is now described by a single JSON file under
apps/. Three scripts previously emitted jq-interpolated workload
specs in three different styles at three lifecycle points:
local-agents.sh — nv, mount-models, cloudflared, dd-agent (config.iso)
gcp-deploy.sh — cloudflared, dd-management (GCE metadata)
ollama-deploy.sh — podman-static, ollama, openclaw (POST /deploy)
All three JSON schemas were identical (EE's DeployRequest), just
assembled differently. Unify:
* New `apps/` tree:
apps/nv/workload.json (prod-only nvidia insmod)
apps/mount-models/workload.json (mount /dev/vdc → /var/lib/…)
apps/cloudflared/workload.json (fetch-only)
apps/dd-agent/workload.json.tmpl (DD_CP_URL, DD_PAT, … substituted)
apps/dd-management/workload.json.tmpl (CP-side)
apps/podman-static/workload.json (fetch-only)
apps/podman-bootstrap/workload.json (stages binaries + writes dd-podman wrapper)
apps/ollama/workload.prod.json (with --device=/dev/nvidia*)
apps/ollama/workload.preview.json (CPU-only)
apps/openclaw/workload.json.tmpl (MODEL substituted)
* New `scripts/workloads.sh` helper with `bake` + `join` functions —
loads a .json or .json.tmpl file, envsubsts ${VAR} placeholders,
strips env entries with empty values (replaces the old `if
$gh_client_id == ""` jq conditional), and prints a JSON array.
* `scripts/local-agents.sh` and `scripts/gcp-deploy.sh` now call
`join` with the list of workload files for their VM flavour.
* Ollama + openclaw become BOOT workloads baked into config.iso.
Self-sequencing is encoded in each cmd script via `until` loops
that wait for the previous dep (podman bin exists → ollama
container responds → openclaw gateway starts). No runner-side
polling anymore.
* `.github/workflows/local-agents.yml` loses the `deploy ollama
(HTTPS)` step. Step 1 (SSH + virsh relaunch) is all there is.
* `scripts/ollama-deploy.sh` replaced by `scripts/redeploy-workload.sh`
— a 40-line tool that POSTs ONE baked workload to a live agent's
/deploy, for iterating on an apps/<x>/workload.json without
recreating the VM. Use:
DD_PAT=$(gh auth token) ./scripts/redeploy-workload.sh \
https://app.devopsdefender.com dd-local-prod \
apps/openclaw/workload.json.tmpl
One new workload now means one new apps/<name>/workload.json plus
a line in whichever script enumerates its lifecycle — no more
three copies in three styles.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1fddeee to
be193b9
Compare
dd-local-preview doesn't currently bake ollama/openclaw into its boot workloads (local-agents.sh only wires nv/mount-models/cloudflared/dd-agent), so the /healthz probe will always time out. Treat verify as signal-only and let SSH+relaunch gate the PR — full e2e integration of the apps/ tree on the preview VM is a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
…130) Preview and production share scripts/gcp-deploy.sh but each had its own job body in release.yml and production-deploy.yml — three copies of the same health-wait, STONITH, dashboard verify, drifting apart (preview already ran /cp/attest MRTD verify; prod didn't). Extract the common body into .github/workflows/deploy-cp.yml as a reusable workflow. release.yml deploy-preview and production-deploy.yml deploy both call it with env-specific inputs. Prod now runs the stronger MRTD attestation check preview already had, and every PR push exercises the exact code prod uses. Move the SSH+relaunch of dd-local-{kind} into a composite action .github/actions/relaunch-agent/ so deploy-cp.yml can cascade it directly. local-agents.yml shrinks to a workflow_dispatch-only entry point for operator-driven one-shots. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Release's deploy-preview and Production Deploy's deploy call deploy-cp.yml via `uses:` at the job level. When a reusable workflow declares `permissions:` on its job, the caller must grant the same set (or more) at the calling-job level — the callee's grant is the intersection of workflow-level caller grants and the callee's ask, but for the resulting permissions to match the callee must explicitly have them via the caller's job-level permissions block. The previous commit missed that, causing Release to startup-fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gcp-deploy.sh references DD_DOMAIN under set -u (for dd-management workload's DD_CF_DOMAIN env var). The pre-refactor release.yml / production-deploy.yml had DD_DOMAIN in a workflow-level env block; dropping that block left the variable unbound, failing the deploy step with "DD_DOMAIN: unbound variable". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6 tasks
posix4e
added a commit
that referenced
this pull request
Apr 18, 2026
Consolidates PRs #127, #131, #132 into one commit. Net -381 lines, one "Release" workflow drives the whole fleet lifecycle. Workload spec: apps/<name>/workload.{json,json.tmpl} becomes the single source of truth for every EE workload (cloudflared, dd-agent, dd-management, ollama, openclaw, nv, mount-models, podman-static, podman-bootstrap). Boot-time (config.iso / ee-config metadata) and runtime (/deploy) both bake from the same file. Workflow topology: release.yml is the one entry point. pull_request → build → deploy-preview → dd-local-preview relaunch push main → build → deploy-production → dd-local-prod relaunch push v* → build only (versioned artifact, no deploy) workflow_dispatch → build → deploy-production (rollback: release_tag input) .github/workflows/deploy-cp.yml is the reusable workflow both paths call, so preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies health + /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades agent relaunch. The cascade is blocking: a release goes green only when the matching dd-local-{kind} VM re-registers with the freshly-deployed CP (proves "everything works"). .github/actions/relaunch-agent/ is the composite action that SSHes into tdx2, runs dd-relaunch.sh, and polls /api/agents for the freshly-registered entry (5-min budget). Deleted: .github/workflows/production-deploy.yml (folded into release.yml as deploy-production job). .github/workflows/local-agents.yml (manual-dispatch path gone; push a commit to trigger a relaunch). .github/workflows/retire-staging.yml (one-shot, already run, no dd_env=staging VMs exist). scripts/ entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh deleted as unused after the refactor. Cleanup audit: gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a day after PR #121 merged — branch not deleted, so pr-teardown.yml never fired and cleanup.yml only reaps TERMINATED. Added a reap-merged-pr-previews job to cleanup.yml that resolves each RUNNING pr-N VM's PR state via gh and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead), trimmed workflow_run trigger (Production Deploy is gone). Deferred (blocked on easyenclave): easyenclave image family split — preview → easyenclave-staging, prod → easyenclave-stable. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today, so nothing to point prod at. Plus tdx2 auto-refresh of the base qcow2 from easyenclave releases. Follow-up PR once easyenclave publishes stable images. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
Member
Author
|
Folded into #132 (single-commit consolidation). |
posix4e
added a commit
that referenced
this pull request
Apr 18, 2026
…ollama+openclaw Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet path runs through one workflow, every agent VM boots the full container stack, and audit-surfaced cleanup gaps are closed. ## Workload spec — single source of truth apps/<name>/workload.{json,json.tmpl} is the one place a workload is defined. Boot-time (config.iso) and runtime (/deploy) bake from the same file. New apps/README.md documents the schema, the lifecycle matrix (CP / preview agent / prod agent), ordering via `until` polling, and a "deploying your own" walkthrough. Main README points at it. The bake helper (envsubst + jq strip-empty-env-entries) is inlined in two places — .github/workflows/deploy-cp.yml (CI, CP workloads) and apps/_infra/local-agents.sh (tdx2, agent workloads) — restricted to the uppercase ${VAR} refs each template declares so shell locals ($i, $((…))) in cmd strings aren't eaten. ## Workflow topology — one entry point release.yml drives everything: pull_request → build → deploy-preview → dd-local-preview relaunch push main → build → deploy-production → dd-local-prod relaunch push v* → build only (versioned artifact, no deploy) workflow_dispatch → build → deploy-production (rollback: release_tag input) .github/workflows/deploy-cp.yml is the reusable workflow both paths call. Preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades the agent relaunch — blocking on the agent re-registering with the CP (proves "everything works"). .github/actions/relaunch-agent/ is the composite that SSHes into tdx2, runs apps/_infra/dd-relaunch.sh, polls /api/agents for the freshly-registered entry (5-min budget). Retired workflows: production-deploy.yml → folded into release.yml as deploy-production local-agents.yml → no remaining purpose (cascade handles relaunch) retire-staging.yml → one-shot, already run scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused. ## Container stack on agent VMs dd-local-{preview,prod} now boot the full chain: nv (prod only) → mount-models → podman-static → podman-bootstrap → ollama (prod.json with GPU devices / preview.json CPU only) → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview) → cloudflared → dd-agent podman-bootstrap installs the wrapper as `podman` (not `dd-podman`), so bare `podman ps` from a guest shell reaches the right storage root instead of failing with `mkdir /var/lib/containers: read-only file system`. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat. ## Cleanup audit gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121 merged (branch not deleted → pr-teardown.yml never fired; cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews: for each RUNNING pr-N VM, gh-resolves the PR state and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead — no dd_env=staging VMs exist), trimmed the workflow_run trigger (Production Deploy is gone). ## Deferred (blocked on easyenclave) Preview → easyenclave-staging / prod → easyenclave-stable image-family split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave publishes stable images with qcow2 + a stable GCP image family. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
posix4e
added a commit
that referenced
this pull request
Apr 18, 2026
…ollama+openclaw (#132) Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet path runs through one workflow, every agent VM boots the full container stack, and audit-surfaced cleanup gaps are closed. ## Workload spec — single source of truth apps/<name>/workload.{json,json.tmpl} is the one place a workload is defined. Boot-time (config.iso) and runtime (/deploy) bake from the same file. New apps/README.md documents the schema, the lifecycle matrix (CP / preview agent / prod agent), ordering via `until` polling, and a "deploying your own" walkthrough. Main README points at it. The bake helper (envsubst + jq strip-empty-env-entries) is inlined in two places — .github/workflows/deploy-cp.yml (CI, CP workloads) and apps/_infra/local-agents.sh (tdx2, agent workloads) — restricted to the uppercase ${VAR} refs each template declares so shell locals ($i, $((…))) in cmd strings aren't eaten. ## Workflow topology — one entry point release.yml drives everything: pull_request → build → deploy-preview → dd-local-preview relaunch push main → build → deploy-production → dd-local-prod relaunch push v* → build only (versioned artifact, no deploy) workflow_dispatch → build → deploy-production (rollback: release_tag input) .github/workflows/deploy-cp.yml is the reusable workflow both paths call. Preview CI exercises the exact code prod uses. It provisions the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH, comments on PR, and cascades the agent relaunch — blocking on the agent re-registering with the CP (proves "everything works"). .github/actions/relaunch-agent/ is the composite that SSHes into tdx2, runs apps/_infra/dd-relaunch.sh, polls /api/agents for the freshly-registered entry (5-min budget). Retired workflows: production-deploy.yml → folded into release.yml as deploy-production local-agents.yml → no remaining purpose (cascade handles relaunch) retire-staging.yml → one-shot, already run scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused. ## Container stack on agent VMs dd-local-{preview,prod} now boot the full chain: nv (prod only) → mount-models → podman-static → podman-bootstrap → ollama (prod.json with GPU devices / preview.json CPU only) → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview) → cloudflared → dd-agent podman-bootstrap installs the wrapper as `podman` (not `dd-podman`), so bare `podman ps` from a guest shell reaches the right storage root instead of failing with `mkdir /var/lib/containers: read-only file system`. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat. ## Cleanup audit gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121 merged (branch not deleted → pr-teardown.yml never fired; cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews: for each RUNNING pr-N VM, gh-resolves the PR state and tears down VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead — no dd_env=staging VMs exist), trimmed the workflow_run trigger (Production Deploy is gone). ## Deferred (blocked on easyenclave) Preview → easyenclave-staging / prod → easyenclave-stable image-family split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave publishes stable images with qcow2 + a stable GCP image family. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three scripts were emitting the same JSON schema (EE's
DeployRequest) in three different jq styles at three different lifecycle points:scripts/local-agents.shEE_BOOT_WORKLOADSscripts/gcp-deploy.shEE_BOOT_WORKLOADSscripts/ollama-deploy.sh/deployover HTTPSOne source of truth per app; all consumers go through the same helper.
Changes
apps/<name>/workload.json(plain) or.json.tmpl(with${VAR}placeholders) — one file per app.scripts/workloads.sh—bake+joinhelpers. Loads a file, envsubsts placeholders, strips env entries with empty values (replaces the oldif $gh_client_id == "" then [] else …jq conditional), emits a JSON array.scripts/local-agents.sh+scripts/gcp-deploy.shnow calljoinwith a list of app files.untilloops in each cmd (podman bin exists → ollama responds → openclaw starts). No runner-side polling anymore..github/workflows/local-agents.ymlloses thedeploy ollama (HTTPS)step — step 1 (SSH + virsh relaunch) is the whole workflow now.scripts/ollama-deploy.sh→ deleted. Replaced byscripts/redeploy-workload.sh, a 40-line tool that POSTs one baked workload to a live agent for iteration.dd-podmanwrapper script generated at bootstrap time — all podman CLI flags (--conmon,--root,--runroot,--runtime,--cgroup-manager) baked in.apps/ollama/workload.{prod,preview}.jsonandapps/openclaw/workload.json.tmpljust calldd-podman run/dd-podman exec.Why this matters
apps/<name>/workload.jsonplus one line in whichever lifecycle script starts it. No more three jq blocks in three styles.gcp-deploy.shinclude those workloads — closes the "why didn't we see openclaw on pr-N" gap.apps/myapp/workload.jsoninto their repo and reference it from the restoreddeploy-workloadcomposite action (the webpage example works verbatim).Test plan
Local Agentspush-run → SSH+relaunch → VM boots with all 8 workloads (nv, mount-models, podman-static, podman-bootstrap, ollama, openclaw, cloudflared, dd-agent) in config.iso.virsh consoleoutput showsnv: loaded→mount-models: ok→podman-bootstrap: ok→ollama serve→openclawgateway, all from boot workloads. No HTTPS ollama-deploy step./agent/<id>on app.devopsdefender.com lists all 8 apps under Workloads.scripts/redeploy-workload.shre-posts a single app without VM restart (smoke test with a trivial spec change on openclaw).gh workflow run local-agents.yml -f kind=preview) still pick the CPU-only ollama spec fromapps/ollama/workload.preview.json.Hazards
untilloops in cmd. Wastes a few seconds of CPU at boot; safe because each check is idempotent.apps/openclaw/workload.json.tmpl—scripts/local-agents.shsets it fromwith_gpu.origin/mainon the user's host before rebuilding.🤖 Generated with Claude Code