Merged
102 changes: 79 additions & 23 deletions apps/README.md
@@ -1,6 +1,15 @@
# apps/ — workload specs
# apps/ — worked example of a DD agent VM

This directory is DD's canonical reference for **how to deploy a workload**. Every directory here is one workload — a process easyenclave runs inside a TDX-sealed VM. The specs are both the live deployment configuration and the worked example for operators writing their own.
This directory is **a worked example**, not a bundle DD ships to users. Every
directory here is one easyenclave workload. Together they describe a complete
DD agent VM: the minimum infra to boot podman, run one demo container
(`web-nvidia-smi`), register with a control plane, and expose the demo on a
stable hostname.

The goal is to be the shortest legible "agent VM from scratch" that you can
copy and adapt. For orchestrating many workloads, assembling them from
templates, and the run / teardown lifecycle, see
[slopandmop](https://github.com/slopandmop/slopandmop).

## Layout

@@ -14,7 +23,8 @@ apps/

## What a workload looks like

A **workload** is a JSON object consumed by easyenclave's `DeployRequest` (see `src/easyenclave/src/workload.rs`). Minimum shape:
A **workload** is a JSON object consumed by easyenclave's `DeployRequest` (see
`src/easyenclave/src/workload.rs`). Minimum shape:

```json
{
@@ -23,7 +33,9 @@ A **workload** is a JSON object consumed by easyenclave's `DeployRequest` (see `
}
```

Add `github_release` to fetch a binary asset directly from a GitHub release — no OCI registry, no Dockerfile. The asset lands in `/var/lib/easyenclave/bin/` and is spawned by `cmd`:
Add `github_release` to fetch a binary asset directly from a GitHub release —
no OCI registry, no Dockerfile. The asset lands in `/var/lib/easyenclave/bin/`
and is spawned by `cmd`:

```json
{
@@ -44,15 +56,41 @@ Add `env` to inject config:
}
```

Add `expose` to ask DD to route a public hostname to a workload's port:

```json
{
"app_name": "web-nvidia-smi",
"expose": { "hostname_label": "gpu", "port": 8081 },
"cmd": [...]
}
```

At agent boot, `apps/_infra/local-agents.sh` collects every `expose` entry
into `DD_EXTRA_INGRESS`. dd-agent forwards them on `/register` and the CP
prepends them to the agent's cloudflared tunnel ingress. A workload declaring
`{"hostname_label": "gpu", "port": 8081}` becomes reachable at
`gpu.<agent-hostname>` — in addition to the default dashboard at
`<agent-hostname>`. easyenclave itself ignores the field; it's a DD-level
hint about tunnel routing.

Per-workload ingress is **boot-time only** today. Workloads POSTed later via
`/deploy` don't get auto-exposed — declare your exposure on boot workloads in
this tree.
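
As a rough illustration of the mapping described above, the sketch below previews which hostnames a set of baked workloads would request. It reuses the same jq filter shape as `extract_extra_ingress` in `apps/_infra/local-agents.sh`; the helper name `list_exposed_hosts` and the hostname `agent.example.com` are hypothetical and not part of the tree.

```shell
#!/usr/bin/env bash
# Sketch only: preview the public hostnames that boot workloads would request.
# `list_exposed_hosts` and `agent.example.com` are made-up names for
# illustration; the first jq filter mirrors extract_extra_ingress.

list_exposed_hosts() {
  # $1 = agent hostname; stdin = stream of baked workload JSON objects
  local agent_host="$1"
  # Collect every `expose` object into one compact array...
  jq -cs '[.[] | select(.expose) | .expose]' |
    # ...then print one "<label>.<agent-host> -> port N" line per entry.
    jq -r --arg host "$agent_host" \
      '.[] | "\(.hostname_label).\($host) -> port \(.port)"'
}

# Two sample workloads: only the first declares `expose`.
printf '%s\n' \
  '{"app_name":"web-nvidia-smi","expose":{"hostname_label":"gpu","port":8081},"cmd":["true"]}' \
  '{"app_name":"cloudflared","cmd":["true"]}' |
  list_exposed_hosts agent.example.com
```

Workloads without an `expose` field drop out at the `select(.expose)` step, so they never reach the tunnel ingress list.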

## Templates

Files ending in `.json.tmpl` carry `${VAR}` placeholders. At bake time:

1. `envsubst` substitutes every uppercase `${VAR}` that appears in the template using the caller's environment.
2. `jq` drops env-array entries whose value ended up empty (so you can make OAuth creds / optional secrets conditional by just leaving them unset).
1. `envsubst` substitutes every uppercase `${VAR}` that appears in the
template using the caller's environment.
2. `jq` drops env-array entries whose value ended up empty (so you can make
OAuth creds / optional secrets conditional by just leaving them unset).
3. The result is a plain `workload.json` ready for EE.

Only uppercase placeholders get substituted — shell locals like `$i` or `$((n+1))` inside `cmd` strings are left alone. The bake helper is duplicated inline in two places so both lifecycle points behave identically:
Only uppercase placeholders get substituted — shell locals like `$i` or
`$((n+1))` inside `cmd` strings are left alone. The bake helper is duplicated
inline in two places so both lifecycle points behave identically:

- `.github/workflows/deploy-cp.yml` (CI, for CP workloads)
- `apps/_infra/local-agents.sh` (tdx2 host, for agent VMs)
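
The bake steps above can be sketched roughly as follows. This is a hypothetical re-creation, not the helper from the repository: the function name `bake_sketch`, the sample template, and the variable `OPTIONAL_SECRET` are assumptions. It relies on GNU `envsubst` accepting a SHELL-FORMAT argument that limits which variables are substituted.

```shell
#!/usr/bin/env bash
# Sketch only: the real bake() lives inline in deploy-cp.yml and
# local-agents.sh and may differ in detail.

bake_sketch() {
  # $1 = path to a .json.tmpl file
  local tmpl="$1" vars
  # Collect only the ALL-CAPS ${VAR} names that occur in the template.
  vars=$(grep -oE '\$\{[A-Z_][A-Z0-9_]*\}' "$tmpl" | sort -u | tr '\n' ' ')
  # Pass them as envsubst's SHELL-FORMAT so shell locals like $i survive.
  envsubst "$vars" < "$tmpl" |
    # Drop "KEY=" env entries whose value came out empty.
    jq -c 'if .env then .env |= map(select(test("^[^=]+=$") | not)) else . end'
}

# Sample template (hypothetical): one required and one optional variable.
tmpl=$(mktemp)
cat > "$tmpl" <<'EOF'
{"app_name":"demo","cmd":["sh","-c","for i in 1 2; do echo $i; done"],"env":["DD_ENV=${DD_ENV}","OPTIONAL_SECRET=${OPTIONAL_SECRET}"]}
EOF
DD_ENV=preview OPTIONAL_SECRET= bake_sketch "$tmpl"
rm -f "$tmpl"
```

Restricting the substitution list to names found in the template is what keeps `$i` inside the `cmd` string intact, while the jq pass removes the `OPTIONAL_SECRET=` entry because its value is empty.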
@@ -62,26 +100,30 @@ Only uppercase placeholders get substituted — shell locals like `$i` or `$((n+
| workload | CP VM | agent VM (preview) | agent VM (prod) |
|---|---|---|---|
| `cloudflared` | ✅ | ✅ | ✅ |
| `dd-management` | ✅ | | |
| `dd-agent` | | ✅ | ✅ |
| `mount-models` | | ✅ | |
| `dd-management` | ✅ | | |
| `nv` | | | ✅ (GPU insmod) |
| `podman-static` | | ✅ | ✅ |
| `podman-bootstrap` | | ✅ | ✅ |
| `ollama` | | ✅ (CPU, preview.json) | ✅ (GPU, prod.json) |
| `openclaw` | | ✅ (qwen2.5:0.5b) | ✅ (qwen2.5:7b) |
| `web-nvidia-smi` | | | ✅ (`gpu.<agent-host>`) |

CP stays slim: just `cloudflared` + `dd-management`. Containerised LLM serving lives on agent VMs where the `vdc` ext4 disk holds models + image storage.
CP stays slim: just `cloudflared` + `dd-management`. Preview agent VMs run a
bare agent + podman for CI to prove registration end-to-end. Prod agent VMs
add the GPU insmod and the `web-nvidia-smi` demo on `gpu.<agent-host>`.

## Ordering

EasyEnclave spawns boot workloads concurrently — there's no declared dependency graph. Dependents self-sequence by polling for their prerequisites. Worked example from this tree:
EasyEnclave spawns boot workloads concurrently — there's no declared
dependency graph. Dependents self-sequence by polling for their prerequisites.
Worked examples from this tree:

- `podman-bootstrap` waits for `podman-static`'s tarball (`until [ -x $SRC/usr/local/bin/podman ]; do sleep 1; done`).
- `ollama`'s cmd waits for the wrapper (`until [ -x /var/lib/easyenclave/bin/podman ]; do sleep 2; done`).
- `openclaw`'s cmd waits for ollama's HTTP endpoint (`until wget -q -O- http://127.0.0.1:11434/api/tags; do sleep 5; done`) before pulling the model and launching the gateway.
- `podman-bootstrap` waits for `podman-static`'s tarball
(`until [ -x $SRC/usr/local/bin/podman ]; do sleep 1; done`).
- `web-nvidia-smi`'s cmd waits for the wrapper
(`until [ -x /var/lib/easyenclave/bin/podman ]; do sleep 2; done`).

Costs seconds of wasted polling at boot; easy to reason about; no workload-runner changes needed.
This costs seconds of wasted polling at boot, but it is easy to reason about
and needs no workload-runner changes.
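
One possible variation on the polling idiom adds a deadline, so a prerequisite that never appears fails loudly instead of spinning forever. This is a minimal sketch; `wait_for_executable` is a hypothetical helper and is not used by the workloads in this tree, which poll without a timeout.

```shell
#!/usr/bin/env bash
# Sketch only: the same `until` idiom as the boot workloads, plus a timeout.
# `wait_for_executable` is a made-up helper name.

wait_for_executable() {
  # $1 = path, $2 = timeout in seconds (default 60),
  # $3 = poll interval in seconds (default 1)
  local path="$1" timeout="${2:-60}" interval="${3:-1}"
  # SECONDS is bash's built-in counter of seconds since the shell started.
  local deadline=$((SECONDS + timeout))
  until [ -x "$path" ]; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "timed out waiting for $path" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# /bin/sh already exists, so this returns immediately.
wait_for_executable /bin/sh 5 && echo "prerequisite ready"
```

A dependent workload could call this with the wrapper path before launching its container, trading a little extra code for a clear error when the prerequisite workload failed.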

## Deploying your own

@@ -91,9 +133,12 @@ Costs seconds of wasted polling at boot; easy to reason about; no workload-runne
$EDITOR apps/myapp/workload.json
```
2. Decide where it runs:
- **CP VM**: add a `bake apps/myapp/workload.json` line to the workload-building `run:` step in `.github/workflows/deploy-cp.yml`.
- **Agent VM**: add the same call to `apps/_infra/local-agents.sh` in `build_config_iso()`.
- **Ad-hoc, runtime-only**: POST the baked JSON to `/deploy` on a running agent:
- **CP VM**: add a `bake apps/myapp/workload.json` line to the
workload-building `run:` step in `.github/workflows/deploy-cp.yml`.
- **Agent VM**: add the same call to `apps/_infra/local-agents.sh` in
`build_config_iso()`.
- **Ad-hoc, runtime-only**: POST the baked JSON to `/deploy` on a running
agent:
```
curl -H "Authorization: Bearer $DD_PAT" \
-H "Content-Type: application/json" \
@@ -103,6 +148,17 @@ Costs seconds of wasted polling at boot; easy to reason about; no workload-runne

## Reference

- Schema source of truth: [`src/easyenclave/src/workload.rs`](../src/easyenclave/src/workload.rs) — the `DeployRequest` struct EE deserializes on `/deploy`.
- CP deploy caller: [`.github/workflows/deploy-cp.yml`](../.github/workflows/deploy-cp.yml) — inline `bake()` + CP workload set.
- Agent VM builder: [`apps/_infra/local-agents.sh`](_infra/local-agents.sh) — inline `bake()` + agent workload set per kind.
- Schema source of truth:
[`src/easyenclave/src/workload.rs`](../src/easyenclave/src/workload.rs) —
the `DeployRequest` struct EE deserializes on `/deploy`. `expose` is not in
this struct; EE silently ignores it. DD reads it at the bake + register
boundary.
- CP deploy caller:
[`.github/workflows/deploy-cp.yml`](../.github/workflows/deploy-cp.yml) —
inline `bake()` + CP workload set.
- Agent VM builder:
[`apps/_infra/local-agents.sh`](_infra/local-agents.sh) — inline `bake()` +
agent workload set per kind.
- Ingress plumbing: `src/cf.rs` (`create()` takes per-workload ingress),
`src/cp.rs` (`register` handler accepts `extra_ingress`), `src/agent.rs`
(reads `DD_EXTRA_INGRESS`, forwards on `/register`).
137 changes: 48 additions & 89 deletions apps/_infra/local-agents.sh
@@ -1,8 +1,14 @@
#!/usr/bin/env bash
# local-agents.sh — define two local TDX agent VMs on this host:
#
# dd-local-preview : no GPU, registers with the PR-preview CP
# dd-local-prod : H100 passthrough, registers with production
# dd-local-preview : no GPU, registers with the PR-preview CP. Bare
# agent + podman — no demo workload — so the release
# pipeline can prove registration + tunnel end-to-end
# against per-PR CPs without needing GPU hardware.
# dd-local-prod : H100 passthrough, registers with production. Boots
# the web-nvidia-smi demo workload + declares a
# `gpu.<agent-host>` ingress so the output is reachable
# from the public internet.
#
# Both reuse the existing easyenclave base qcow2 via copy-on-write
# overlays; each gets its own config.iso baking in DD_CP_URL + DD_PAT +
@@ -46,9 +52,7 @@ BASE_DOMAIN="easyenclave-local"
#
# envsubst is restricted to the ALL-CAPS `${VAR}` references that
# appear in the template itself. Lowercase `$i`, `${i}`, and bare
# `$((…))` arithmetic inside shell cmd strings are left alone —
# otherwise envsubst would eat shell locals in openclaw's `until`
# loop and produce broken scripts.
# `$((…))` arithmetic inside shell cmd strings are left alone.
bake() {
case "$1" in
*.json.tmpl)
@@ -67,6 +71,13 @@ bake() {
esac
}

# Extract `expose` entries from a stream of baked workloads and emit
# them as a compact JSON array of `{hostname_label, port}` — the
# shape dd-agent expects in $DD_EXTRA_INGRESS.
extract_extra_ingress() {
jq -cs '[.[] | select(.expose) | .expose]'
}

[ -r "$BASE" ] || { echo "missing $BASE" >&2; exit 1; }
virsh dominfo "$BASE_DOMAIN" >/dev/null 2>&1 || {
echo "base libvirt domain '$BASE_DOMAIN' not defined — rebuild the EE image first" >&2
@@ -90,44 +101,41 @@ build_config_iso() {
tmp=$(mktemp -d)
trap "rm -rf $tmp" RETURN

# Boot workload chain (EE spawns concurrently; each uses `until`
# loops to self-sequence):
# nv — insmod nvidia driver (prod only, first so the
# device nodes exist by the time ollama runs)
# mount-models — mount /dev/vdc at /var/lib/easyenclave/ollama
# podman-static — fetch the podman binary tarball into /var/lib/easyenclave/bin
# podman-bootstrap — stage binaries, write containers.conf + policy.json,
# install /var/lib/easyenclave/bin/podman as the wrapper
# (symlinked from dd-podman for back-compat)
# ollama — run docker.io/ollama/ollama:latest serve via the wrapper
# openclaw — wait for ollama, pull $MODEL, launch openclaw gateway
# cloudflared — fetch cloudflared binary (dd-register spawns it)
# dd-agent — run devopsdefender agent, register with CP, serve workloads
#
# Prod gets the GPU model; preview gets the tiny CPU-friendly one.
local model ollama_spec
if [ "$with_gpu" = "yes" ]; then
model="qwen2.5:7b"
ollama_spec="$REPO_ROOT/apps/ollama/workload.prod.json"
else
model="qwen2.5:0.5b"
ollama_spec="$REPO_ROOT/apps/ollama/workload.preview.json"
fi

local workloads
workloads=$({
# Boot workload chain (EE spawns concurrently; dependents self-sequence
# via `until` loops):
# nv — insmod nvidia driver (prod only, first so device
# nodes exist by the time web-nvidia-smi runs)
# podman-static — fetch the podman tarball into /var/lib/easyenclave/bin
# podman-bootstrap — stage binaries, install /var/lib/easyenclave/bin/podman
# wrapper + containers.conf + policy.json
# web-nvidia-smi — prod only. Run nvidia/cuda container, serve
# `nvidia-smi` output on :8081.
# cloudflared — fetch binary (agent spawns the tunnel process)
# dd-agent — register with CP, serve workloads. Requests the
# gpu.<agent-host> ingress via $DD_EXTRA_INGRESS,
# computed below from `expose` entries on the
# baked workloads.
local bare_workloads
bare_workloads=$({
[ "$with_gpu" = "yes" ] && bake "$REPO_ROOT/apps/nv/workload.json"
bake "$REPO_ROOT/apps/mount-models/workload.json"
bake "$REPO_ROOT/apps/podman-static/workload.json"
bake "$REPO_ROOT/apps/podman-bootstrap/workload.json"
bake "$ollama_spec"
MODEL="$model" bake "$REPO_ROOT/apps/openclaw/workload.json.tmpl"
[ "$with_gpu" = "yes" ] && bake "$REPO_ROOT/apps/web-nvidia-smi/workload.json"
bake "$REPO_ROOT/apps/cloudflared/workload.json"
})

local extra_ingress
extra_ingress=$(echo "$bare_workloads" | extract_extra_ingress)

local workloads
workloads=$({
echo "$bare_workloads"
DD_CP_URL="$cp" \
DD_PAT="$DD_PAT" \
DD_ITA_API_KEY="$DD_ITA_API_KEY" \
DD_ENV="$env" \
DD_VM_NAME="dd-local-$name" \
DD_EXTRA_INGRESS="$extra_ingress" \
bake "$REPO_ROOT/apps/dd-agent/workload.json.tmpl"
} | jq -cs '.')

@@ -139,7 +147,7 @@ build_config_iso() {
# ext4 — EE rootfs has no iso9660 module.
truncate -s 4M "$out"
mkfs.ext4 -q -d "$tmp" "$out"
echo " wrote $out (env=$env, gpu=$with_gpu, model=$model)"
echo " wrote $out (env=$env, gpu=$with_gpu, extra_ingress=$extra_ingress)"
}

build_overlay() {
@@ -154,24 +162,6 @@ build_overlay() {
echo " wrote $overlay (backing $BASE)"
}

# Persistent models disk — survives VM relaunch, so ollama doesn't
# re-download the model each time. Pre-formatted ext4 on the host;
# the guest just mounts it.
build_models_disk() {
# $1=name, $2=size_gb
local name="$1" size_gb="$2"
local models="$IMG_DIR/dd-local-$name-models.qcow2"
if [ -f "$models" ]; then
echo " models disk $models already exists (reusing)"
return
fi
qemu-img create -q -f raw "$models.raw" "${size_gb}G"
mkfs.ext4 -q -F "$models.raw"
qemu-img convert -q -f raw -O qcow2 "$models.raw" "$models"
rm -f "$models.raw"
echo " wrote $models (${size_gb}G ext4)"
}

render_domain_xml() {
# $1=name, $2=with_gpu (yes/no)
local name="$1" with_gpu="$2"
@@ -193,22 +183,11 @@ render_domain_xml() {
sed -i "s|/var/log/ee-local\\.log|/var/log/ee-local-$name.log|g" "$out"

# Size the VM for the workload. Base easyenclave-local is 4 GiB /
# 2 vCPU — fine for a bare agent, undersized for podman + ollama
# + the openclaw gateway on a 900 MB container image. Host has
# 243 GiB / 64 cores, so we can be generous.
#
# prod: 32 GiB / 16 vCPU (GPU handles the model; host RAM
# for podman, openclaw, image pull
# scratch, model load spill)
# preview: 16 GiB / 8 vCPU (CPU-only inference; qwen2.5:0.5b
# + 64k ctx + gateway)
if [ "$with_gpu" = "yes" ]; then
local mem_kib=33554432 # 32 GiB
local vcpus=16
else
local mem_kib=16777216 # 16 GiB
local vcpus=8
fi
# 2 vCPU — fine for a bare agent. The demo workloads are modest
# (web-nvidia-smi just runs nvidia-smi on demand + one apt-get at
# boot for netcat). Host has 243 GiB / 64 cores.
local mem_kib=8388608 # 8 GiB
local vcpus=8
sed -i -E "s|<memory unit='KiB'>[0-9]+</memory>|<memory unit='KiB'>$mem_kib</memory>|" "$out"
sed -i -E "s|<currentMemory unit='KiB'>[0-9]+</currentMemory>|<currentMemory unit='KiB'>$mem_kib</currentMemory>|" "$out"
sed -i -E "s|<vcpu placement='static'>[0-9]+</vcpu>|<vcpu placement='static'>$vcpus</vcpu>|" "$out"
@@ -231,20 +210,6 @@ render_domain_xml() {
/<\/hostdev>/{skip=0}' "$out" > "$out.tmp" && mv "$out.tmp" "$out"
fi

# Add a persistent models disk as vdc. EE will mount it at
# /var/lib/easyenclave/ollama via the mount-models boot workload.
local models="$IMG_DIR/dd-local-$name-models.qcow2"
local disk_block=" <disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='$models'/>
<target dev='vdc' bus='virtio'/>
</disk>"
# Insert before </devices>.
awk -v block="$disk_block" '
/<\/devices>/ { print block }
{ print }
' "$out" > "$out.tmp" && mv "$out.tmp" "$out"

echo "$out"
}

@@ -256,12 +221,6 @@ define_agent() {

echo "== dd-local-$name → $cp (env=$env_label, gpu=$with_gpu) =="
build_overlay "$name"
# Models disk: prod holds the GPU model (few GB), preview holds the small CPU one.
if [ "$with_gpu" = "yes" ]; then
build_models_disk "$name" 40
else
build_models_disk "$name" 10
fi
build_config_iso "$name" "$cp" "$env_label" "$with_gpu"
local xml
xml=$(render_domain_xml "$name" "$with_gpu")
3 changes: 2 additions & 1 deletion apps/dd-agent/workload.json.tmpl
@@ -17,6 +17,7 @@
"DD_OWNER=devopsdefender",
"DD_ENV=${DD_ENV}",
"DD_VM_NAME=${DD_VM_NAME}",
"DD_PORT=8080"
"DD_PORT=8080",
"DD_EXTRA_INGRESS=${DD_EXTRA_INGRESS}"
]
}
7 changes: 0 additions & 7 deletions apps/mount-models/workload.json

This file was deleted.

7 changes: 0 additions & 7 deletions apps/ollama/workload.preview.json

This file was deleted.
