From eaf40d30d86cdc5b7932d680b42a4dcdecbddc3a Mon Sep 17 00:00:00 2001
From: Alex Newman
Date: Sat, 18 Apr 2026 18:12:22 +0000
Subject: [PATCH 1/2] ci: unify workloads + collapse deploy paths into Release + cleanup audit
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Consolidates PRs #127, #131, #132 into one commit. Net -381 lines, one
"Release" workflow drives the whole fleet lifecycle.

Workload spec:
  apps//workload.{json,json.tmpl} becomes the single source of truth
  for every EE workload (cloudflared, dd-agent, dd-management, ollama,
  openclaw, nv, mount-models, podman-static, podman-bootstrap).
  Boot-time (config.iso / ee-config metadata) and runtime (/deploy)
  both bake from the same file.

Workflow topology:
  release.yml is the one entry point.
    pull_request      → build → deploy-preview → dd-local-preview relaunch
    push main         → build → deploy-production → dd-local-prod relaunch
    push v*           → build only (versioned artifact, no deploy)
    workflow_dispatch → build → deploy-production (rollback: release_tag input)

  .github/workflows/deploy-cp.yml is the reusable workflow both paths
  call, so preview CI exercises the exact code prod uses. It provisions
  the GCP CP VM, verifies health + /cp/attest MRTD + dashboard +
  STONITH, comments on PR, and cascades agent relaunch. The cascade is
  blocking: a release goes green only when the matching
  dd-local-{kind} VM re-registers with the freshly-deployed CP (proves
  "everything works").

  .github/actions/relaunch-agent/ is the composite action that SSHes
  into tdx2, runs dd-relaunch.sh, and polls /api/agents for the
  freshly-registered entry (5-min budget).

Deleted:
  .github/workflows/production-deploy.yml (folded into release.yml as
  deploy-production job).
  .github/workflows/local-agents.yml (manual-dispatch path gone; push a
  commit to trigger a relaunch).
  .github/workflows/retire-staging.yml (one-shot, already run, no
  dd_env=staging VMs exist).
  scripts/ entirely.
  gcp-deploy.sh inlined into deploy-cp.yml; dd-relaunch.sh +
  local-agents.sh moved to apps/_infra/ (host-side); ollama-deploy.sh,
  redeploy-workload.sh, workloads.sh deleted as unused after the
  refactor.

Cleanup audit:
  gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a day
  after PR #121 merged — branch not deleted, so pr-teardown.yml never
  fired and cleanup.yml only reaps TERMINATED. Added a
  reap-merged-pr-previews job to cleanup.yml that resolves each RUNNING
  pr-N VM's PR state via gh and tears down VM + CF tunnel + DNS CNAME
  when MERGED/CLOSED. Dropped reap-staging (dead), trimmed workflow_run
  trigger (Production Deploy is gone).

Deferred (blocked on easyenclave):
  easyenclave image family split — preview → easyenclave-staging,
  prod → easyenclave-stable. easyenclave has no stable GCP family and
  no qcow2 on v0.1.14 today, so nothing to point prod at. Plus tdx2
  auto-refresh of the base qcow2 from easyenclave releases. Follow-up
  PR once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/actions/relaunch-agent/action.yml | 103 ++++++
 .github/workflows/cleanup.yml | 152 ++++++---
 .github/workflows/deploy-cp.yml | 362 ++++++++++++++++++++++
 .github/workflows/local-agents.yml | 111 -------
 .github/workflows/production-deploy.yml | 146 ---------
 .github/workflows/release.yml | 297 ++++--------------
 .github/workflows/retire-staging.yml | 98 ------
 README.md | 12 +-
 apps/_infra/dd-relaunch.sh | 52 ++++
 {scripts => apps/_infra}/local-agents.sh | 11 +-
 apps/cloudflared/workload.json | 8 +
 apps/dd-agent/workload.json.tmpl | 22 ++
 apps/dd-management/workload.json.tmpl | 29 ++
 apps/mount-models/workload.json | 7 +
 apps/nv/workload.json | 7 +
 apps/ollama/workload.preview.json | 7 +
 apps/ollama/workload.prod.json | 7 +
 apps/openclaw/workload.json.tmpl | 7 +
 apps/podman-bootstrap/workload.json | 7 +
 apps/podman-static/workload.json | 7 +
 scripts/dd-relaunch.sh | 53 ----
 scripts/gcp-deploy.sh | 177 -----------
scripts/ollama-deploy.sh | 327 ------------------- 23 files changed, 814 insertions(+), 1195 deletions(-) create mode 100644 .github/actions/relaunch-agent/action.yml create mode 100644 .github/workflows/deploy-cp.yml delete mode 100644 .github/workflows/local-agents.yml delete mode 100644 .github/workflows/production-deploy.yml delete mode 100644 .github/workflows/retire-staging.yml create mode 100755 apps/_infra/dd-relaunch.sh rename {scripts => apps/_infra}/local-agents.sh (95%) create mode 100644 apps/cloudflared/workload.json create mode 100644 apps/dd-agent/workload.json.tmpl create mode 100644 apps/dd-management/workload.json.tmpl create mode 100644 apps/mount-models/workload.json create mode 100644 apps/nv/workload.json create mode 100644 apps/ollama/workload.preview.json create mode 100644 apps/ollama/workload.prod.json create mode 100644 apps/openclaw/workload.json.tmpl create mode 100644 apps/podman-bootstrap/workload.json create mode 100644 apps/podman-static/workload.json delete mode 100755 scripts/dd-relaunch.sh delete mode 100755 scripts/gcp-deploy.sh delete mode 100755 scripts/ollama-deploy.sh diff --git a/.github/actions/relaunch-agent/action.yml b/.github/actions/relaunch-agent/action.yml new file mode 100644 index 0000000..b449289 --- /dev/null +++ b/.github/actions/relaunch-agent/action.yml @@ -0,0 +1,103 @@ +name: Relaunch local TDX agent +description: >- + SSH into the tdx2 host, recreate the matching dd-local-{kind} libvirt + domain against the given CP url (pulling apps/ from the given git ref), + then block until the agent re-registers with the CP. A release is "done" + only when this action succeeds end-to-end. + +inputs: + kind: + description: 'prod | preview — which libvirt domain to relaunch' + required: true + url: + description: 'CP URL the agent should register against (e.g. 
https://app.devopsdefender.com)' + required: true + ref: + description: 'git ref whose scripts/apps tree dd-relaunch.sh should check out on the host' + required: true + ssh-key: + description: 'Private SSH key for tdx2@host' + required: true + host: + description: 'Public host address of the tdx2 node' + required: true + dd-pat: + description: 'GitHub PAT the agent uses to talk to the CP' + required: true + ita-api-key: + description: 'Intel Trust Authority API key for attestation' + required: true + +runs: + using: composite + steps: + # CP must be reachable before we SSH — on PR pushes we race with + # Release's deploy-preview standing up the pr-N CP. /health is public. + - name: Wait for CP to be healthy + shell: bash + env: + URL: ${{ inputs.url }} + run: | + for i in $(seq 1 60); do + if curl -fsS --max-time 5 "$URL/health" >/dev/null 2>&1; then + echo "CP $URL healthy after ${i} attempts" + exit 0 + fi + echo " waiting for $URL... (${i}/60)" + sleep 10 + done + echo "::error::CP $URL never came up within 10 min" + exit 1 + + # SSH in and relaunch the VM (destroy + redefine + start). Finishes + # in ~10 s — the baked config.iso's EE_BOOT_WORKLOADS drives the rest. + - name: ssh + relaunch VM + shell: bash + env: + SSH_KEY: ${{ inputs.ssh-key }} + HOST: ${{ inputs.host }} + DD_PAT: ${{ inputs.dd-pat }} + DD_ITA_API_KEY: ${{ inputs.ita-api-key }} + KIND: ${{ inputs.kind }} + URL: ${{ inputs.url }} + REF: ${{ inputs.ref }} + run: | + mkdir -p ~/.ssh + printf '%s\n' "$SSH_KEY" > ~/.ssh/id_ed25519 + chmod 600 ~/.ssh/id_ed25519 + ssh-keyscan -H "$HOST" >> ~/.ssh/known_hosts 2>/dev/null + ssh -o BatchMode=yes -o StrictHostKeyChecking=yes \ + -i ~/.ssh/id_ed25519 "tdx2@$HOST" \ + "DD_PAT='$DD_PAT' DD_ITA_API_KEY='$DD_ITA_API_KEY' /home/tdx2/src/dd/apps/_infra/dd-relaunch.sh '$KIND' '$URL' '$REF'" + + # Block until the freshly-booted agent VM registers with the CP. + # This is the "I can see the local agent deployment worked" signal + # that gates the whole release. 
5-min budget covers a cold VM boot + # (~60s) + cloudflared tunnel (~30s) + agent startup + register — + # plenty of headroom. Doesn't probe openclaw/ollama readiness — + # that first-boot pays a 30-min npm-install tax and isn't part + # of the release gate. + - name: Verify agent registered with CP + shell: bash + env: + URL: ${{ inputs.url }} + DD_PAT: ${{ inputs.dd-pat }} + KIND: ${{ inputs.kind }} + run: | + vm="dd-local-$KIND" + started_at=$(date -u +%Y-%m-%dT%H:%M:%SZ) + AUTH=(-H "Authorization: Bearer $DD_PAT") + for i in $(seq 1 30); do + host=$(curl -fsS --max-time 10 "${AUTH[@]}" "$URL/api/agents" 2>/dev/null \ + | jq -r --arg since "$started_at" --arg vm "$vm" ' + [.[] | select(.vm_name==$vm and .status=="healthy" and .last_seen > $since)] + | sort_by(.last_seen) | reverse | .[0].hostname // empty' 2>/dev/null || true) + if [ -n "$host" ] && [ "$host" != "null" ]; then + echo "$vm registered at https://$host" + exit 0 + fi + echo " waiting for $vm to register with $URL... (${i}/30)" + sleep 10 + done + echo "::error::$vm never registered with $URL within 5 min" + exit 1 diff --git a/.github/workflows/cleanup.yml b/.github/workflows/cleanup.yml index 989b304..923ee87 100644 --- a/.github/workflows/cleanup.yml +++ b/.github/workflows/cleanup.yml @@ -1,29 +1,34 @@ name: Cleanup -# Reap TERMINATED dd-{env}-* VMs. STONITH self-poweroff leaves the VM -# in TERMINATED state — it uses no compute but clutters the inventory -# and a long enough chain of deploys turns into pages of dead VMs. +# Background safety net that reaps GCE VMs the primary cleanup paths +# missed. Primary paths today: # -# Two jobs run in parallel, one per environment, so a regression in -# either auth/zone/project doesn't block the other. The cleanup is -# idempotent: skip if nothing to reap. +# - STONITH: dd-register deletes the old VM's CF tunnel on startup → +# old cloudflared exits → old dd-register poweroffs → +# TERMINATED. Happens on every deploy of the same env. 
+# - Teardown: pr-teardown.yml fires on branch-delete and deletes the +# VM + tunnel + DNS. Happens when a dev deletes the branch. +# +# Gaps this workflow covers: +# - TERMINATED VMs accumulate between STONITH and branch-delete. +# - A PR that's merged/closed but whose branch survives → the preview +# VM stays RUNNING forever, burning compute. reap-merged-pr-previews +# finds these and treats them like a branch-delete (VM + tunnel + DNS). # # Triggers: # - workflow_dispatch (operator-initiated cleanup) -# - workflow_run completion of Release / Production Deploy (catch -# post-deploy zombies opportunistically) +# - workflow_run completion of Release (catch post-deploy zombies +# opportunistically; Release covers both preview and prod now) # - schedule, every 6 hours (background safety net) on: workflow_dispatch: workflow_run: - workflows: ["Release", "Production Deploy"] + workflows: ["Release"] types: [completed] schedule: - cron: '0 */6 * * *' -# Don't pile up identical reaps when several deploys land in quick -# succession — one in-flight reap is enough. concurrency: group: dd-cleanup cancel-in-progress: false @@ -31,11 +36,14 @@ concurrency: permissions: contents: read +env: + GCP_ZONE: us-central1-c + jobs: - # PR preview envs (dd_env=pr-*) accumulate during active PRs — every - # push STONITHs the old VM into TERMINATED. PR close runs - # pr-teardown.yml which deletes them, but between pushes they stack - # up. This job reaps them in place. + # PR preview envs (dd_env=pr-*) accumulate TERMINATED VMs during + # active PRs — every push STONITHs the old VM. Branch-delete reaps + # the matching VMs; between pushes they stack up. This reaps them + # in place. reap-pr-previews: runs-on: ubuntu-latest environment: staging @@ -51,11 +59,7 @@ jobs: - name: Reap TERMINATED dd-pr-* VMs env: GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} - GCP_ZONE: us-central1-c run: | - # gcloud filter regex: `~` matches against the value. 
Anchor - # to start so we don't accidentally match an env like - # "foo-pr-bar" in the future. DEAD=$(gcloud compute instances list \ --project="$GCP_PROJECT_ID" \ --filter='labels.devopsdefender=managed AND labels.dd_env~"^pr-" AND status=TERMINATED' \ @@ -69,29 +73,28 @@ jobs: gcloud compute instances delete $DEAD \ --project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet - reap-staging: + reap-production: runs-on: ubuntu-latest - environment: staging + environment: production permissions: contents: read id-token: write steps: - uses: google-github-actions/auth@v2 with: - workload_identity_provider: 'projects/654815109728/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider' - service_account: 'easyenclave-staging-ci@eestaging.iam.gserviceaccount.com' + workload_identity_provider: 'projects/779946350556/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider' + service_account: 'easyenclave-production-ci@easyenclave.iam.gserviceaccount.com' - uses: google-github-actions/setup-gcloud@v2 - - name: Reap TERMINATED dd-staging VMs + - name: Reap TERMINATED dd-production VMs env: GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} - GCP_ZONE: us-central1-c run: | DEAD=$(gcloud compute instances list \ --project="$GCP_PROJECT_ID" \ - --filter="labels.devopsdefender=managed AND labels.dd_env=staging AND status=TERMINATED" \ + --filter="labels.devopsdefender=managed AND labels.dd_env=production AND status=TERMINATED" \ --format="value(name)") if [ -z "$DEAD" ]; then - echo "No TERMINATED dd-staging VMs to reap." + echo "No TERMINATED dd-production VMs to reap." 
exit 0 fi echo "Reaping: $(echo "$DEAD" | tr '\n' ' ')" @@ -99,32 +102,97 @@ jobs: gcloud compute instances delete $DEAD \ --project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet - reap-production: + # RUNNING pr-N VMs whose PR is merged or closed are leaked compute — + # neither STONITH (waits for a new deploy) nor pr-teardown.yml (waits + # for branch-delete) reaches them. This finds them and tears them + # down like a branch-delete would have: VM + CF tunnel + DNS CNAME. + reap-merged-pr-previews: runs-on: ubuntu-latest - environment: production + environment: staging permissions: contents: read id-token: write + pull-requests: read steps: - uses: google-github-actions/auth@v2 with: - workload_identity_provider: 'projects/779946350556/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider' - service_account: 'easyenclave-production-ci@easyenclave.iam.gserviceaccount.com' + workload_identity_provider: 'projects/654815109728/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider' + service_account: 'easyenclave-staging-ci@eestaging.iam.gserviceaccount.com' - uses: google-github-actions/setup-gcloud@v2 - - name: Reap TERMINATED dd-production VMs + - name: Reap RUNNING pr-N VMs whose PR is closed or merged env: GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} - GCP_ZONE: us-central1-c + CF_API_TOKEN: ${{ secrets.DD_CP_CF_API_TOKEN }} + CF_ACCOUNT_ID: ${{ secrets.DD_CP_CF_ACCOUNT_ID }} + CF_ZONE_ID: ${{ secrets.DD_CP_CF_ZONE_ID }} + DD_DOMAIN: ${{ vars.DD_CF_DOMAIN || 'devopsdefender.com' }} + GH_TOKEN: ${{ github.token }} run: | - DEAD=$(gcloud compute instances list \ + # Unique set of pr-N envs currently RUNNING. + envs=$(gcloud compute instances list \ --project="$GCP_PROJECT_ID" \ - --filter="labels.devopsdefender=managed AND labels.dd_env=production AND status=TERMINATED" \ - --format="value(name)") - if [ -z "$DEAD" ]; then - echo "No TERMINATED dd-production VMs to reap." 
+ --filter='labels.devopsdefender=managed AND labels.dd_env~"^pr-" AND status=RUNNING' \ + --format='value(labels.dd_env)' | sort -u) + if [ -z "$envs" ]; then + echo "No RUNNING dd-pr-* VMs to consider." exit 0 fi - echo "Reaping: $(echo "$DEAD" | tr '\n' ' ')" - # shellcheck disable=SC2086 - gcloud compute instances delete $DEAD \ - --project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet + + for env in $envs; do + pr="${env#pr-}" + state=$(gh pr view "$pr" --repo "${{ github.repository }}" \ + --json state --jq .state 2>/dev/null || echo "UNKNOWN") + if [ "$state" = "OPEN" ]; then + echo "pr-$pr still OPEN — leaving RUNNING VMs alone." + continue + fi + if [ "$state" = "UNKNOWN" ]; then + echo "::warning::could not resolve state for pr-$pr (gh pr view failed); leaving alone" + continue + fi + echo "pr-$pr is $state — tearing down preview env $env" + + # VMs + vms=$(gcloud compute instances list \ + --project="$GCP_PROJECT_ID" \ + --filter="labels.devopsdefender=managed AND labels.dd_env=$env" \ + --format='value(name)') + if [ -n "$vms" ]; then + echo " deleting VMs: $(echo "$vms" | tr '\n' ' ')" + # shellcheck disable=SC2086 + gcloud compute instances delete $vms \ + --project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet + fi + + # CF tunnels — named `dd-{env}-{uuid}`. 
+ resp=$(curl -fsS \ + -H "Authorization: Bearer $CF_API_TOKEN" \ + "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/cfd_tunnel?is_deleted=false&per_page=200") + ids=$(echo "$resp" | jq -r --arg prefix "dd-$env-" \ + '.result[] | select(.name | startswith($prefix)) | .id') + for id in $ids; do + echo " deleting tunnel $id" + curl -fsS -X DELETE \ + -H "Authorization: Bearer $CF_API_TOKEN" \ + "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/cfd_tunnel/$id/connections" \ + >/dev/null || true + curl -fsS -X DELETE \ + -H "Authorization: Bearer $CF_API_TOKEN" \ + "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/cfd_tunnel/$id" \ + >/dev/null || echo "::warning::tunnel $id delete failed (may already be gone)" + done + + # DNS CNAME for pr-N.{domain} + hostname="$env.$DD_DOMAIN" + record_id=$(curl -fsS \ + -H "Authorization: Bearer $CF_API_TOKEN" \ + "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records?type=CNAME&name=$hostname" \ + | jq -r '.result[0].id // empty') + if [ -n "$record_id" ]; then + echo " deleting CNAME $hostname ($record_id)" + curl -fsS -X DELETE \ + -H "Authorization: Bearer $CF_API_TOKEN" \ + "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$record_id" \ + >/dev/null + fi + done diff --git a/.github/workflows/deploy-cp.yml b/.github/workflows/deploy-cp.yml new file mode 100644 index 0000000..d21265c --- /dev/null +++ b/.github/workflows/deploy-cp.yml @@ -0,0 +1,362 @@ +name: Deploy CP + +# Reusable workflow: provision the CP TDX VM on GCP, wait for it to be +# healthy, verify attestation + dashboard + STONITH, then cascade a +# relaunch of the matching dd-local agent VM and block until it +# re-registers. Called from release.yml's deploy-preview (PR path) and +# deploy-production (main / dispatch path) with env-specific inputs — +# both paths share this exact set of verification steps so every PR +# exercises the prod deploy code. 
+# +# GitHub Actions allows ≤4 levels of workflow_call nesting. Today's +# chain is `release.yml → deploy-cp.yml` (2). The agent-relaunch +# cascade uses a composite action (same-job, no nesting) to keep +# headroom for future wrapping. + +on: + workflow_call: + inputs: + env: + description: 'DD_ENV (e.g. "production", "pr-42")' + required: true + type: string + hostname: + description: 'Public hostname (e.g. app.devopsdefender.com)' + required: true + type: string + gcp_environment: + description: 'GitHub environment name — "production" | "staging"' + required: true + type: string + workload_identity_provider: + description: 'GCP Workload Identity Federation provider resource name' + required: true + type: string + service_account: + description: 'GCP service account email' + required: true + type: string + release_tag: + description: 'devopsdefender release tag to deploy (e.g. "latest", "pr-abc123")' + required: true + type: string + oauth_enabled: + description: 'Enable GitHub OAuth (prod only; previews use PAT)' + required: false + type: boolean + default: false + comment_on_pr: + description: 'Leave a PR comment with the preview URL' + required: false + type: boolean + default: false + relaunch_agent: + description: 'After CP deploy, cascade a relaunch of dd-local-{env} via SSH' + required: false + type: boolean + default: true + ref: + description: 'Git ref the tdx2 host should pull before relaunching the agent VM' + required: false + type: string + default: main + +concurrency: + group: deploy-cp-${{ inputs.env }} + cancel-in-progress: false + +jobs: + deploy: + runs-on: ubuntu-latest + environment: ${{ inputs.gcp_environment }} + permissions: + contents: read + id-token: write + pull-requests: write + env: + DD_ENV: ${{ inputs.env }} + DD_HOSTNAME: ${{ inputs.hostname }} + GCP_ZONE: us-central1-c + steps: + - uses: actions/checkout@v4 + + - uses: google-github-actions/auth@v2 + with: + workload_identity_provider: ${{ inputs.workload_identity_provider }} + 
service_account: ${{ inputs.service_account }} + - uses: google-github-actions/setup-gcloud@v2 + + - name: Create TDX VM (boots from easyenclave, fetches dd from GitHub releases) + env: + GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} + DD_DOMAIN: ${{ vars.DD_CF_DOMAIN || 'devopsdefender.com' }} + CLOUDFLARE_API_TOKEN: ${{ secrets.DD_CP_CF_API_TOKEN }} + CLOUDFLARE_ACCOUNT_ID: ${{ secrets.DD_CP_CF_ACCOUNT_ID }} + CLOUDFLARE_ZONE_ID: ${{ secrets.DD_CP_CF_ZONE_ID }} + # OAuth only in environments that have these set (production). + # Empty placeholder values get stripped below before baking the + # workload spec, so dd-web disables /auth/github/* and serves + # /auth/pat only in those envs. + DD_GITHUB_CLIENT_ID: ${{ inputs.oauth_enabled && (vars.DD_GITHUB_CLIENT_ID || secrets.DD_GITHUB_CLIENT_ID) || '' }} + DD_GITHUB_CALLBACK_URL: ${{ inputs.oauth_enabled && vars.DD_GITHUB_CALLBACK_URL || '' }} + DD_GITHUB_CLIENT_SECRET: ${{ inputs.oauth_enabled && secrets.DD_GITHUB_CLIENT_SECRET || '' }} + DD_ITA_API_KEY: ${{ secrets.DD_ITA_API_KEY }} + DD_RELEASE_TAG: ${{ inputs.release_tag }} + EE_IMAGE_FAMILY: easyenclave-staging + EE_IMAGE_PROJECT: easyenclave + VM_MACHINE_TYPE: c3-standard-4 + VM_DISK_SIZE: 10GB + DD_ITA_BASE_URL: https://api.trustauthority.intel.com + DD_ITA_JWKS_URL: https://portal.trustauthority.intel.com/certs + DD_ITA_ISSUER: https://portal.trustauthority.intel.com + run: | + set -euo pipefail + + VM_NAME="dd-${DD_ENV}-$(date +%s)" + : "${DD_ITA_API_KEY:?set DD_ITA_API_KEY via secrets.DD_ITA_API_KEY}" + export DD_GITHUB_CALLBACK_URL="${DD_GITHUB_CALLBACK_URL:-https://${DD_HOSTNAME}/auth/github/callback}" + + # Bake a workload template: envsubst ${VAR} placeholders and + # strip any "KEY=" env entries that ended up with empty values + # (e.g. OAuth creds in non-prod envs). + bake() { + case "$1" in + *.json.tmpl) + envsubst < "$1" \ + | jq -c 'if .env then .env |= map(select(test("^[^=]+=.+"))) else . end' + ;; + *.json) + jq -c . 
"$1" + ;; + *) + echo "::error::unknown workload file type: $1" >&2 + return 1 + ;; + esac + } + + # Boot workloads come from apps//workload.{json,json.tmpl}. + # cloudflared fetches the binary onto PATH; dd-management runs + # devopsdefender in DD_MODE=management (CP + dashboard). + EE_BOOT_WORKLOADS=$({ + bake apps/cloudflared/workload.json + bake apps/dd-management/workload.json.tmpl + } | jq -cs '.') + + jq -c -n \ + --arg workloads "$EE_BOOT_WORKLOADS" \ + '{ "EE_BOOT_WORKLOADS": $workloads, "EE_OWNER": "devopsdefender" }' \ + > /tmp/ee-config.json + + gcloud compute instances create "$VM_NAME" \ + --project="$GCP_PROJECT_ID" \ + --zone="$GCP_ZONE" \ + --machine-type="$VM_MACHINE_TYPE" \ + --confidential-compute-type=TDX \ + --maintenance-policy=TERMINATE \ + --boot-disk-size="$VM_DISK_SIZE" \ + --image-family="$EE_IMAGE_FAMILY" \ + --image-project="$EE_IMAGE_PROJECT" \ + --metadata-from-file=ee-config=/tmp/ee-config.json \ + --labels=devopsdefender=managed,dd_env="${DD_ENV}" \ + --tags=dd-management + + rm -f /tmp/ee-config.json + echo "VM: $VM_NAME ($DD_HOSTNAME, release $DD_RELEASE_TAG)" + + - name: Wait for agent health (streams serial console) + env: + AGENT_URL: https://${{ inputs.hostname }} + GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} + run: | + VM_NAME=$(gcloud compute instances list \ + --project="$GCP_PROJECT_ID" \ + --filter="labels.devopsdefender=managed AND labels.dd_env=${DD_ENV}" \ + --format="value(name)" --sort-by=~creationTimestamp | head -1) + if [ -z "$VM_NAME" ]; then + echo "::error::no dd-${DD_ENV} VM found — gcp-deploy.sh must have failed" + exit 1 + fi + echo "Watching VM: $VM_NAME (zone: $GCP_ZONE)" + + LAST_LINES=0 + for i in $(seq 1 60); do + # Stream serial console so boot failures (DHCP hang, release + # fetch error, cloudflared exit, etc.) are visible without + # shelling into GCP. 
+ gcloud compute instances get-serial-port-output "$VM_NAME" \ + --project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" 2>/dev/null \ + > /tmp/serial.log || true + TOTAL_LINES=$(wc -l < /tmp/serial.log) + if [ "$TOTAL_LINES" -gt "$LAST_LINES" ]; then + tail -n +$((LAST_LINES + 1)) /tmp/serial.log \ + | sed 's/^/[serial] /' + LAST_LINES=$TOTAL_LINES + fi + + if grep -qE "FATAL|Kernel panic|Invalid ELF header|/bin/sh: can't access tty" /tmp/serial.log; then + echo "::error::boot failed — serial log shows fatal pattern" + exit 1 + fi + + if curl -fsS "${AGENT_URL}/health" >/dev/null 2>&1; then + echo "Agent healthy at ${AGENT_URL}" + exit 0 + fi + echo " waiting for tunnel... (${i}/60)" + sleep 5 + done + echo "::error::Agent not healthy within 5 minutes" + echo "--- final serial tail ---" + tail -80 /tmp/serial.log | sed 's/^/[serial] /' + exit 1 + + - name: Verify NEW VM via TDX attestation + env: + AGENT_URL: https://${{ inputs.hostname }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + # /cp/attest proves the freshly-deployed VM is serving the tunnel + # (stale tunnels point at old VMs that 404 on this endpoint). + # MRTD = 48 bytes at offset 184 in TDX quote v4; if non-zero, + # attestation actually worked. + NONCE=$(openssl rand -base64 16) + for attempt in $(seq 1 60); do + BODY=$(curl -sG -w '\n%{http_code}' \ + -H "Authorization: Bearer ${GITHUB_TOKEN}" \ + --data-urlencode "nonce=${NONCE}" \ + "${AGENT_URL}/cp/attest" || echo $'\n000') + CODE=$(echo "$BODY" | tail -n1) + JSON=$(echo "$BODY" | sed '$d') + if [ "$CODE" = "200" ]; then + QUOTE_B64=$(echo "$JSON" | jq -r '.quote_b64 // empty') + if [ -n "$QUOTE_B64" ] && [ "$QUOTE_B64" != "null" ]; then + MRTD=$(echo "$QUOTE_B64" | base64 -d \ + | dd bs=1 skip=184 count=48 status=none | xxd -p -c 48) + if [ -n "$MRTD" ] && [ "$MRTD" != "$(printf '00%.0s' {1..48})" ]; then + echo "NEW VM verified — MRTD: $MRTD" + exit 0 + fi + echo " /cp/attest 200 but MRTD empty/zero, retrying... 
(${attempt}/60)" + else + echo " /cp/attest 200 but no quote_b64, retrying... (${attempt}/60)" + fi + else + echo " /cp/attest returned HTTP ${CODE}, retrying... (${attempt}/60)" + fi + sleep 10 + done + echo "::error::/cp/attest never returned a valid quote — stale tunnel or new VM never came up" + exit 1 + + - name: Verify dashboard renders + env: + AGENT_URL: https://${{ inputs.hostname }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + # Fast sanity check on top of /cp/attest — proves dd-web is up + # and accepts the CI PAT's Bearer auth. + for attempt in $(seq 1 12); do + code=$(curl -s -o /dev/null -w '%{http_code}' \ + -H "Authorization: Bearer ${GITHUB_TOKEN}" \ + "${AGENT_URL}/" || echo 000) + if [ "$code" = "200" ]; then + echo "Dashboard renders (HTTP 200, attempt ${attempt})" + exit 0 + fi + echo " dashboard returned HTTP ${code}, retrying... (${attempt}/12)" + sleep 5 + done + echo "::error::dashboard / never returned 200 (last HTTP ${code})" + exit 1 + + - name: Verify STONITH halted prior VM(s) in this env + env: + GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} + run: | + # dd-register STONITHs the old VM on startup by deleting its + # CF tunnel → old cloudflared exits → old dd-register poweroffs. + # Scoped to this env — per-PR previews are hostname-isolated, + # so this only reaps prior deploys of the same env. 
+ NEW_VM=$(gcloud compute instances list \ + --project="$GCP_PROJECT_ID" \ + --filter="labels.devopsdefender=managed AND labels.dd_env=${DD_ENV}" \ + --format="value(name)" --sort-by=~creationTimestamp | head -1) + echo "new VM: $NEW_VM" + SURVIVORS="" + for i in $(seq 1 24); do + SURVIVORS=$(gcloud compute instances list \ + --project="$GCP_PROJECT_ID" \ + --filter="labels.devopsdefender=managed AND labels.dd_env=${DD_ENV} AND status=RUNNING" \ + --format="value(name)" \ + | grep -vx "$NEW_VM" || true) + if [ -z "$SURVIVORS" ]; then + echo "STONITH verified — only $NEW_VM running in ${DD_ENV}" + exit 0 + fi + echo " still running besides $NEW_VM: $(echo "$SURVIVORS" | tr '\n' ' ')" + echo " waiting for STONITH poweroff... (${i}/24)" + sleep 5 + done + echo "::warning::STONITH-by-tunnel-delete timed out; force-deleting zombies:" + echo "$SURVIVORS" + # shellcheck disable=SC2086 + gcloud compute instances delete $SURVIVORS \ + --project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet || true + echo "zombies reaped; $NEW_VM is the only ${DD_ENV} VM" + + - name: Comment preview URL on PR + if: inputs.comment_on_pr && github.event_name == 'pull_request' + uses: actions/github-script@v7 + with: + script: | + const url = `https://${{ inputs.hostname }}`; + const body = [ + `### DD preview ready`, + ``, + `**URL:** ${url}`, + ``, + `Browser login: paste \`gh auth token\` output at ${url}/auth/pat`, + ``, + `CLI / curl: \`curl -H "Authorization: Bearer $(gh auth token)" ${url}/\``, + ``, + `Register endpoint for a local agent: \`wss://${{ inputs.hostname }}/register\``, + ].join('\n'); + const { data: comments } = await github.rest.issues.listComments({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: context.issue.number, + }); + const marker = '### DD preview ready'; + const existing = comments.find(c => c.user.type === 'Bot' && c.body && c.body.includes(marker)); + if (existing) { + await github.rest.issues.updateComment({ + owner: 
context.repo.owner, + repo: context.repo.repo, + comment_id: existing.id, + body, + }); + } else { + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: context.issue.number, + body, + }); + } + + # Cascade a relaunch of the matching dd-local-{env} libvirt domain + # on the tdx2 host, then block on it registering with the freshly- + # deployed CP. This is the gate: a release is "done" only when the + # local agent is back online talking to the new CP. + - name: Relaunch dd-local-${{ inputs.env == 'production' && 'prod' || 'preview' }} + if: inputs.relaunch_agent + uses: ./.github/actions/relaunch-agent + with: + kind: ${{ inputs.env == 'production' && 'prod' || 'preview' }} + url: https://${{ inputs.hostname }} + ref: ${{ inputs.ref }} + ssh-key: ${{ secrets.DD_LOCAL_SSH_KEY }} + host: ${{ secrets.DD_LOCAL_HOST }} + dd-pat: ${{ secrets.GITHUB_TOKEN }} + ita-api-key: ${{ secrets.DD_ITA_API_KEY }} diff --git a/.github/workflows/local-agents.yml b/.github/workflows/local-agents.yml deleted file mode 100644 index 345dbc3..0000000 --- a/.github/workflows/local-agents.yml +++ /dev/null @@ -1,111 +0,0 @@ -name: Local Agents - -# Relaunches the local TDX agent VM on this user's host whenever the -# corresponding CP gets new code: -# - Production Deploy success → reboot dd-local-prod against app.devopsdefender.com -# - Release success on a PR → reboot dd-local-preview against pr-N.devopsdefender.com -# -# SSHs in via key auth to a public-IP host, then invokes -# scripts/dd-relaunch.sh which handles the destroy/recreate cycle. - -on: - workflow_run: - workflows: ["Release", "Production Deploy"] - types: [completed] - # Every non-README push to main also fires a prod relaunch directly, - # so fixes to the relaunch / deploy scripts get exercised even when - # they don't cascade through Release → Production Deploy. 
- push: - branches: [main] - paths-ignore: - - "README.md" - workflow_dispatch: - inputs: - kind: - description: 'prod | preview' - required: true - default: 'prod' - cp_url: - description: 'CP URL (e.g. https://app.devopsdefender.com)' - required: true - default: 'https://app.devopsdefender.com' - -permissions: - contents: read - pull-requests: read - -concurrency: - group: local-agents-${{ github.event.workflow_run.name || github.event.inputs.kind }} - cancel-in-progress: false - -jobs: - relaunch: - if: | - github.event_name == 'workflow_dispatch' - || github.event_name == 'push' - || github.event.workflow_run.conclusion == 'success' - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - id: pick - env: - GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} - EVENT: ${{ github.event_name }} - WF: ${{ github.event.workflow_run.name }} - BRANCH: ${{ github.event.workflow_run.head_branch }} - DISPATCH_KIND: ${{ github.event.inputs.kind }} - DISPATCH_URL: ${{ github.event.inputs.cp_url }} - run: | - if [ "$EVENT" = "workflow_dispatch" ]; then - echo "kind=$DISPATCH_KIND" >> "$GITHUB_OUTPUT" - echo "url=$DISPATCH_URL" >> "$GITHUB_OUTPUT" - elif [ "$EVENT" = "push" ] || [ "$WF" = "Production Deploy" ]; then - # push-to-main on local-agent scripts, or a prod CP redeploy - # → relaunch dd-local-prod against the live prod CP. - echo "kind=prod" >> "$GITHUB_OUTPUT" - echo "url=https://app.devopsdefender.com" >> "$GITHUB_OUTPUT" - else - # Release on a PR: derive pr-N. Released-on-main returns - # no open PR → skip (Production Deploy will fire shortly). - pr=$(gh pr list --head "$BRANCH" --state open \ - --repo "${{ github.repository }}" \ - --json number --jq '.[0].number' 2>/dev/null || true) - if [ -n "$pr" ]; then - echo "kind=preview" >> "$GITHUB_OUTPUT" - echo "url=https://pr-$pr.devopsdefender.com" >> "$GITHUB_OUTPUT" - else - echo "kind=skip" >> "$GITHUB_OUTPUT" - fi - fi - - # Step 1: SSH in and relaunch the VM (destroy + redefine + start). 
- # Finishes in ~10 s — doesn't need keepalives. Only does the - # libvirt operations that require host-level access. - - name: ssh + relaunch VM - if: steps.pick.outputs.kind != 'skip' - env: - SSH_KEY: ${{ secrets.DD_LOCAL_SSH_KEY }} - HOST: ${{ secrets.DD_LOCAL_HOST }} - DD_PAT: ${{ secrets.GITHUB_TOKEN }} - DD_ITA_API_KEY: ${{ secrets.DD_ITA_API_KEY }} - KIND: ${{ steps.pick.outputs.kind }} - URL: ${{ steps.pick.outputs.url }} - run: | - mkdir -p ~/.ssh - printf '%s\n' "$SSH_KEY" > ~/.ssh/id_ed25519 - chmod 600 ~/.ssh/id_ed25519 - ssh-keyscan -H "$HOST" >> ~/.ssh/known_hosts 2>/dev/null - ssh -o BatchMode=yes -o StrictHostKeyChecking=yes \ - -i ~/.ssh/id_ed25519 "tdx2@$HOST" \ - "DD_PAT='$DD_PAT' DD_ITA_API_KEY='$DD_ITA_API_KEY' /home/tdx2/src/dd/scripts/dd-relaunch.sh '$KIND' '$URL'" - - # Step 2: Deploy ollama / pull model / sample query. Pure HTTPS - # against the CP + the newly-registered agent's tunnel. Can take - # minutes (model pull) — no SSH to keep alive. - - name: deploy ollama (HTTPS) - if: steps.pick.outputs.kind != 'skip' - env: - DD_PAT: ${{ secrets.GITHUB_TOKEN }} - KIND: ${{ steps.pick.outputs.kind }} - URL: ${{ steps.pick.outputs.url }} - run: ./scripts/ollama-deploy.sh "$KIND" "$URL" diff --git a/.github/workflows/production-deploy.yml b/.github/workflows/production-deploy.yml deleted file mode 100644 index 8253179..0000000 --- a/.github/workflows/production-deploy.yml +++ /dev/null @@ -1,146 +0,0 @@ -name: Production Deploy - -# Two triggers: -# - workflow_run: fires automatically after a successful Release run -# on main. Release publishes the `latest` tag, then this workflow -# deploys it to production. Sequential by design — if Release fails, -# we don't promote. -# - workflow_dispatch: manual re-deploy of any existing tag (e.g. a -# known-good v0.2.0 after a bad main push). 
- -on: - workflow_run: - workflows: ["Release"] - types: [completed] - branches: [main] - workflow_dispatch: - inputs: - release_tag: - description: 'Release tag to deploy (e.g. latest, v0.2.0)' - required: false - default: 'latest' - -concurrency: - group: dd-production - cancel-in-progress: false - -env: - GCP_ZONE: us-central1-c - DD_ENV: production - DD_DOMAIN: ${{ vars.DD_CF_DOMAIN || 'devopsdefender.com' }} - -permissions: - contents: read - -jobs: - # dd-register STONITHs the old VM on startup by deleting its CF - # tunnel, so no explicit teardown here. - deploy: - # workflow_run fires on every Release completion, including - # failures. Only promote on success. - if: github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' - runs-on: ubuntu-latest - environment: production - permissions: - contents: read - id-token: write - steps: - - uses: actions/checkout@v4 - - uses: google-github-actions/auth@v2 - with: - workload_identity_provider: 'projects/779946350556/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider' - service_account: 'easyenclave-production-ci@easyenclave.iam.gserviceaccount.com' - - uses: google-github-actions/setup-gcloud@v2 - - - name: Create TDX VM (boots from easyenclave, fetches dd from GitHub releases) - env: - GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} - CLOUDFLARE_API_TOKEN: ${{ secrets.DD_CP_CF_API_TOKEN }} - CLOUDFLARE_ACCOUNT_ID: ${{ secrets.DD_CP_CF_ACCOUNT_ID }} - CLOUDFLARE_ZONE_ID: ${{ secrets.DD_CP_CF_ZONE_ID }} - DD_GITHUB_CLIENT_ID: ${{ vars.DD_GITHUB_CLIENT_ID || secrets.DD_GITHUB_CLIENT_ID }} - DD_GITHUB_CALLBACK_URL: ${{ vars.DD_GITHUB_CALLBACK_URL }} - DD_GITHUB_CLIENT_SECRET: ${{ secrets.DD_GITHUB_CLIENT_SECRET }} - # Intel Trust Authority — optional. When the secret is set, - # the CP mints its own ITA token and verifies incoming agent - # registrations. DD_ITA_REQUIRED stays false (default). 
- DD_ITA_API_KEY: ${{ secrets.DD_ITA_API_KEY }} - # workflow_run has no `inputs`; fall back to `latest`, which - # release.yml just (re)published on push to main. - DD_RELEASE_TAG: ${{ inputs.release_tag || 'latest' }} - run: scripts/gcp-deploy.sh - - - name: Wait for agent health - env: - AGENT_URL: https://app.${{ env.DD_DOMAIN }} - run: | - for i in $(seq 1 60); do - curl -fsS "${AGENT_URL}/health" >/dev/null 2>&1 && { - echo "Agent healthy at ${AGENT_URL}" - exit 0 - } - echo " waiting for tunnel... (${i}/60)" - sleep 5 - done - echo "::error::Agent not healthy within 5 minutes" - exit 1 - - - name: Verify dashboard renders - env: - AGENT_URL: https://app.${{ env.DD_DOMAIN }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - run: | - # New auth model: dashboard expects a GitHub PAT/GITHUB_TOKEN with - # access to the dd repo; the CP verifies against DD_OWNER via the - # standard /user + /repos/{owner}/dd fallback. No OIDC audience wiring. - for attempt in $(seq 1 12); do - code=$(curl -s -o /dev/null -w '%{http_code}' \ - -H "Authorization: Bearer ${GITHUB_TOKEN}" \ - "${AGENT_URL}/" || echo 000) - if [ "$code" = "200" ]; then - echo "Dashboard renders (HTTP 200, attempt ${attempt})" - exit 0 - fi - echo " dashboard returned HTTP ${code}, retrying... (${attempt}/12)" - sleep 5 - done - echo "::error::dashboard / never returned 200 (last HTTP ${code})" - exit 1 - - - name: Verify STONITH halted prior production VM(s) - env: - GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} - GCP_ZONE: ${{ env.GCP_ZONE }} - run: | - # Mirror of release.yml's verify-step for PR previews. Give - # STONITH-by-tunnel-delete 120s to work on well-behaved old - # prod VMs (their cloudflared exits → dd-register poweroffs - # → GCP TERMINATED → cleanup.yml reaps). After the timeout, - # force-delete any remaining RUNNING prod VMs so we don't - # leak compute indefinitely. 
- NEW_VM=$(gcloud compute instances list \ - --project="$GCP_PROJECT_ID" \ - --filter="labels.devopsdefender=managed AND labels.dd_env=production" \ - --format="value(name)" --sort-by=~creationTimestamp | head -1) - echo "new VM: $NEW_VM" - SURVIVORS="" - for i in $(seq 1 24); do - SURVIVORS=$(gcloud compute instances list \ - --project="$GCP_PROJECT_ID" \ - --filter="labels.devopsdefender=managed AND labels.dd_env=production AND status=RUNNING" \ - --format="value(name)" \ - | grep -vx "$NEW_VM" || true) - if [ -z "$SURVIVORS" ]; then - echo "STONITH verified — only $NEW_VM running in prod" - exit 0 - fi - echo " still running besides $NEW_VM: $(echo "$SURVIVORS" | tr '\n' ' ')" - echo " waiting for STONITH poweroff... (${i}/24)" - sleep 5 - done - echo "::warning::STONITH-by-tunnel-delete timed out in prod; force-deleting:" - echo "$SURVIVORS" - # shellcheck disable=SC2086 - gcloud compute instances delete $SURVIVORS \ - --project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet || true - echo "zombies reaped; $NEW_VM is the only production VM" diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index b238c85..7d55bbc 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -1,13 +1,18 @@ name: Release -# Build the static musl binary, publish it as a GitHub release asset, -# and (on PRs) deploy it to an ephemeral per-PR preview. Replaces the -# Docker build+push pipeline — easyenclave fetches the asset directly -# via its github_release workload source. +# One workflow to rule them all: build the static musl binary, publish +# it as a GitHub release asset, and deploy it to either the PR preview +# (per-PR ephemeral CP at pr-N.domain) or production (app.domain). Both +# paths cascade into a relaunch of the matching dd-local agent VM on +# the tdx2 host, and the Release run only goes green when that agent +# re-registers with the freshly-deployed CP. # -# PR: pre-release tagged pr-{sha12}, then full PR-preview deploy. 
-# push to main: rolling `latest` release (no deploy — that's production) -# push v* tag: versioned release (no deploy) +# Paths: +# pull_request → build → deploy-preview → dd-local-preview relaunch +# push main → build → deploy-production → dd-local-prod relaunch +# push v* → build only (versioned release, no deploy) +# workflow_dispatch → build → deploy-production (rollback tool; +# release_tag input picks which tag to deploy) on: push: @@ -18,10 +23,18 @@ on: pull_request: paths-ignore: - "README.md" + workflow_dispatch: + inputs: + release_tag: + description: 'Release tag to deploy to production (rollback tool; default: latest)' + required: false + default: 'latest' concurrency: group: dd-release-${{ github.ref }} - cancel-in-progress: true + # PR pushes cancel old runs. Main / tag / manual dispatch queue — + # we never want to cancel an in-progress prod deploy. + cancel-in-progress: ${{ github.event_name == 'pull_request' }} permissions: contents: write @@ -31,10 +44,6 @@ permissions: id-token: write attestations: write -env: - DD_DOMAIN: ${{ vars.DD_CF_DOMAIN || 'devopsdefender.com' }} - GCP_ZONE: us-central1-c - jobs: build: runs-on: ubuntu-latest @@ -79,10 +88,7 @@ jobs: # `https://github.com/devopsdefender/dd/.github/workflows/release.yml@`). # The attestation is stored on the repo's /attestations endpoint # and retrievable via `gh attestation verify` or the REST API. - # - # For now we're tracking (not enforcing) — the CP will eventually - # use this to verify that a registering agent's artifact came - # from this workflow. Skipped on fork PRs (they lack id-token). + # Skipped on fork PRs (they lack id-token). 
- name: Attest devopsdefender binary if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository uses: actions/attest-build-provenance@v2 @@ -117,227 +123,52 @@ jobs: | tail -n +12 \ | xargs -rI{} gh release delete {} --yes --cleanup-tag - # Deploy the freshly-built binary to the PR's ephemeral preview. - # Each PR gets its own env at pr-{N}.{domain} with DD_ENV=pr-{N} - # (hostname-isolated, no OAuth — browser access via /auth/pat). - # main/v* produce releases that production-deploy picks up separately. + # Per-PR ephemeral preview at pr-{N}.{domain}. No OAuth (browser login + # via /auth/pat). Cascades into dd-local-preview relaunch. deploy-preview: if: github.event_name == 'pull_request' needs: build - runs-on: ubuntu-latest - environment: staging permissions: contents: read id-token: write pull-requests: write - env: - DD_ENV: pr-${{ github.event.number }} - DD_HOSTNAME: pr-${{ github.event.number }}.${{ vars.DD_CF_DOMAIN || 'devopsdefender.com' }} - steps: - - uses: actions/checkout@v4 - - - uses: google-github-actions/auth@v2 - with: - workload_identity_provider: 'projects/654815109728/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider' - service_account: 'easyenclave-staging-ci@eestaging.iam.gserviceaccount.com' - - uses: google-github-actions/setup-gcloud@v2 - - - name: Create TDX VM (boots from easyenclave, fetches dd from GitHub releases) - env: - GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} - CLOUDFLARE_API_TOKEN: ${{ secrets.DD_CP_CF_API_TOKEN }} - CLOUDFLARE_ACCOUNT_ID: ${{ secrets.DD_CP_CF_ACCOUNT_ID }} - CLOUDFLARE_ZONE_ID: ${{ secrets.DD_CP_CF_ZONE_ID }} - # OAuth env vars intentionally omitted — gcp-deploy.sh sees - # empty DD_GITHUB_CLIENT_ID and skips them in the workload - # spec. dd-web then disables /auth/github/* and serves - # /auth/pat for browser access. - # - # Intel Trust Authority — optional. 
When the secret is set, - # the CP mints its own ITA token at startup and verifies - # agent-supplied tokens on /register. DD_ITA_REQUIRED stays - # false (default) so unsigned agents still register. - DD_ITA_API_KEY: ${{ secrets.DD_ITA_API_KEY }} - DD_RELEASE_TAG: ${{ needs.build.outputs.tag }} - run: scripts/gcp-deploy.sh - - - name: Wait for agent health (streams serial console) - env: - AGENT_URL: https://${{ env.DD_HOSTNAME }} - GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} - GCP_ZONE: ${{ env.GCP_ZONE }} - run: | - VM_NAME=$(gcloud compute instances list \ - --project="$GCP_PROJECT_ID" \ - --filter="labels.devopsdefender=managed AND labels.dd_env=${DD_ENV}" \ - --format="value(name)" --sort-by=~creationTimestamp | head -1) - if [ -z "$VM_NAME" ]; then - echo "::error::no dd-${DD_ENV} VM found — gcp-deploy.sh must have failed" - exit 1 - fi - echo "Watching VM: $VM_NAME (zone: $GCP_ZONE)" - - LAST_LINES=0 - for i in $(seq 1 60); do - # Stream serial console so boot failures (DHCP hang, GitHub - # release fetch error, cloudflared exit, etc.) are visible - # without shelling into GCP. - gcloud compute instances get-serial-port-output "$VM_NAME" \ - --project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" 2>/dev/null \ - > /tmp/serial.log || true - TOTAL_LINES=$(wc -l < /tmp/serial.log) - if [ "$TOTAL_LINES" -gt "$LAST_LINES" ]; then - tail -n +$((LAST_LINES + 1)) /tmp/serial.log \ - | sed 's/^/[serial] /' - LAST_LINES=$TOTAL_LINES - fi - - if grep -qE "FATAL|Kernel panic|Invalid ELF header|/bin/sh: can't access tty" /tmp/serial.log; then - echo "::error::boot failed — serial log shows fatal pattern" - exit 1 - fi - - # /health via the Cloudflare tunnel tests the full chain: - # VM boot → easyenclave init → github_release fetch of dd + - # cloudflared → cloudflared tunnel up. - if curl -fsS "${AGENT_URL}/health" >/dev/null 2>&1; then - echo "Agent healthy at ${AGENT_URL}" - exit 0 - fi - echo " waiting for tunnel... 
(${i}/60)" - sleep 5 - done - echo "::error::Agent not healthy within 5 minutes" - echo "--- final serial tail ---" - tail -80 /tmp/serial.log | sed 's/^/[serial] /' - exit 1 - - - name: Verify NEW VM via TDX attestation - env: - AGENT_URL: https://${{ env.DD_HOSTNAME }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - run: | - # /cp/attest proves the freshly-deployed VM is serving the tunnel - # (stale tunnels point at old VMs that 404 on this endpoint). - # Auth: GITHUB_TOKEN via Bearer — the CP's /repos/{owner}/dd probe - # accepts any token with repo access. No OIDC audience wiring. - NONCE=$(openssl rand -base64 16) - - # 60 × 10s = 10 min. New VM has to boot, fetch cloudflared - # and dd from GitHub releases, start, and bring its tunnel up. - for attempt in $(seq 1 60); do - BODY=$(curl -sG -w '\n%{http_code}' \ - -H "Authorization: Bearer ${GITHUB_TOKEN}" \ - --data-urlencode "nonce=${NONCE}" \ - "${AGENT_URL}/cp/attest" || echo $'\n000') - CODE=$(echo "$BODY" | tail -n1) - JSON=$(echo "$BODY" | sed '$d') - if [ "$CODE" = "200" ]; then - QUOTE_B64=$(echo "$JSON" | jq -r '.quote_b64 // empty') - if [ -n "$QUOTE_B64" ] && [ "$QUOTE_B64" != "null" ]; then - # MRTD = 48 bytes at offset 184 in TDX quote v4. - # If it's non-zero, attestation actually worked. - MRTD=$(echo "$QUOTE_B64" | base64 -d \ - | dd bs=1 skip=184 count=48 status=none | xxd -p -c 48) - if [ -n "$MRTD" ] && [ "$MRTD" != "$(printf '00%.0s' {1..48})" ]; then - echo "NEW VM verified — MRTD: $MRTD" - exit 0 - fi - echo " /cp/attest 200 but MRTD empty/zero, retrying... (${attempt}/60)" - else - echo " /cp/attest 200 but no quote_b64, retrying... (${attempt}/60)" - fi - else - echo " /cp/attest returned HTTP ${CODE}, retrying... 
(${attempt}/60)" - fi - sleep 10 - done - echo "::error::/cp/attest never returned a valid quote — stale tunnel or new VM never came up" - exit 1 - - - name: Verify STONITH halted prior VM(s) in this env - env: - GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} - GCP_ZONE: ${{ env.GCP_ZONE }} - run: | - # STONITH (dd-register deletes the old tunnel → old cloudflared - # exits → old dd-register poweroffs the VM) is the ONLY cleanup - # mechanism. Scoped to this PR's env — previews are - # hostname-isolated from each other, so this only reaps prior - # deploys of the same PR (re-pushes). - NEW_VM=$(gcloud compute instances list \ - --project="$GCP_PROJECT_ID" \ - --filter="labels.devopsdefender=managed AND labels.dd_env=${DD_ENV}" \ - --format="value(name)" --sort-by=~creationTimestamp | head -1) - echo "new VM: $NEW_VM" - - # Give STONITH-by-tunnel-delete 120s to work on well-behaved - # old VMs (their cloudflared exits → dd-register poweroffs). - # After that, force-delete any remaining survivors: they're - # zombies whose dd-register failed before creating a tunnel - # (e.g. CF auth error at boot — see src/cp.rs - # which now kernel_poweroff's on init failure, so this is a - # safety net for pre-fix zombies and any future init failure - # modes we haven't handled). - SURVIVORS="" - for i in $(seq 1 24); do - SURVIVORS=$(gcloud compute instances list \ - --project="$GCP_PROJECT_ID" \ - --filter="labels.devopsdefender=managed AND labels.dd_env=${DD_ENV} AND status=RUNNING" \ - --format="value(name)" \ - | grep -vx "$NEW_VM" || true) - if [ -z "$SURVIVORS" ]; then - echo "STONITH verified — only $NEW_VM running in dd_env=${DD_ENV}" - exit 0 - fi - echo " still running besides $NEW_VM: $(echo "$SURVIVORS" | tr '\n' ' ')" - echo " waiting for STONITH poweroff... 
(${i}/24)" - sleep 5 - done - echo "::warning::STONITH-by-tunnel-delete timed out; force-deleting zombies:" - echo "$SURVIVORS" - # shellcheck disable=SC2086 - gcloud compute instances delete $SURVIVORS \ - --project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet || true - echo "zombies reaped; $NEW_VM is the only $DD_ENV VM" - - - name: Comment preview URL on PR - uses: actions/github-script@v7 - with: - script: | - const url = `https://${process.env.DD_HOSTNAME}`; - const body = [ - `### DD preview ready`, - ``, - `**URL:** ${url}`, - ``, - `Browser login: paste \`gh auth token\` output at ${url}/auth/pat`, - ``, - `CLI / curl: \`curl -H "Authorization: Bearer $(gh auth token)" ${url}/\``, - ``, - `Register endpoint for a local agent: \`wss://${process.env.DD_HOSTNAME}/register\``, - ].join('\n'); - - // Update existing bot comment if present, else create. - const { data: comments } = await github.rest.issues.listComments({ - owner: context.repo.owner, - repo: context.repo.repo, - issue_number: context.issue.number, - }); - const marker = '### DD preview ready'; - const existing = comments.find(c => c.user.type === 'Bot' && c.body && c.body.includes(marker)); - if (existing) { - await github.rest.issues.updateComment({ - owner: context.repo.owner, - repo: context.repo.repo, - comment_id: existing.id, - body, - }); - } else { - await github.rest.issues.createComment({ - owner: context.repo.owner, - repo: context.repo.repo, - issue_number: context.issue.number, - body, - }); - } + uses: ./.github/workflows/deploy-cp.yml + with: + env: pr-${{ github.event.number }} + hostname: pr-${{ github.event.number }}.${{ vars.DD_CF_DOMAIN || 'devopsdefender.com' }} + gcp_environment: staging + workload_identity_provider: 'projects/654815109728/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider' + service_account: 'easyenclave-staging-ci@eestaging.iam.gserviceaccount.com' + release_tag: ${{ needs.build.outputs.tag }} + oauth_enabled: false + 
comment_on_pr: true + ref: ${{ github.event.pull_request.head.ref }} + secrets: inherit + + # Production deploy at app.{domain}. Fires on push-to-main OR on a + # manual workflow_dispatch (rollback to a specific release_tag). + # Tag pushes (v*) intentionally do not auto-deploy — they just + # publish the artifact. Cascades into dd-local-prod relaunch. + deploy-production: + if: >- + (github.event_name == 'push' && github.ref == 'refs/heads/main') + || github.event_name == 'workflow_dispatch' + needs: build + permissions: + contents: read + id-token: write + # Granted (though unused — comment_on_pr=false here) so the + # permissions intersection with deploy-cp.yml's job matches. + pull-requests: write + uses: ./.github/workflows/deploy-cp.yml + with: + env: production + hostname: app.${{ vars.DD_CF_DOMAIN || 'devopsdefender.com' }} + gcp_environment: production + workload_identity_provider: 'projects/779946350556/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider' + service_account: 'easyenclave-production-ci@easyenclave.iam.gserviceaccount.com' + release_tag: ${{ inputs.release_tag || 'latest' }} + oauth_enabled: true + comment_on_pr: false + ref: main + secrets: inherit diff --git a/.github/workflows/retire-staging.yml b/.github/workflows/retire-staging.yml deleted file mode 100644 index dbf4205..0000000 --- a/.github/workflows/retire-staging.yml +++ /dev/null @@ -1,98 +0,0 @@ -name: Retire Staging - -# One-shot cleanup of the shared `app-staging.{domain}` env. -# Per-PR preview envs (deploy-preview in release.yml) fully replace -# the old shared staging, so this workflow reaps whatever's left: -# - any `dd_env=staging` VMs (RUNNING or TERMINATED) -# - any CF tunnels named `dd-staging-*` -# - the CF CNAME for `app-staging.{domain}` -# -# Idempotent — skips silently if nothing to delete. Run manually once -# via the Actions UI, then this workflow can be deleted. 
- -on: - workflow_dispatch: - -permissions: - contents: read - -env: - DD_DOMAIN: ${{ vars.DD_CF_DOMAIN || 'devopsdefender.com' }} - GCP_ZONE: us-central1-c - -jobs: - retire: - runs-on: ubuntu-latest - environment: staging - permissions: - contents: read - id-token: write - steps: - - uses: google-github-actions/auth@v2 - with: - workload_identity_provider: 'projects/654815109728/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider' - service_account: 'easyenclave-staging-ci@eestaging.iam.gserviceaccount.com' - - uses: google-github-actions/setup-gcloud@v2 - - - name: Delete staging VMs - env: - GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }} - run: | - VMS=$(gcloud compute instances list \ - --project="$GCP_PROJECT_ID" \ - --filter="labels.devopsdefender=managed AND labels.dd_env=staging" \ - --format="value(name)") - if [ -z "$VMS" ]; then - echo "No dd-staging VMs to delete." - else - echo "Deleting: $(echo "$VMS" | tr '\n' ' ')" - # shellcheck disable=SC2086 - gcloud compute instances delete $VMS \ - --project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet - fi - - - name: Delete CF tunnels with dd-staging- prefix - env: - CF_API_TOKEN: ${{ secrets.DD_CP_CF_API_TOKEN }} - CF_ACCOUNT_ID: ${{ secrets.DD_CP_CF_ACCOUNT_ID }} - run: | - resp=$(curl -fsS \ - -H "Authorization: Bearer ${CF_API_TOKEN}" \ - "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/cfd_tunnel?is_deleted=false&per_page=200") - ids=$(echo "$resp" | jq -r \ - '.result[] | select(.name | startswith("dd-staging-")) | .id') - if [ -z "$ids" ]; then - echo "No CF tunnels with prefix dd-staging-" - else - for id in $ids; do - echo "Deleting tunnel $id" - curl -fsS -X DELETE \ - -H "Authorization: Bearer ${CF_API_TOKEN}" \ - "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/cfd_tunnel/${id}/connections" \ - >/dev/null || true - curl -fsS -X DELETE \ - -H "Authorization: Bearer ${CF_API_TOKEN}" \ - 
"https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/cfd_tunnel/${id}" \ - >/dev/null || echo "::warning::tunnel $id delete failed" - done - fi - - - name: Delete CNAME for app-staging - env: - CF_API_TOKEN: ${{ secrets.DD_CP_CF_API_TOKEN }} - CF_ZONE_ID: ${{ secrets.DD_CP_CF_ZONE_ID }} - run: | - host="app-staging.${DD_DOMAIN}" - record_id=$(curl -fsS \ - -H "Authorization: Bearer ${CF_API_TOKEN}" \ - "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records?type=CNAME&name=${host}" \ - | jq -r '.result[0].id // empty') - if [ -z "$record_id" ]; then - echo "No CNAME for ${host}" - else - echo "Deleting CNAME record $record_id (${host})" - curl -fsS -X DELETE \ - -H "Authorization: Bearer ${CF_API_TOKEN}" \ - "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records/${record_id}" \ - >/dev/null - fi diff --git a/README.md b/README.md index 289d19c..f8a1df1 100644 --- a/README.md +++ b/README.md @@ -39,7 +39,7 @@ The `devopsdefender` binary ships as a **GitHub release asset** — not an OCI i `cloudflared` is also pulled directly from `cloudflare/cloudflared`'s GitHub releases as a fetch-only boot workload — no bundling in our image, no Dockerfile step. -Per-VM configuration (CF credentials, GitHub OAuth, the workload spec itself) is passed to easyenclave at boot via **GCE instance metadata** (`ee-config` attribute), read by `easyenclave::init::fetch_gce_metadata_config()` and applied as env vars. `scripts/gcp-deploy.sh` builds the spec and invokes `gcloud compute instances create --image-family=easyenclave-staging --metadata-from-file=ee-config=...`. +Per-VM configuration (CF credentials, GitHub OAuth, the workload spec itself) is passed to easyenclave at boot via **GCE instance metadata** (`ee-config` attribute), read by `easyenclave::init::fetch_gce_metadata_config()` and applied as env vars. 
The CP-deploy step in `.github/workflows/deploy-cp.yml` builds the spec and invokes `gcloud compute instances create --image-family=easyenclave-staging --metadata-from-file=ee-config=...`. ## CI/CD @@ -48,16 +48,18 @@ PR → pre-release tagged pr-{sha12}, then ephemeral preview at pr- branch deleted → pr-teardown.yml deletes the preview's VM, CF tunnel, and DNS push to main → rolling `latest` release, then auto-deploy to production push v* tag → versioned release (no auto-deploy) -manual → production-deploy.yml promotes any existing tag +manual dispatch → redeploy any existing tag to production (rollback tool) ``` -Each PR gets its own isolated env at `pr-{N}.{domain}` with `DD_ENV=pr-{N}` — no more shared staging tier. `.github/workflows/release.yml` builds the static musl binary, publishes it as a GitHub release asset, deploys the PR's preview VM, and posts the URL back to the PR. The preview VM is verified via: +Every path lives in `.github/workflows/release.yml`: one `build` job, then either `deploy-preview` (PR) or `deploy-production` (main / dispatch), both calling the reusable `deploy-cp.yml` with env-specific inputs. Each cascades into a relaunch of the matching `dd-local-{env}` VM on the tdx2 host — the Release run only goes green when that agent re-registers with the freshly-deployed CP. Verifications along the way: 1. `/health` via the Cloudflare tunnel 2. `/cp/attest` returning a real TDX MRTD (cryptographic proof the freshly-deployed VM is running — old VMs don't have the endpoint and return 404) -3. No other `dd-pr-{N}-*` VM is RUNNING after deploy (STONITH must have halted the previous instance of this PR) +3. Dashboard `/` returning HTTP 200 under a Bearer PAT +4. No other `dd-{env}-*` VM is RUNNING after deploy (STONITH must have halted the previous instance) +5. `dd-local-{env}` re-registers with the new CP within 5 min -Browser access to a PR preview goes through `/auth/pat` (paste a GitHub PAT, validated against `DD_OWNER`). 
OAuth is only wired for production, which `production-deploy.yml` still targets at `app.{domain}`. +Browser access to a PR preview goes through `/auth/pat` (paste a GitHub PAT, validated against `DD_OWNER`). OAuth is only wired for production, at `app.{domain}`. ## STONITH diff --git a/apps/_infra/dd-relaunch.sh b/apps/_infra/dd-relaunch.sh new file mode 100755 index 0000000..55d380d --- /dev/null +++ b/apps/_infra/dd-relaunch.sh @@ -0,0 +1,52 @@ +#!/usr/bin/env bash +# dd-relaunch.sh — destroy and recreate one local TDX agent VM. +# +# Invoked over SSH by .github/actions/relaunch-agent during a Release +# cascade. Pulls the PR's (or main's) apps/_infra tree so this script +# and local-agents.sh are always the ones the caller authored. Tears +# down the existing VM + overlay, runs local-agents.sh to redefine, +# and starts the VM. +# +# dd-relaunch.sh prod https://app.devopsdefender.com main +# dd-relaunch.sh preview https://pr-N.devopsdefender.com feat/some-pr +# +# DD_PAT and DD_ITA_API_KEY must be set in the environment. + +set -euo pipefail + +KIND="${1?usage: dd-relaunch.sh [ref]}" +CP="${2?cp url required}" +REF="${3:-main}" +: "${DD_PAT?DD_PAT must be set}" +: "${DD_ITA_API_KEY?DD_ITA_API_KEY must be set}" + +case "$KIND" in + prod|preview) ;; + *) echo "unknown kind: $KIND (want prod|preview)" >&2; exit 2 ;; +esac + +cd /home/tdx2/src/dd + +# Refresh the infra scripts + apps/ tree from the caller's ref. Limited +# checkout so a dirty working tree elsewhere doesn't block the deploy. +# This script is already in memory, so the refresh takes effect on the +# *next* invocation. 
+git fetch --quiet origin "$REF" +git checkout --quiet "origin/$REF" -- apps/ +echo "dd-relaunch: refreshed apps/ from origin/$REF" + +vm="dd-local-$KIND" +overlay="/var/lib/libvirt/images/$vm.qcow2" + +virsh destroy "$vm" 2>/dev/null || true +virsh undefine "$vm" --managed-save --snapshots-metadata 2>/dev/null || true +rm -f "$overlay" + +# Redefine via local-agents.sh; "" skips the other slot. +case "$KIND" in + prod) ./apps/_infra/local-agents.sh "" "$CP" ;; + preview) ./apps/_infra/local-agents.sh "$CP" "" ;; +esac + +virsh start "$vm" +echo "relaunched $vm against $CP" diff --git a/scripts/local-agents.sh b/apps/_infra/local-agents.sh similarity index 95% rename from scripts/local-agents.sh rename to apps/_infra/local-agents.sh index 17c61fa..20b772a 100755 --- a/scripts/local-agents.sh +++ b/apps/_infra/local-agents.sh @@ -12,11 +12,11 @@ # Usage: # export DD_PAT="$(gh auth token)" # export DD_ITA_API_KEY="$(cat ~/.secrets/ita_api_key)" -# ./scripts/local-agents.sh https://pr-106.devopsdefender.com https://app.devopsdefender.com +# ./apps/_infra/local-agents.sh https://pr-106.devopsdefender.com https://app.devopsdefender.com # # Pass "" for either URL to skip defining that VM: -# ./scripts/local-agents.sh "" https://app.devopsdefender.com # prod only -# ./scripts/local-agents.sh https://pr-N.devopsdefender.com "" # preview only +# ./apps/_infra/local-agents.sh "" https://app.devopsdefender.com # prod only +# ./apps/_infra/local-agents.sh https://pr-N.devopsdefender.com "" # preview only # # After: virsh start dd-local-preview && virsh start dd-local-prod @@ -255,3 +255,8 @@ echo echo "watch registration (Ctrl-] to exit):" [ -n "$PREVIEW_CP" ] && echo " virsh console dd-local-preview" [ -n "$PROD_CP" ] && echo " virsh console dd-local-prod" + +# Explicit 0 — the tail `[ -n "$PROD_CP" ] && …` returns 1 when +# PROD_CP="" (preview-only), bubbling up as the script exit status +# and tripping set -e in dd-relaunch.sh. Force success. 
+exit 0 diff --git a/apps/cloudflared/workload.json b/apps/cloudflared/workload.json new file mode 100644 index 0000000..1b2270a --- /dev/null +++ b/apps/cloudflared/workload.json @@ -0,0 +1,8 @@ +{ + "app_name": "cloudflared", + "github_release": { + "repo": "cloudflare/cloudflared", + "asset": "cloudflared-linux-amd64", + "rename": "cloudflared" + } +} diff --git a/apps/dd-agent/workload.json.tmpl b/apps/dd-agent/workload.json.tmpl new file mode 100644 index 0000000..a0e6d04 --- /dev/null +++ b/apps/dd-agent/workload.json.tmpl @@ -0,0 +1,22 @@ +{ + "app_name": "dd-agent", + "github_release": { + "repo": "devopsdefender/dd", + "asset": "devopsdefender", + "tag": "latest" + }, + "cmd": ["devopsdefender", "agent"], + "env": [ + "DD_MODE=agent", + "DD_CP_URL=${DD_CP_URL}", + "DD_PAT=${DD_PAT}", + "DD_ITA_API_KEY=${DD_ITA_API_KEY}", + "DD_ITA_BASE_URL=https://api.trustauthority.intel.com", + "DD_ITA_JWKS_URL=https://portal.trustauthority.intel.com/certs", + "DD_ITA_ISSUER=https://portal.trustauthority.intel.com", + "DD_OWNER=devopsdefender", + "DD_ENV=${DD_ENV}", + "DD_VM_NAME=${DD_VM_NAME}", + "DD_PORT=8080" + ] +} diff --git a/apps/dd-management/workload.json.tmpl b/apps/dd-management/workload.json.tmpl new file mode 100644 index 0000000..e8fbe17 --- /dev/null +++ b/apps/dd-management/workload.json.tmpl @@ -0,0 +1,29 @@ +{ + "app_name": "dd-management", + "github_release": { + "repo": "devopsdefender/dd", + "asset": "devopsdefender", + "tag": "${DD_RELEASE_TAG}" + }, + "cmd": ["devopsdefender"], + "env": [ + "DD_MODE=management", + "DD_CF_API_TOKEN=${CLOUDFLARE_API_TOKEN}", + "DD_CF_ACCOUNT_ID=${CLOUDFLARE_ACCOUNT_ID}", + "DD_CF_ZONE_ID=${CLOUDFLARE_ZONE_ID}", + "DD_CF_DOMAIN=${DD_DOMAIN}", + "DD_HOSTNAME=${DD_HOSTNAME}", + "DD_ENV=${DD_ENV}", + "DD_OWNER=devopsdefender", + "DD_REGISTER_PORT=8081", + "DD_OIDC_AUDIENCE=dd-web", + "DD_PORT=8080", + "DD_GITHUB_CLIENT_ID=${DD_GITHUB_CLIENT_ID}", + "DD_GITHUB_CLIENT_SECRET=${DD_GITHUB_CLIENT_SECRET}", + 
"DD_GITHUB_CALLBACK_URL=${DD_GITHUB_CALLBACK_URL}", + "DD_ITA_API_KEY=${DD_ITA_API_KEY}", + "DD_ITA_BASE_URL=${DD_ITA_BASE_URL}", + "DD_ITA_JWKS_URL=${DD_ITA_JWKS_URL}", + "DD_ITA_ISSUER=${DD_ITA_ISSUER}" + ] +} diff --git a/apps/mount-models/workload.json b/apps/mount-models/workload.json new file mode 100644 index 0000000..94111d8 --- /dev/null +++ b/apps/mount-models/workload.json @@ -0,0 +1,7 @@ +{ + "app_name": "mount-models", + "cmd": [ + "/bin/busybox", "sh", "-c", + "mkdir -p /var/lib/easyenclave/ollama && mount /dev/vdc /var/lib/easyenclave/ollama && echo mount-models: ok; sleep inf" + ] +} diff --git a/apps/nv/workload.json b/apps/nv/workload.json new file mode 100644 index 0000000..047ed3d --- /dev/null +++ b/apps/nv/workload.json @@ -0,0 +1,7 @@ +{ + "app_name": "nv", + "cmd": [ + "/bin/busybox", "sh", "-c", + "/sbin/insmod /lib/modules/7.0.0-14-generic/kernel/nvidia-580srv-open/nvidia.ko NVreg_OpenRmEnableUnsupportedGpus=1 2>&1 && echo nv: loaded || echo nv: failed; sleep inf" + ] +} diff --git a/apps/ollama/workload.preview.json b/apps/ollama/workload.preview.json new file mode 100644 index 0000000..8455622 --- /dev/null +++ b/apps/ollama/workload.preview.json @@ -0,0 +1,7 @@ +{ + "app_name": "ollama", + "cmd": [ + "/bin/busybox", "sh", "-c", + "until [ -x /var/lib/easyenclave/bin/dd-podman ]; do sleep 2; done\nexec /var/lib/easyenclave/bin/dd-podman run --rm --name ollama --network=host -v /var/lib/easyenclave/ollama:/root/.ollama -e OLLAMA_HOST=127.0.0.1:11434 docker.io/ollama/ollama:latest serve" + ] +} diff --git a/apps/ollama/workload.prod.json b/apps/ollama/workload.prod.json new file mode 100644 index 0000000..eae4a9a --- /dev/null +++ b/apps/ollama/workload.prod.json @@ -0,0 +1,7 @@ +{ + "app_name": "ollama", + "cmd": [ + "/bin/busybox", "sh", "-c", + "until [ -x /var/lib/easyenclave/bin/dd-podman ]; do sleep 2; done\nexec /var/lib/easyenclave/bin/dd-podman run --rm --name ollama --network=host --device=/dev/nvidia0 --device=/dev/nvidiactl 
--device=/dev/nvidia-uvm -v /var/lib/easyenclave/ollama:/root/.ollama -e OLLAMA_HOST=127.0.0.1:11434 docker.io/ollama/ollama:latest serve" + ] +} diff --git a/apps/openclaw/workload.json.tmpl b/apps/openclaw/workload.json.tmpl new file mode 100644 index 0000000..6f9087d --- /dev/null +++ b/apps/openclaw/workload.json.tmpl @@ -0,0 +1,7 @@ +{ + "app_name": "openclaw", + "cmd": [ + "/bin/busybox", "sh", "-c", + "echo 'openclaw: waiting for ollama on 127.0.0.1:11434...'\ni=0\nuntil /bin/busybox wget -q -T 3 -O- http://127.0.0.1:11434/api/tags >/dev/null 2>&1; do\n i=$((i+1))\n if [ $((i % 6)) -eq 0 ]; then echo \"openclaw: still waiting for ollama ($i tries, ${i}x5s elapsed)\"; fi\n sleep 5\ndone\necho 'openclaw: ollama responding, pulling model ${MODEL}'\n/var/lib/easyenclave/bin/dd-podman exec ollama ollama pull ${MODEL} 2>&1\necho 'openclaw: model pulled, launching gateway'\nexec /var/lib/easyenclave/bin/dd-podman exec ollama ollama launch openclaw --model ${MODEL} --yes" + ] +} diff --git a/apps/podman-bootstrap/workload.json b/apps/podman-bootstrap/workload.json new file mode 100644 index 0000000..5a797e4 --- /dev/null +++ b/apps/podman-bootstrap/workload.json @@ -0,0 +1,7 @@ +{ + "app_name": "podman-bootstrap", + "cmd": [ + "/bin/busybox", "sh", "-c", + "set -e\nBIN=/var/lib/easyenclave/bin\nSRC=$BIN/podman-linux-amd64\nuntil [ -x $SRC/usr/local/bin/podman ]; do sleep 1; done\n# If there's a vdc scratch disk, wait for mount-models to actually\n# mount it before we write files under /var/lib/easyenclave/ollama —\n# otherwise our writes land on the rootfs tmpfs and get shadowed the\n# moment vdc is mounted. 
On VMs without vdc (GCP CP preview) there's\n# no mount-models workload and this check short-circuits.\nif [ -b /dev/vdc ]; then\n until mountpoint -q /var/lib/easyenclave/ollama 2>/dev/null; do sleep 1; done\nfi\nmkdir -p /var/lib/easyenclave/ollama\ncp -f $SRC/usr/local/bin/* $BIN/\ncp -f $SRC/usr/local/lib/podman/conmon $BIN/\ncp -f $SRC/usr/local/lib/podman/netavark $BIN/ 2>/dev/null || true\ncp -f $SRC/usr/local/lib/podman/aardvark-dns $BIN/ 2>/dev/null || true\ncp -f $SRC/usr/local/lib/podman/rootlessport $BIN/ 2>/dev/null || true\nmkdir -p /var/lib/easyenclave/ollama/.podman/storage /var/lib/easyenclave/ollama/.podman/runroot\n# /dev/shm is where podman puts its per-container POSIX shm lock\n# file (libpod_lock). EE's guest rootfs may not mount tmpfs on\n# /dev/shm; without it, podman fails 'failed to create 2048 locks\n# in /libpod_lock: no such file or directory'. mkdir + mount idempotently.\nif ! mountpoint -q /dev/shm 2>/dev/null; then\n mkdir -p /dev/shm\n mount -t tmpfs -o size=64M tmpfs /dev/shm 2>/dev/null || true\nfi\n# Pick storage driver based on substrate. vdc-backed ext4 supports\n# native overlay (fast + space-efficient). Without vdc (GCP CP\n# preview, any guest running on tmpfs rootfs), overlay-on-tmpfs\n# errors out, so fall back to vfs (slower, full copy per layer, but\n# works on any filesystem).\nif mountpoint -q /var/lib/easyenclave/ollama; then\n DRIVER=overlay\nelse\n DRIVER=vfs\nfi\n# Write containers.conf on vdc (writable). /etc is RO on EE so we\n# can't put it where podman looks by default. helper_binaries_dir\n# tells podman where we staged conmon/netavark/aardvark-dns/… —\n# podman probes those at startup even with --network=host.\nPOL=/var/lib/easyenclave/ollama/.podman/policy.json\n# Minimum viable signature policy: trust anything. 
EE's attestation\n# story happens one layer up (image digest pinned by the spec we\n# baked); podman's own signature checking would duplicate that.\nprintf '%s' '{\"default\":[{\"type\":\"insecureAcceptAnything\"}]}' > $POL\n# Podman's containers-common looks for policy.json at hardcoded\n# paths (/etc/containers/, $HOME/.config/containers/). /etc and\n# /root are both RO on EE, so build a fake HOME under\n# /var/lib/easyenclave/.home (writable) and set HOME there in the\n# dd-podman wrapper.\nHOME_DIR=/var/lib/easyenclave/.home\nmkdir -p $HOME_DIR/.config/containers\ncp -f $POL $HOME_DIR/.config/containers/policy.json\nCONF=/var/lib/easyenclave/ollama/.podman/containers.conf\nprintf '%s\\n' '[engine]' 'helper_binaries_dir = [\"/var/lib/easyenclave/bin\"]' > $CONF\nmkdir -p $HOME_DIR/tmp\nprintf '%s\\n' '#!/bin/sh' \"export HOME=$HOME_DIR\" \"export TMPDIR=$HOME_DIR/tmp\" \"export CONTAINERS_CONF=$CONF\" \"exec /var/lib/easyenclave/bin/podman --conmon=/var/lib/easyenclave/bin/conmon --runtime=/var/lib/easyenclave/bin/crun --storage-driver=$DRIVER --root=/var/lib/easyenclave/ollama/.podman/storage --runroot=/var/lib/easyenclave/ollama/.podman/runroot --cgroup-manager=cgroupfs \\\"\\$@\\\"\" > $BIN/dd-podman\nchmod +x $BIN/dd-podman\nls -la $CONF $POL $BIN/dd-podman 2>&1 || true\ncat $CONF\necho podman-bootstrap: v2 ok driver=$DRIVER conf=$CONF policy=$POL" + ] +} diff --git a/apps/podman-static/workload.json b/apps/podman-static/workload.json new file mode 100644 index 0000000..939125d --- /dev/null +++ b/apps/podman-static/workload.json @@ -0,0 +1,7 @@ +{ + "app_name": "podman-static", + "github_release": { + "repo": "mgoltzsche/podman-static", + "asset": "podman-linux-amd64.tar.gz" + } +} diff --git a/scripts/dd-relaunch.sh b/scripts/dd-relaunch.sh deleted file mode 100755 index bdf1d8d..0000000 --- a/scripts/dd-relaunch.sh +++ /dev/null @@ -1,53 +0,0 @@ -#!/usr/bin/env bash -# dd-relaunch.sh — destroy and recreate one local TDX agent VM. 
-# -# Invoked over SSH by .github/workflows/local-agents.yml after a -# Release / Production Deploy succeeds. Pulls the current main of dd -# (so this script and local-agents.sh are always the latest), tears -# down the existing VM + overlay, runs scripts/local-agents.sh to -# redefine, and starts the VM. -# -# dd-relaunch.sh prod https://app.devopsdefender.com -# dd-relaunch.sh preview https://pr-N.devopsdefender.com -# -# DD_PAT and DD_ITA_API_KEY must be set in the environment. - -set -euo pipefail - -KIND="${1?usage: dd-relaunch.sh }" -CP="${2?cp url required}" -: "${DD_PAT?DD_PAT must be set}" -: "${DD_ITA_API_KEY?DD_ITA_API_KEY must be set}" - -case "$KIND" in - prod|preview) ;; - *) echo "unknown kind: $KIND (want prod|preview)" >&2; exit 2 ;; -esac - -cd /home/tdx2/src/dd - -# Pull the latest scripts. Limit the checkout to the two scripts so a -# dirty working tree elsewhere doesn't block the deploy. The relaunch -# script itself has already been read into memory by bash, so the -# update takes effect on the *next* invocation. -git fetch --quiet origin main -git checkout --quiet origin/main -- scripts/local-agents.sh scripts/dd-relaunch.sh - -vm="dd-local-$KIND" -overlay="/var/lib/libvirt/images/$vm.qcow2" - -virsh destroy "$vm" 2>/dev/null || true -virsh undefine "$vm" --managed-save --snapshots-metadata 2>/dev/null || true -rm -f "$overlay" - -# Redefine via local-agents.sh; "" skips the other slot. -case "$KIND" in - prod) ./scripts/local-agents.sh "" "$CP" ;; - preview) ./scripts/local-agents.sh "$CP" "" ;; -esac - -virsh start "$vm" -echo "relaunched $vm against $CP" - -# ollama deploy + pull + query is driven from the workflow's HTTPS step -# on ubuntu-latest, not here — see .github/workflows/local-agents.yml. 
diff --git a/scripts/gcp-deploy.sh b/scripts/gcp-deploy.sh deleted file mode 100755 index 96eed6b..0000000 --- a/scripts/gcp-deploy.sh +++ /dev/null @@ -1,177 +0,0 @@ -#!/bin/bash -# gcp-deploy.sh — Create a TDX management VM on GCP that boots from a -# sealed easyenclave image and runs dd management as a native process. -# -# Both the devopsdefender binary and cloudflared are fetched straight -# from their GitHub releases by easyenclave's github_release workload -# source — no OCI registry, no Dockerfile. Cloudflared is a fetch-only -# boot workload: its binary lands in /var/lib/easyenclave/bin (now on -# PATH) so dd-register can shell out to `cloudflared` by name. -# -# Agent-side mirror: a local TDX guest with a vfio-pci-passed GPU can -# register against the CP this script deploys by using the same -# easyenclave `github_release` workload source for the devopsdefender -# binary, with `DD_REGISTER_URL=wss://{hostname}/register`. See the -# local-GPU demo notes in the commit trail. -# -# Called by .github/workflows/{staging,production}-deploy.yml. Requires -# gcloud CLI authenticated via Workload Identity Federation. -# -# Required env vars (set by the workflow): -# GCP_PROJECT_ID — GCP project where the VM lives -# GCP_ZONE — GCP zone (e.g. us-central1-c) -# DD_ENV — staging, production, or pr-{num} (ephemeral per-PR) -# DD_DOMAIN — Public domain (e.g. devopsdefender.com) -# CLOUDFLARE_API_TOKEN — CF API token (dd-register uses it) -# CLOUDFLARE_ACCOUNT_ID — CF account ID -# CLOUDFLARE_ZONE_ID — CF zone ID -# -# Optional env vars: -# DD_HOSTNAME — public hostname override. If unset, derived -# from DD_ENV (production → app.$DOMAIN, -# anything else → app-staging.$DOMAIN). Set -# explicitly for per-PR envs (pr-42.$DOMAIN). -# DD_GITHUB_CLIENT_ID — GitHub OAuth client ID. If unset, dd-web -# disables OAuth login and only PAT auth works. -# Per-PR envs leave this unset. 
-# DD_GITHUB_CLIENT_SECRET — GitHub OAuth client secret (paired with above) -# DD_GITHUB_CALLBACK_URL — OAuth callback, default https://{hostname}/auth/github/callback -# EE_IMAGE_FAMILY — easyenclave GCP image family -# EE_IMAGE_PROJECT — project hosting the image -# DD_RELEASE_TAG — GitHub release tag on devopsdefender/dd -# (defaults to 'latest'; PRs override with pr-{sha12}) -# VM_MACHINE_TYPE — default c3-standard-4 -# VM_DISK_SIZE — default 10GB - -set -euo pipefail - -# ── easyenclave image family ────────────────────────────────────────────── -# easyenclave-staging → rolling main, rotates on every push (5 kept) -# easyenclave-stable → v* tags, kept forever -EE_IMAGE_FAMILY="${EE_IMAGE_FAMILY:-easyenclave-staging}" -EE_IMAGE_PROJECT="${EE_IMAGE_PROJECT:-easyenclave}" -DD_RELEASE_TAG="${DD_RELEASE_TAG:-latest}" - -VM_NAME="dd-${DD_ENV}-$(date +%s)" -VM_MACHINE_TYPE="${VM_MACHINE_TYPE:-c3-standard-4}" -VM_DISK_SIZE="${VM_DISK_SIZE:-10GB}" - -if [ -z "${DD_HOSTNAME:-}" ]; then - if [ "${DD_ENV}" = "production" ]; then - DD_HOSTNAME="app.${DD_DOMAIN}" - else - DD_HOSTNAME="app-staging.${DD_DOMAIN}" - fi -fi -DD_GITHUB_CLIENT_ID="${DD_GITHUB_CLIENT_ID:-}" -DD_GITHUB_CLIENT_SECRET="${DD_GITHUB_CLIENT_SECRET:-}" -DD_GITHUB_CALLBACK_URL="${DD_GITHUB_CALLBACK_URL:-https://${DD_HOSTNAME}/auth/github/callback}" - -# Intel Trust Authority — mandatory. DD_ITA_API_KEY must be set in the -# workflow (from secrets.DD_ITA_API_KEY). The CP will refuse to start -# without one. Everything else has a default. -if [ -z "${DD_ITA_API_KEY:-}" ]; then - echo "DD_ITA_API_KEY is required (configure secrets.DD_ITA_API_KEY)" >&2 - exit 1 -fi -DD_ITA_BASE_URL="${DD_ITA_BASE_URL:-https://api.trustauthority.intel.com}" -DD_ITA_JWKS_URL="${DD_ITA_JWKS_URL:-https://portal.trustauthority.intel.com/certs}" -DD_ITA_ISSUER="${DD_ITA_ISSUER:-https://portal.trustauthority.intel.com}" - -# ── Build the workload spec ────────────────────────────────────────────── -# Two boot workloads: -# 1. 
cloudflared — fetch-only. easyenclave downloads cloudflare's -# static binary from their GitHub release, symlinks it as -# `cloudflared`, and exits the deploy as "completed". The binary -# sits on PATH for dd-register to spawn. -# 2. dd-management — fetches the devopsdefender binary from our own -# release and runs it. dd-register + dd-web both live in this -# single process (DD_MODE=management). -EE_BOOT_WORKLOADS=$(jq -c -n \ - --arg dd_tag "$DD_RELEASE_TAG" \ - --arg cf_token "$CLOUDFLARE_API_TOKEN" \ - --arg cf_account "$CLOUDFLARE_ACCOUNT_ID" \ - --arg cf_zone "$CLOUDFLARE_ZONE_ID" \ - --arg domain "$DD_DOMAIN" \ - --arg hostname "$DD_HOSTNAME" \ - --arg env "$DD_ENV" \ - --arg gh_client_id "$DD_GITHUB_CLIENT_ID" \ - --arg gh_client_secret "$DD_GITHUB_CLIENT_SECRET" \ - --arg gh_callback "$DD_GITHUB_CALLBACK_URL" \ - --arg ita_api_key "$DD_ITA_API_KEY" \ - --arg ita_base_url "$DD_ITA_BASE_URL" \ - --arg ita_jwks_url "$DD_ITA_JWKS_URL" \ - --arg ita_issuer "$DD_ITA_ISSUER" \ - '[ - { - "github_release": { - "repo": "cloudflare/cloudflared", - "asset": "cloudflared-linux-amd64", - "rename": "cloudflared" - }, - "app_name": "cloudflared" - }, - { - "github_release": { - "repo": "devopsdefender/dd", - "asset": "devopsdefender", - "tag": $dd_tag - }, - "cmd": ["devopsdefender"], - "app_name": "dd-management", - "env": ( - [ - "DD_MODE=management", - ("DD_CF_API_TOKEN=" + $cf_token), - ("DD_CF_ACCOUNT_ID=" + $cf_account), - ("DD_CF_ZONE_ID=" + $cf_zone), - ("DD_CF_DOMAIN=" + $domain), - ("DD_HOSTNAME=" + $hostname), - ("DD_ENV=" + $env), - "DD_OWNER=devopsdefender", - "DD_REGISTER_PORT=8081", - "DD_OIDC_AUDIENCE=dd-web", - "DD_PORT=8080" - ] - + (if $gh_client_id == "" then [] else [ - ("DD_GITHUB_CLIENT_ID=" + $gh_client_id), - ("DD_GITHUB_CLIENT_SECRET=" + $gh_client_secret), - ("DD_GITHUB_CALLBACK_URL=" + $gh_callback) - ] end) - + [ - ("DD_ITA_API_KEY=" + $ita_api_key), - ("DD_ITA_BASE_URL=" + $ita_base_url), - ("DD_ITA_JWKS_URL=" + $ita_jwks_url), - 
("DD_ITA_ISSUER=" + $ita_issuer) - ] - ) - } - ]') - -# ── Wrap into ee-config ─────────────────────────────────────────────────── -jq -c -n \ - --arg workloads "$EE_BOOT_WORKLOADS" \ - '{ "EE_BOOT_WORKLOADS": $workloads, "EE_OWNER": "devopsdefender" }' \ - > /tmp/ee-config.json - -trap 'rm -f /tmp/ee-config.json' EXIT - -# ── Create the VM ───────────────────────────────────────────────────────── -gcloud compute instances create "$VM_NAME" \ - --project="$GCP_PROJECT_ID" \ - --zone="$GCP_ZONE" \ - --machine-type="$VM_MACHINE_TYPE" \ - --confidential-compute-type=TDX \ - --maintenance-policy=TERMINATE \ - --boot-disk-size="$VM_DISK_SIZE" \ - --image-family="$EE_IMAGE_FAMILY" \ - --image-project="$EE_IMAGE_PROJECT" \ - --metadata-from-file=ee-config=/tmp/ee-config.json \ - --labels=devopsdefender=managed,dd_env="${DD_ENV}" \ - --tags=dd-management - -echo "VM: $VM_NAME" -echo " image: family $EE_IMAGE_FAMILY ($EE_IMAGE_PROJECT)" -echo " hostname: $DD_HOSTNAME" -echo " dd release: $DD_RELEASE_TAG" -echo " workload: dd management" diff --git a/scripts/ollama-deploy.sh b/scripts/ollama-deploy.sh deleted file mode 100755 index 4d4babf..0000000 --- a/scripts/ollama-deploy.sh +++ /dev/null @@ -1,327 +0,0 @@ -#!/usr/bin/env bash -# ollama-deploy.sh — run ollama + OpenClaw inside a DD agent VM as -# podman containers. No ollama binary on the guest rootfs (that's -# dynamically linked and fails on EE's busybox rootfs with -# `libstdc++.so.6: cannot open shared object file`). Instead: -# -# 1. Fetch static podman (mgoltzsche/podman-static tarball) as a -# fetch-only DD workload. -# 2. One-shot bootstrap via /exec — flatten the tarball's nested -# bin dir into /var/lib/easyenclave/bin and write a minimal -# /etc/containers/containers.conf (cgroup_manager=cgroupfs so -# we don't need systemd). -# 3. Deploy the ollama container as a long-running workload -# (podman run --net=host ...). Prod also passes the three -# nvidia device nodes for H100 access. -# 4. 
Pull the right-sized model via `podman exec ollama ollama pull`. -# 5. Launch OpenClaw (a bridge from messaging apps to coding -# agents; subcommand of ollama, npm-installed on first run) as -# a second long-running workload using the same container. -# -# ollama-deploy.sh -# kind: prod | preview -# cp_url: https://app.devopsdefender.com | https://pr-N.devopsdefender.com -# -# Requires DD_PAT in the environment (the workflow's GITHUB_TOKEN). - -set -euo pipefail - -KIND="${1?usage: ollama-deploy.sh }" -CP_URL="${2?cp_url required}" -: "${DD_PAT?}" - -case "$KIND" in - prod) - MODEL="llama3.1:8b" - # GPU passthrough. /dev/nvidia-uvm appears once CUDA is touched; - # the nv-insmod boot workload in scripts/local-agents.sh loads - # the kernel module, so the device nodes exist by this point. - GPU_FLAGS='["--device=/dev/nvidia0","--device=/dev/nvidiactl","--device=/dev/nvidia-uvm"]' - ;; - preview) - MODEL="qwen2.5:0.5b" - GPU_FLAGS='[]' - ;; - *) echo "unknown kind: $KIND" >&2; exit 2 ;; -esac - -VM_NAME="dd-local-$KIND" -AUTH=(-H "Authorization: Bearer $DD_PAT") - -echo "== ollama-deploy $VM_NAME (model=$MODEL, cp=$CP_URL) ==" - -# ── 1. Discover the fresh agent registration on the CP ───────────── -# last_seen > started_at_iso filters out stale entries from the VM -# generation we just destroyed during `virsh destroy`. 
-started_at_iso="$(date -u +%Y-%m-%dT%H:%M:%SZ)" -echo " waiting for a fresh ${VM_NAME} registration (last_seen > ${started_at_iso})" -agent_host="" -for i in $(seq 1 60); do - agent_host=$(curl -fsS "${AUTH[@]}" "$CP_URL/api/agents" 2>/dev/null \ - | jq -r --arg vm "$VM_NAME" --arg since "$started_at_iso" ' - [.[] | select(.vm_name==$vm and .status=="healthy" and .last_seen > $since)] - | sort_by(.last_seen) | reverse | .[0].hostname // empty' 2>/dev/null || true) - if [ -n "$agent_host" ] && [ "$agent_host" != "null" ]; then - break - fi - sleep 10 -done -if [ -z "$agent_host" ] || [ "$agent_host" = "null" ]; then - echo "ERROR: $VM_NAME never appeared in CP fleet" >&2 - exit 1 -fi -echo " agent: https://$agent_host" - -# ── 2. Wait for Cloudflare DNS to propagate ──────────────────────── -echo " waiting for DNS on $agent_host..." -for i in $(seq 1 30); do - if getent hosts "$agent_host" >/dev/null 2>&1; then - echo " DNS resolved" - break - fi - sleep 5 -done - -agent() { curl -fsS --max-time 300 "${AUTH[@]}" "https://$agent_host$1" "${@:2}"; } - -# ── 3. Fetch podman-static (fetch-only DD workload) ──────────────── -# Tarball unpacks to /var/lib/easyenclave/bin/podman-linux-amd64/ -# with usr/local/bin/{podman,crun,conmon,netavark,...}. -# NOTE: omit `tag` — EE treats `tag: null` as "GET /releases/latest" -# (the real newest release), while `tag: "latest"` is a literal tag -# lookup and 404s on repos like mgoltzsche/podman-static that version -# their tags as v5.7.1 rather than with a rolling "latest" ref. -echo " POST /deploy podman-static..." -A_SPEC=$(jq -c -n '{ - app_name: "podman-static", - github_release: { - repo: "mgoltzsche/podman-static", - asset: "podman-linux-amd64.tar.gz" - } -}') -agent /deploy -H 'Content-Type: application/json' -d "$A_SPEC" | jq -c '.' || true - -echo " waiting for podman binary to appear..." 
-podman_path="/var/lib/easyenclave/bin/podman-linux-amd64/usr/local/bin/podman" -for i in $(seq 1 60); do - resp=$(agent /exec -H 'Content-Type: application/json' \ - -d "$(jq -c -n --arg p "$podman_path" '{cmd:["/bin/busybox","sh","-c",("test -x " + $p + " && echo found")],timeout_secs:5}')" \ - 2>/dev/null || true) - if echo "$resp" | grep -q found; then - echo " podman unpacked" - break - fi - sleep 5 -done - -# ── 4. Bootstrap: stage podman's helper binaries ─────────────────── -# mgoltzsche's tarball layout: -# usr/local/bin/ podman, crun, runc, fuse-overlayfs, -# fusermount3, pasta, pasta.avx2 -# usr/local/lib/podman/ conmon, netavark, aardvark-dns, -# rootlessport, catatonit -# EE's guest rootfs has BOTH /usr AND /etc mounted read-only. The -# only writable paths are under /var/lib/easyenclave (on the -# persistent vdc ext4 disk) and /run/tmp-style tmpfs locations. So -# we cannot write a containers.conf anywhere podman looks for one, -# and we cannot cp conmon into any of podman's hardcoded search -# dirs. Every path has to be on the podman CLI directly. -# -# We DO stage the helpers into /var/lib/easyenclave/bin so the -# container workload's `cmd[0]` can reach `podman`, and the -# --conmon / --runtime / --root / --runroot flags on the `podman` -# command (see step 5) point podman at the rest. -echo " bootstrapping podman (staging binaries to writable dirs)..." 
-bootstrap_sh='set -e -BIN=/var/lib/easyenclave/bin -SRC=$BIN/podman-linux-amd64 -cp -f $SRC/usr/local/bin/* $BIN/ -cp -f $SRC/usr/local/lib/podman/conmon $BIN/ -cp -f $SRC/usr/local/lib/podman/netavark $BIN/ 2>/dev/null || true -cp -f $SRC/usr/local/lib/podman/aardvark-dns $BIN/ 2>/dev/null || true -cp -f $SRC/usr/local/lib/podman/rootlessport $BIN/ 2>/dev/null || true -mkdir -p /var/lib/easyenclave/containers/storage /var/lib/easyenclave/containers/runroot -echo podman-bootstrap: ok' -boot_resp=$(agent /exec -H 'Content-Type: application/json' \ - -d "$(jq -c -n --arg s "$bootstrap_sh" '{cmd:["/bin/busybox","sh","-c",$s],timeout_secs:30}')") -if ! echo "$boot_resp" | jq -e '.exit_code == 0' >/dev/null 2>&1; then - echo "ERROR: podman bootstrap failed" - echo "$boot_resp" | jq . - exit 1 -fi -echo " bootstrap: $(echo "$boot_resp" | jq -r '.stdout // ""' | tail -1)" - -# ── 5. Launch the ollama container (long-running workload) ───────── -# --net=host : ollama listens on guest's 127.0.0.1:11434. -# --name : so we can `podman exec ollama ...` by name. -# --cgroup-manager=cgroupfs: matches containers.conf, still required -# on the command line because podman doesn't always -# pick it up from the engine section when invoked -# outside systemd. -# Volume : /var/lib/easyenclave/ollama is the persistent vdc -# ext4 disk (mounted by the mount-models boot workload -# in local-agents.sh); doubles as ollama's model cache -# and openclaw's npm prefix. -echo " POST /deploy ollama container..." -# Every writable path (--root, --runroot, --conmon, --runtime) is -# on the CLI because EE's /etc and /usr are read-only — podman -# can't fall back on /etc/containers/containers.conf the way it -# normally does. Storage lives on the persistent vdc disk so the -# 900 MB ollama image pull survives VM relaunches. -# --cgroup-manager=cgroupfs because there's no systemd in the guest. 
-# --network=host so ollama's :11434 binds on the VM's loopback, -# reachable from other EE workloads (like openclaw) and via /exec. -OLLAMA_SPEC=$(jq -c -n --argjson gpu "$GPU_FLAGS" '{ - app_name: "ollama", - cmd: ([ - "/var/lib/easyenclave/bin/podman", - "--conmon=/var/lib/easyenclave/bin/conmon", - "--runtime=/var/lib/easyenclave/bin/crun", - "--root=/var/lib/easyenclave/containers/storage", - "--runroot=/var/lib/easyenclave/containers/runroot", - "--cgroup-manager=cgroupfs", - "run", - "--rm", "--name", "ollama", - "--network=host" - ] + $gpu + [ - "-v", "/var/lib/easyenclave/ollama:/root/.ollama", - "-e", "OLLAMA_HOST=127.0.0.1:11434", - "docker.io/ollama/ollama:latest", - "serve" - ]) -}') -agent /deploy -H 'Content-Type: application/json' -d "$OLLAMA_SPEC" | jq -c '.' || true - -# ── 6. Wait for ollama HTTP to come up inside the container ──────── -# `podman exec ollama ollama list` exits 0 once the server is ready. -# First run has to pull ~900 MB of container image, so allow plenty. -echo " waiting for ollama to be ready (first run pulls the image)..." -ollama_ready=0 -for i in $(seq 1 120); do - resp=$(agent /exec -H 'Content-Type: application/json' \ - -d '{"cmd":["/var/lib/easyenclave/bin/podman","--root=/var/lib/easyenclave/containers/storage","--runroot=/var/lib/easyenclave/containers/runroot","--cgroup-manager=cgroupfs","exec","ollama","ollama","list"],"timeout_secs":15}' \ - 2>/dev/null || true) - if echo "$resp" | jq -e '.exit_code == 0' >/dev/null 2>&1; then - echo " ollama responding" - ollama_ready=1 - break - fi - sleep 10 -done -if [ "$ollama_ready" = "0" ]; then - echo "ERROR: ollama container never became ready (20 min timeout)" - echo " most recent /exec response:" - echo "$resp" | jq . 
- echo " last 30 lines of 'podman ps -a' + 'podman logs ollama':" - agent /exec -H 'Content-Type: application/json' \ - -d '{"cmd":["/var/lib/easyenclave/bin/podman","--root=/var/lib/easyenclave/containers/storage","--runroot=/var/lib/easyenclave/containers/runroot","ps","-a"],"timeout_secs":10}' | jq -r '.stdout // .stderr // ""' - agent /exec -H 'Content-Type: application/json' \ - -d '{"cmd":["/var/lib/easyenclave/bin/podman","--root=/var/lib/easyenclave/containers/storage","--runroot=/var/lib/easyenclave/containers/runroot","logs","ollama"],"timeout_secs":10}' 2>&1 | jq -r '.stdout // .stderr // ""' | tail -30 - exit 1 -fi - -# ── 7. Pull the model ────────────────────────────────────────────── -echo " pulling $MODEL (this can take a few minutes)..." -pull_resp=$(agent /exec -H 'Content-Type: application/json' \ - -d "$(jq -c -n --arg m "$MODEL" '{ - cmd:["/var/lib/easyenclave/bin/podman","--root=/var/lib/easyenclave/containers/storage","--runroot=/var/lib/easyenclave/containers/runroot","--cgroup-manager=cgroupfs","exec","ollama","ollama","pull",$m], - timeout_secs:1800 - }')") -if ! echo "$pull_resp" | jq -e '.exit_code == 0' >/dev/null 2>&1; then - echo "ERROR: ollama pull $MODEL failed" - echo "$pull_resp" | jq . - exit 1 -fi -echo " pull: $(echo "$pull_resp" | jq -r '.stdout // "(no stdout)"' | tail -3)" - -# ── 8. Launch OpenClaw ───────────────────────────────────────────── -# `ollama launch openclaw` installs via npm on first run if missing -# and then stays foreground, so we register it as a second long- -# running workload. --yes accepts the install prompt non-interactively. -echo " POST /deploy openclaw..." 
-OPENCLAW_SPEC=$(jq -c -n --arg m "$MODEL" '{ - app_name: "openclaw", - cmd: [ - "/var/lib/easyenclave/bin/podman", - "--root=/var/lib/easyenclave/containers/storage", - "--runroot=/var/lib/easyenclave/containers/runroot", - "--cgroup-manager=cgroupfs", - "exec", "ollama", - "ollama", "launch", "openclaw", - "--model", $m, - "--yes" - ] -}') -agent /deploy -H 'Content-Type: application/json' -d "$OPENCLAW_SPEC" | jq -c '.' || true - -# ── 9. Confirm openclaw is up ─ three probes, weakest → strongest ── -# (a) EE lists `openclaw` in /health — proves the workload was -# accepted by the in-VM runtime. Flips green on fork, before -# npm-install finishes, so on its own it's weak. -# (b) GET http://127.0.0.1:18789/healthz (the OpenClaw gateway HTTP -# endpoint). Docs: https://docs.openclaw.ai/gateway/health. -# 200 with valid JSON = gateway has bound its port and is -# serving. The ollama container runs with --net=host so the -# loopback is the VM's loopback; we curl through `podman exec` -# so we hit the in-container curl (EE's busybox lacks one). -# (c) `openclaw agent --message "ping"` — the documented one-shot -# CLI. Goes through the running gateway, hands the prompt to -# the loaded model, returns the assistant reply. Exit 0 AND -# non-empty stdout = the full ollama → openclaw → model path -# works end-to-end. The reply gets echoed into the workflow -# log as proof of life. -echo " confirming openclaw workload is registered with EE..." -for i in $(seq 1 30); do - list=$(agent /health 2>/dev/null || true) - if echo "$list" | jq -e '.deployments // [] | index("openclaw")' >/dev/null 2>&1; then - echo " openclaw: registered" - break - fi - sleep 5 -done - -echo " waiting for openclaw gateway on http://127.0.0.1:18789/healthz..." 
-openclaw_live=0 -for i in $(seq 1 60); do - resp=$(agent /exec -H 'Content-Type: application/json' \ - -d '{"cmd":["/var/lib/easyenclave/bin/podman","--root=/var/lib/easyenclave/containers/storage","--runroot=/var/lib/easyenclave/containers/runroot","--cgroup-manager=cgroupfs","exec","ollama","curl","-fsS","http://127.0.0.1:18789/healthz"],"timeout_secs":10}' \ - 2>/dev/null || true) - if echo "$resp" | jq -e '.exit_code == 0' >/dev/null 2>&1; then - echo " openclaw: /healthz 200" - echo "$resp" | jq -r '.stdout // ""' | head -c 200 | sed 's/^/ /' - echo - openclaw_live=1 - break - fi - sleep 5 -done - -if [ "$openclaw_live" != "1" ]; then - echo "ERROR: openclaw /healthz never returned 200 (gateway didn't come up within 5 min)" - echo " last /exec response:" - echo "$resp" | jq -c '.' | head -c 500 - exit 1 -fi - -echo " sending a round-trip prompt: 'ping'" -chat=$(agent /exec -H 'Content-Type: application/json' \ - -d '{"cmd":["/var/lib/easyenclave/bin/podman","--root=/var/lib/easyenclave/containers/storage","--runroot=/var/lib/easyenclave/containers/runroot","--cgroup-manager=cgroupfs","exec","ollama","openclaw","agent","--message","ping","--thinking","low"],"timeout_secs":120}' \ - 2>/dev/null || true) -reply=$(echo "$chat" | jq -r '.stdout // ""') -if [ -z "$reply" ] || ! echo "$chat" | jq -e '.exit_code == 0' >/dev/null 2>&1; then - echo "ERROR: openclaw agent --message didn't return a reply" - echo " raw: $(echo "$chat" | jq -c '.' 
| head -c 500)" - exit 1 -fi -echo -echo "=== openclaw replied ===" -echo "$reply" -echo "========================" - -echo -echo "=== agent fleet summary ===" -echo " agent: https://$agent_host" -echo " model: $MODEL" -echo " ollama: podman container 'ollama' on host net, :11434" -echo " openclaw: http://127.0.0.1:18789 (gateway), replied to round-trip ping" -echo "===========================" From 2f7194142238d428451f110f4ed4faf09a58cf67 Mon Sep 17 00:00:00 2001 From: Alex Newman Date: Sat, 18 Apr 2026 20:24:49 +0000 Subject: [PATCH 2/2] feat(apps): wire ollama+openclaw into dd-local-{preview,prod} + apps/README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Agent VMs boot the full container stack now — every PR push and every main merge exercises podman + ollama + openclaw end-to-end. Preview runs CPU inference with qwen2.5:0.5b; prod runs GPU inference with qwen2.5:7b on the H100. Changes: - apps/_infra/local-agents.sh: replace the inline `jq -c -n` workload literals with a `bake()` helper that reads from apps//workload .{json,json.tmpl}. The workload set grows from {nv, mount-models, cloudflared, dd-agent} to {nv (prod only), mount-models, podman-static, podman-bootstrap, ollama.{prod,preview}.json, openclaw, cloudflared, dd-agent}. - apps/podman-bootstrap: install the wrapper as `podman` (not `dd-podman`) so bare `podman ps` from PATH reaches the right storage root instead of erroring with `mkdir /var/lib/containers: read-only file system`. Raw binary moves to .podman-raw; dd-podman becomes a symlink for back-compat. - deploy-cp.yml + local-agents.sh bake() now pass envsubst a restricted var list — only the uppercase `${VAR}` references the template actually declares. Lowercase shell locals ($i, $((…))) inside openclaw's `until` loop are no longer eaten. - apps/README.md: canonical reference for the workload spec, lifecycle matrix (CP / preview agent / prod agent), and a "deploying your own" walkthrough. 
Main README.md points at it. Co-Authored-By: Claude Opus 4.7 (1M context) --- .github/workflows/deploy-cp.yml | 13 +++- README.md | 17 +---- apps/README.md | 108 +++++++++++++++++++++++++++ apps/_infra/local-agents.sh | 111 +++++++++++++++++----------- apps/podman-bootstrap/workload.json | 2 +- 5 files changed, 188 insertions(+), 63 deletions(-) create mode 100644 apps/README.md diff --git a/.github/workflows/deploy-cp.yml b/.github/workflows/deploy-cp.yml index d21265c..772d750 100644 --- a/.github/workflows/deploy-cp.yml +++ b/.github/workflows/deploy-cp.yml @@ -116,13 +116,18 @@ jobs: : "${DD_ITA_API_KEY:?set DD_ITA_API_KEY via secrets.DD_ITA_API_KEY}" export DD_GITHUB_CALLBACK_URL="${DD_GITHUB_CALLBACK_URL:-https://${DD_HOSTNAME}/auth/github/callback}" - # Bake a workload template: envsubst ${VAR} placeholders and - # strip any "KEY=" env entries that ended up with empty values - # (e.g. OAuth creds in non-prod envs). + # Bake a workload template: substitute ${VAR} placeholders + # and strip "KEY=" env entries that ended up with empty values + # (e.g. OAuth creds in non-prod envs). envsubst is restricted + # to the uppercase ${VAR} refs the template actually declares + # so shell locals inside cmd strings ($i, $((…)), etc.) + # aren't eaten. bake() { case "$1" in *.json.tmpl) - envsubst < "$1" \ + local vars + vars=$(grep -oE '\$\{[A-Z_][A-Z0-9_]*\}' "$1" | sort -u | tr -d '\n') + envsubst "$vars" < "$1" \ | jq -c 'if .env then .env |= map(select(test("^[^=]+=.+"))) else . end' ;; *.json) diff --git a/README.md b/README.md index f8a1df1..575ec1f 100644 --- a/README.md +++ b/README.md @@ -22,22 +22,7 @@ The sealed enclave runtime is [EasyEnclave](https://github.com/easyenclave/easye Every fleet VM boots from a sealed easyenclave image published by [easyenclave/easyenclave](https://github.com/easyenclave/easyenclave/releases). No cloud-init, no stock Ubuntu, no runtime `apt-get install`. 
The TDX VM's rootfs is the latest image in the `easyenclave-staging` (or `-stable`) family, attestable against a single UKI SHA256. -The `devopsdefender` binary ships as a **GitHub release asset** — not an OCI image. Easyenclave fetches it directly via its `github_release` boot workload source: - -```json -{ - "github_release": { - "repo": "devopsdefender/dd", - "asset": "devopsdefender", - "tag": "latest" - }, - "cmd": ["devopsdefender"], - "app_name": "dd-management", - "env": ["DD_MODE=management", ...] -} -``` - -`cloudflared` is also pulled directly from `cloudflare/cloudflared`'s GitHub releases as a fetch-only boot workload — no bundling in our image, no Dockerfile step. +Every workload is a JSON spec consumed by easyenclave's `DeployRequest`. Boot-time and runtime-deployed workloads share one schema; both the `devopsdefender` binary and `cloudflared` ship as **GitHub release assets** — not OCI images — and easyenclave fetches them via its `github_release` source. The full set of specs and a guide to writing your own live in [`apps/README.md`](apps/README.md). Per-VM configuration (CF credentials, GitHub OAuth, the workload spec itself) is passed to easyenclave at boot via **GCE instance metadata** (`ee-config` attribute), read by `easyenclave::init::fetch_gce_metadata_config()` and applied as env vars. The CP-deploy step in `.github/workflows/deploy-cp.yml` builds the spec and invokes `gcloud compute instances create --image-family=easyenclave-staging --metadata-from-file=ee-config=...`. diff --git a/apps/README.md b/apps/README.md new file mode 100644 index 0000000..12d98d0 --- /dev/null +++ b/apps/README.md @@ -0,0 +1,108 @@ +# apps/ — workload specs + +This directory is DD's canonical reference for **how to deploy a workload**. Every directory here except `_infra/` is one workload: a process easyenclave runs inside a TDX-sealed VM. The specs are both the live deployment configuration and the worked example for operators writing their own.
+ +## Layout + +``` +apps/ + / + workload.json # literal spec + workload.json.tmpl # spec with ${VAR} placeholders (baked at deploy time) + _infra/ # host-side scripts; not a deployable workload +``` + +## What a workload looks like + +A **workload** is a JSON object consumed by easyenclave's `DeployRequest` (see `src/easyenclave/src/workload.rs`). Minimum shape: + +```json +{ + "app_name": "myapp", + "cmd": ["/bin/busybox", "sh", "-c", "echo hello; sleep inf"] +} +``` + +Add `github_release` to fetch a binary asset directly from a GitHub release — no OCI registry, no Dockerfile. The asset lands in `/var/lib/easyenclave/bin/` and is spawned by `cmd`: + +```json +{ + "app_name": "cloudflared", + "github_release": { + "repo": "cloudflare/cloudflared", + "asset": "cloudflared-linux-amd64", + "rename": "cloudflared" + } +} +``` + +Add `env` to inject config: + +```json +{ + "env": ["MY_ENDPOINT=https://api.example.com", "DEBUG=1"] +} +``` + +## Templates + +Files ending in `.json.tmpl` carry `${VAR}` placeholders. At bake time: + +1. `envsubst` substitutes every uppercase `${VAR}` that appears in the template using the caller's environment. +2. `jq` drops env-array entries whose value ended up empty (so you can make OAuth creds / optional secrets conditional by just leaving them unset). +3. The result is a plain `workload.json` ready for EE. + +Only uppercase placeholders get substituted — shell locals like `$i` or `$((n+1))` inside `cmd` strings are left alone. 
The bake helper is duplicated inline in two places so both lifecycle points behave identically: + +- `.github/workflows/deploy-cp.yml` (CI, for CP workloads) +- `apps/_infra/local-agents.sh` (tdx2 host, for agent VMs) + +## Where each workload runs + +| workload | CP VM | agent VM (preview) | agent VM (prod) | +|---|---|---|---| +| `cloudflared` | ✅ | ✅ | ✅ | +| `dd-management` | ✅ | | | +| `dd-agent` | | ✅ | ✅ | +| `mount-models` | | ✅ | ✅ | +| `nv` | | | ✅ (GPU insmod) | +| `podman-static` | | ✅ | ✅ | +| `podman-bootstrap` | | ✅ | ✅ | +| `ollama` | | ✅ (CPU, preview.json) | ✅ (GPU, prod.json) | +| `openclaw` | | ✅ (qwen2.5:0.5b) | ✅ (qwen2.5:7b) | + +CP stays slim: just `cloudflared` + `dd-management`. Containerised LLM serving lives on agent VMs where the `vdc` ext4 disk holds models + image storage. + +## Ordering + +EasyEnclave spawns boot workloads concurrently — there's no declared dependency graph. Dependents self-sequence by polling for their prerequisites. Worked example from this tree: + +- `podman-bootstrap` waits for `podman-static`'s tarball (`until [ -x $SRC/usr/local/bin/podman ]; do sleep 1; done`). +- `ollama`'s cmd waits for the wrapper (`until [ -x /var/lib/easyenclave/bin/podman ]; do sleep 2; done`). +- `openclaw`'s cmd waits for ollama's HTTP endpoint (`until wget -q -O- http://127.0.0.1:11434/api/tags; do sleep 5; done`) before pulling the model and launching the gateway. + +Costs seconds of wasted polling at boot; easy to reason about; no workload-runner changes needed. + +## Deploying your own + +1. Copy an existing folder as a starting point: + ``` + cp -r apps/cloudflared apps/myapp + $EDITOR apps/myapp/workload.json + ``` +2. Decide where it runs: + - **CP VM**: add a `bake apps/myapp/workload.json` line to the workload-building `run:` step in `.github/workflows/deploy-cp.yml`. + - **Agent VM**: add the same call to `apps/_infra/local-agents.sh` in `build_config_iso()`. 
+ - **Ad-hoc, runtime-only**: POST the baked JSON to `/deploy` on a running agent: + ``` + curl -H "Authorization: Bearer $DD_PAT" \ + -H "Content-Type: application/json" \ + -d @apps/myapp/workload.json \ + https:///deploy + ``` + +## Reference + +- Schema source of truth: [`src/easyenclave/src/workload.rs`](../src/easyenclave/src/workload.rs) — the `DeployRequest` struct EE deserializes on `/deploy`. +- CP deploy caller: [`.github/workflows/deploy-cp.yml`](../.github/workflows/deploy-cp.yml) — inline `bake()` + CP workload set. +- Agent VM builder: [`apps/_infra/local-agents.sh`](_infra/local-agents.sh) — inline `bake()` + agent workload set per kind. diff --git a/apps/_infra/local-agents.sh b/apps/_infra/local-agents.sh index 20b772a..9d879e0 100755 --- a/apps/_infra/local-agents.sh +++ b/apps/_infra/local-agents.sh @@ -31,10 +31,42 @@ fi : "${DD_PAT?set DD_PAT (e.g. DD_PAT=\$(gh auth token))}" : "${DD_ITA_API_KEY?set DD_ITA_API_KEY}" +# Resolve repo root regardless of invoking CWD — the workload specs +# under apps// need absolute paths so bake() can find them. +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" + IMG_DIR=/var/lib/libvirt/images BASE="$IMG_DIR/easyenclave-local.qcow2" BASE_DOMAIN="easyenclave-local" +# Render one workload spec. Matches the helper inlined in +# .github/workflows/deploy-cp.yml — same envsubst + empty-entry strip, +# so boot-time (config.iso) and runtime (/deploy) see identical JSON. +# +# envsubst is restricted to the ALL-CAPS `${VAR}` references that +# appear in the template itself. Lowercase `$i`, `${i}`, and bare +# `$((…))` arithmetic inside shell cmd strings are left alone — +# otherwise envsubst would eat shell locals in openclaw's `until` +# loop and produce broken scripts. 
+bake() { + case "$1" in + *.json.tmpl) + local vars + vars=$(grep -oE '\$\{[A-Z_][A-Z0-9_]*\}' "$1" | sort -u | tr -d '\n') + envsubst "$vars" < "$1" \ + | jq -c 'if .env then .env |= map(select(test("^[^=]+=.+"))) else . end' + ;; + *.json) + jq -c . "$1" + ;; + *) + echo "local-agents.sh: unknown workload file type: $1" >&2 + return 1 + ;; + esac +} + [ -r "$BASE" ] || { echo "missing $BASE" >&2; exit 1; } virsh dominfo "$BASE_DOMAIN" >/dev/null 2>&1 || { echo "base libvirt domain '$BASE_DOMAIN' not defined — rebuild the EE image first" >&2 @@ -58,51 +90,46 @@ build_config_iso() { tmp=$(mktemp -d) trap "rm -rf $tmp" RETURN - # EE reads `agent.env` from the config disk (dotenv: KEY=VALUE per - # line). EE_BOOT_WORKLOADS is a JSON-encoded array of workload - # specs. The first entry on the GPU VM insmods the nvidia driver - # so it's ready by the time the dd-agent comes up. - local nv_workload="null" + # Boot workload chain (EE spawns concurrently; each uses `until` + # loops to self-sequence): + # nv — insmod nvidia driver (prod only, first so the + # device nodes exist by the time ollama runs) + # mount-models — mount /dev/vdc at /var/lib/easyenclave/ollama + # podman-static — fetch the podman binary tarball into /var/lib/easyenclave/bin + # podman-bootstrap — stage binaries, write containers.conf + policy.json, + # install /var/lib/easyenclave/bin/podman as the wrapper + # (symlinked from dd-podman for back-compat) + # ollama — run docker.io/ollama/ollama:latest serve via the wrapper + # openclaw — wait for ollama, pull $MODEL, launch openclaw gateway + # cloudflared — fetch cloudflared binary (dd-register spawns it) + # dd-agent — run devopsdefender agent, register with CP, serve workloads + # + # Prod gets the GPU model; preview gets the tiny CPU-friendly one. 
+ local model ollama_spec if [ "$with_gpu" = "yes" ]; then - nv_workload=$(jq -c -n '{ - app_name:"nv", - cmd:["/bin/busybox","sh","-c", - "/sbin/insmod /lib/modules/7.0.0-14-generic/kernel/nvidia-580srv-open/nvidia.ko NVreg_OpenRmEnableUnsupportedGpus=1 2>&1 && echo nv: loaded || echo nv: failed; sleep inf"] - }') + model="qwen2.5:7b" + ollama_spec="$REPO_ROOT/apps/ollama/workload.prod.json" + else + model="qwen2.5:0.5b" + ollama_spec="$REPO_ROOT/apps/ollama/workload.preview.json" fi - # Mount the persistent models disk (vdc) at /var/lib/easyenclave/ollama - # before ollama might try to use it. Pre-formatted ext4 on the host. - local mount_workload - mount_workload=$(jq -c -n '{ - app_name:"mount-models", - cmd:["/bin/busybox","sh","-c", - "mkdir -p /var/lib/easyenclave/ollama && mount /dev/vdc /var/lib/easyenclave/ollama && echo mount-models: ok; sleep inf"] - }') - local workloads - workloads=$(jq -c -n \ - --argjson nv "$nv_workload" \ - --argjson mount "$mount_workload" \ - --arg cp "$cp" --arg pat "$DD_PAT" --arg ita "$DD_ITA_API_KEY" \ - --arg env "$env" --arg vm "dd-local-$name" '[ - $nv, - $mount, - {"app_name":"cloudflared", - "github_release":{"repo":"cloudflare/cloudflared","asset":"cloudflared-linux-amd64","rename":"cloudflared"}}, - {"app_name":"dd-agent", - "github_release":{"repo":"devopsdefender/dd","asset":"devopsdefender","tag":"latest"}, - "cmd":["devopsdefender","agent"], - "env":[ - "DD_MODE=agent", - ("DD_CP_URL=" + $cp), ("DD_PAT=" + $pat), ("DD_ITA_API_KEY=" + $ita), - "DD_ITA_BASE_URL=https://api.trustauthority.intel.com", - "DD_ITA_JWKS_URL=https://portal.trustauthority.intel.com/certs", - "DD_ITA_ISSUER=https://portal.trustauthority.intel.com", - "DD_OWNER=devopsdefender", ("DD_ENV=" + $env), ("DD_VM_NAME=" + $vm), - "DD_PORT=8080" - ]} - ] | map(select(. 
!= null))') + workloads=$({ + [ "$with_gpu" = "yes" ] && bake "$REPO_ROOT/apps/nv/workload.json" + bake "$REPO_ROOT/apps/mount-models/workload.json" + bake "$REPO_ROOT/apps/podman-static/workload.json" + bake "$REPO_ROOT/apps/podman-bootstrap/workload.json" + bake "$ollama_spec" + MODEL="$model" bake "$REPO_ROOT/apps/openclaw/workload.json.tmpl" + bake "$REPO_ROOT/apps/cloudflared/workload.json" + DD_CP_URL="$cp" \ + DD_PAT="$DD_PAT" \ + DD_ITA_API_KEY="$DD_ITA_API_KEY" \ + DD_ENV="$env" \ + DD_VM_NAME="dd-local-$name" \ + bake "$REPO_ROOT/apps/dd-agent/workload.json.tmpl" + } | jq -cs '.') { echo "EE_OWNER=devopsdefender" @@ -112,7 +139,7 @@ build_config_iso() { # ext4 — EE rootfs has no iso9660 module. truncate -s 4M "$out" mkfs.ext4 -q -d "$tmp" "$out" - echo " wrote $out (env=$env, gpu=$with_gpu)" + echo " wrote $out (env=$env, gpu=$with_gpu, model=$model)" } build_overlay() { diff --git a/apps/podman-bootstrap/workload.json b/apps/podman-bootstrap/workload.json index 5a797e4..690421b 100644 --- a/apps/podman-bootstrap/workload.json +++ b/apps/podman-bootstrap/workload.json @@ -2,6 +2,6 @@ "app_name": "podman-bootstrap", "cmd": [ "/bin/busybox", "sh", "-c", - "set -e\nBIN=/var/lib/easyenclave/bin\nSRC=$BIN/podman-linux-amd64\nuntil [ -x $SRC/usr/local/bin/podman ]; do sleep 1; done\n# If there's a vdc scratch disk, wait for mount-models to actually\n# mount it before we write files under /var/lib/easyenclave/ollama —\n# otherwise our writes land on the rootfs tmpfs and get shadowed the\n# moment vdc is mounted. 
On VMs without vdc (GCP CP preview) there's\n# no mount-models workload and this check short-circuits.\nif [ -b /dev/vdc ]; then\n until mountpoint -q /var/lib/easyenclave/ollama 2>/dev/null; do sleep 1; done\nfi\nmkdir -p /var/lib/easyenclave/ollama\ncp -f $SRC/usr/local/bin/* $BIN/\ncp -f $SRC/usr/local/lib/podman/conmon $BIN/\ncp -f $SRC/usr/local/lib/podman/netavark $BIN/ 2>/dev/null || true\ncp -f $SRC/usr/local/lib/podman/aardvark-dns $BIN/ 2>/dev/null || true\ncp -f $SRC/usr/local/lib/podman/rootlessport $BIN/ 2>/dev/null || true\nmkdir -p /var/lib/easyenclave/ollama/.podman/storage /var/lib/easyenclave/ollama/.podman/runroot\n# /dev/shm is where podman puts its per-container POSIX shm lock\n# file (libpod_lock). EE's guest rootfs may not mount tmpfs on\n# /dev/shm; without it, podman fails 'failed to create 2048 locks\n# in /libpod_lock: no such file or directory'. mkdir + mount idempotently.\nif ! mountpoint -q /dev/shm 2>/dev/null; then\n mkdir -p /dev/shm\n mount -t tmpfs -o size=64M tmpfs /dev/shm 2>/dev/null || true\nfi\n# Pick storage driver based on substrate. vdc-backed ext4 supports\n# native overlay (fast + space-efficient). Without vdc (GCP CP\n# preview, any guest running on tmpfs rootfs), overlay-on-tmpfs\n# errors out, so fall back to vfs (slower, full copy per layer, but\n# works on any filesystem).\nif mountpoint -q /var/lib/easyenclave/ollama; then\n DRIVER=overlay\nelse\n DRIVER=vfs\nfi\n# Write containers.conf on vdc (writable). /etc is RO on EE so we\n# can't put it where podman looks by default. helper_binaries_dir\n# tells podman where we staged conmon/netavark/aardvark-dns/… —\n# podman probes those at startup even with --network=host.\nPOL=/var/lib/easyenclave/ollama/.podman/policy.json\n# Minimum viable signature policy: trust anything. 
EE's attestation\n# story happens one layer up (image digest pinned by the spec we\n# baked); podman's own signature checking would duplicate that.\nprintf '%s' '{\"default\":[{\"type\":\"insecureAcceptAnything\"}]}' > $POL\n# Podman's containers-common looks for policy.json at hardcoded\n# paths (/etc/containers/, $HOME/.config/containers/). /etc and\n# /root are both RO on EE, so build a fake HOME under\n# /var/lib/easyenclave/.home (writable) and set HOME there in the\n# dd-podman wrapper.\nHOME_DIR=/var/lib/easyenclave/.home\nmkdir -p $HOME_DIR/.config/containers\ncp -f $POL $HOME_DIR/.config/containers/policy.json\nCONF=/var/lib/easyenclave/ollama/.podman/containers.conf\nprintf '%s\\n' '[engine]' 'helper_binaries_dir = [\"/var/lib/easyenclave/bin\"]' > $CONF\nmkdir -p $HOME_DIR/tmp\nprintf '%s\\n' '#!/bin/sh' \"export HOME=$HOME_DIR\" \"export TMPDIR=$HOME_DIR/tmp\" \"export CONTAINERS_CONF=$CONF\" \"exec /var/lib/easyenclave/bin/podman --conmon=/var/lib/easyenclave/bin/conmon --runtime=/var/lib/easyenclave/bin/crun --storage-driver=$DRIVER --root=/var/lib/easyenclave/ollama/.podman/storage --runroot=/var/lib/easyenclave/ollama/.podman/runroot --cgroup-manager=cgroupfs \\\"\\$@\\\"\" > $BIN/dd-podman\nchmod +x $BIN/dd-podman\nls -la $CONF $POL $BIN/dd-podman 2>&1 || true\ncat $CONF\necho podman-bootstrap: v2 ok driver=$DRIVER conf=$CONF policy=$POL" + "set -e\nBIN=/var/lib/easyenclave/bin\nSRC=$BIN/podman-linux-amd64\nuntil [ -x $SRC/usr/local/bin/podman ]; do sleep 1; done\n# Wait for mount-models to mount /dev/vdc before writing under\n# /var/lib/easyenclave/ollama — otherwise writes land on tmpfs\n# and get shadowed the moment vdc is mounted. On VMs without vdc\n# (e.g. 
CP previews with no models disk) this check short-circuits.\nif [ -b /dev/vdc ]; then\n until mountpoint -q /var/lib/easyenclave/ollama 2>/dev/null; do sleep 1; done\nfi\nmkdir -p /var/lib/easyenclave/ollama\n# Stage helpers first (conmon, netavark, crun, etc.).\nfor f in $SRC/usr/local/bin/*; do\n name=$(basename $f)\n case $name in\n podman) cp -f $f $BIN/.podman-raw ;;\n *) cp -f $f $BIN/ ;;\n esac\ndone\ncp -f $SRC/usr/local/lib/podman/conmon $BIN/\ncp -f $SRC/usr/local/lib/podman/netavark $BIN/ 2>/dev/null || true\ncp -f $SRC/usr/local/lib/podman/aardvark-dns $BIN/ 2>/dev/null || true\ncp -f $SRC/usr/local/lib/podman/rootlessport $BIN/ 2>/dev/null || true\nmkdir -p /var/lib/easyenclave/ollama/.podman/storage /var/lib/easyenclave/ollama/.podman/runroot\n# /dev/shm holds podman's per-container POSIX shm lock file\n# (libpod_lock). EE may not mount tmpfs there; without it, podman\n# fails `failed to create 2048 locks in /libpod_lock`. Idempotent.\nif ! mountpoint -q /dev/shm 2>/dev/null; then\n mkdir -p /dev/shm\n mount -t tmpfs -o size=64M tmpfs /dev/shm 2>/dev/null || true\nfi\n# Pick storage driver: overlay on vdc-backed ext4; vfs elsewhere\n# (overlay-on-tmpfs errors out).\nif mountpoint -q /var/lib/easyenclave/ollama; then\n DRIVER=overlay\nelse\n DRIVER=vfs\nfi\nPOL=/var/lib/easyenclave/ollama/.podman/policy.json\nprintf '%s' '{\"default\":[{\"type\":\"insecureAcceptAnything\"}]}' > $POL\n# /etc and /root are RO on EE. Build a writable fake HOME for\n# policy.json + podman's default lookups.\nHOME_DIR=/var/lib/easyenclave/.home\nmkdir -p $HOME_DIR/.config/containers $HOME_DIR/tmp\ncp -f $POL $HOME_DIR/.config/containers/policy.json\nCONF=/var/lib/easyenclave/ollama/.podman/containers.conf\nprintf '%s\\n' '[engine]' 'helper_binaries_dir = [\"/var/lib/easyenclave/bin\"]' > $CONF\n# Wrapper installed as $BIN/podman so bare `podman ps` (from PATH)\n# reaches the right storage root + driver. Raw binary lives at\n# $BIN/.podman-raw. 
$BIN/dd-podman stays as a back-compat symlink\n# since openclaw's workload calls dd-podman by name.\nprintf '%s\\n' '#!/bin/sh' \"export HOME=$HOME_DIR\" \"export TMPDIR=$HOME_DIR/tmp\" \"export CONTAINERS_CONF=$CONF\" \"exec $BIN/.podman-raw --conmon=$BIN/conmon --runtime=$BIN/crun --storage-driver=$DRIVER --root=/var/lib/easyenclave/ollama/.podman/storage --runroot=/var/lib/easyenclave/ollama/.podman/runroot --cgroup-manager=cgroupfs \\\"\\$@\\\"\" > $BIN/podman\nchmod +x $BIN/podman\nln -sf podman $BIN/dd-podman\nls -la $CONF $POL $BIN/podman $BIN/dd-podman $BIN/.podman-raw 2>&1 || true\necho podman-bootstrap: ok driver=$DRIVER conf=$CONF policy=$POL" ] }
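Reviewer note: the wrapper layout in the podman-bootstrap change above (raw binary at `.podman-raw`, wrapper installed as `podman`, `dd-podman` kept as a symlink) can be tried outside an enclave with the minimal sketch below. Everything in it is an assumption made for illustration: the temp directory stands in for `/var/lib/easyenclave/bin`, the "raw binary" is a fake script that only echoes its arguments, and a single `--storage-driver=vfs` flag stands in for the full flag set the real wrapper passes.

```shell
#!/bin/sh
set -e

# Hypothetical stand-in for /var/lib/easyenclave/bin; removed on exit.
BIN=$(mktemp -d)
trap 'rm -rf "$BIN"' EXIT

# Fake "raw" binary (NOT podman): it only echoes the arguments it receives.
printf '%s\n' '#!/bin/sh' 'echo "raw: $*"' > "$BIN/.podman-raw"
chmod +x "$BIN/.podman-raw"

# Wrapper installed under the plain name: fixed flags first, then the
# caller's arguments. \$@ is escaped so it expands when the wrapper runs,
# not when this script writes the file.
printf '%s\n' '#!/bin/sh' \
  "exec $BIN/.podman-raw --storage-driver=vfs \"\$@\"" > "$BIN/podman"
chmod +x "$BIN/podman"

# Back-compat name: a relative symlink to the wrapper in the same directory.
ln -sf podman "$BIN/dd-podman"

# Both entry points should reach the raw binary with the same fixed flags.
out_direct=$("$BIN/podman" ps)
out_compat=$("$BIN/dd-podman" ps)
echo "$out_direct"
echo "$out_compat"
```

Both names print the same line because the symlink resolves to the wrapper, which is the property callers of `dd-podman` rely on after this change.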