
ci: fold every deploy path into Release, drop scripts/#131

Closed: posix4e wants to merge 1 commit into feat/unify-workloads from feat/fold-into-release



posix4e (Member) commented Apr 18, 2026

Summary

One workflow, release.yml, owns the full lifecycle now:

  • PR → build → deploy-preview (deploy-cp.yml, preview inputs) → blocking relaunch of dd-local-preview
  • push main → build → deploy-production (deploy-cp.yml, prod inputs) → blocking relaunch of dd-local-prod
  • workflow_dispatch → deploy-production with rollback-pick release_tag input
  • push v* → build only (versioned artifact, no deploy)
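The trigger-to-job routing in the list above can be sketched as a small pure function. This is only an illustration of the routing table, not the contents of release.yml: the function name `plan_jobs` is hypothetical, and the dispatch path follows this PR's summary, which lists deploy-production alone. The event names and ref prefixes are the standard values GitHub Actions exposes as `github.event_name` and `github.ref`.

```python
from typing import List


def plan_jobs(event_name: str, ref: str = "") -> List[str]:
    """Return the ordered job names a Release run would execute.

    Hypothetical helper mirroring the four trigger paths in the summary.
    """
    if event_name == "pull_request":
        # PR: build, then deploy a preview CP (relaunch of dd-local-preview follows).
        return ["build", "deploy-preview"]
    if event_name == "workflow_dispatch":
        # Manual dispatch: deploy the tag chosen via the release_tag input.
        return ["deploy-production"]
    if event_name == "push" and ref == "refs/heads/main":
        # Merge to main: build, then deploy production.
        return ["build", "deploy-production"]
    if event_name == "push" and ref.startswith("refs/tags/v"):
        # Version tag: build the versioned artifact only, no deploy.
        return ["build"]
    # Anything else is outside the four documented paths.
    return []
```

For example, `plan_jobs("push", "refs/tags/v0.2.0")` returns `["build"]`, matching the "build only" path.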

A release goes green only when the matching dd-local-{kind} VM re-registers with the freshly deployed CP within 5 minutes. That is the gate I want: before I merge, I want to see that the local agent deployment worked.

Deleted:

  • production-deploy.yml — folded into release.yml as a deploy-production job.
  • local-agents.yml — manual-dispatch path gone; push a commit to trigger a relaunch via the cascade.
  • scripts/gcp-deploy.sh: inlined into deploy-cp.yml (the bake/join helpers become a ~10-line inline envsubst + jq pipeline).
  • scripts/dd-relaunch.sh and scripts/local-agents.sh: moved to apps/_infra/ (host-side scripts, consumed by the tdx2 host via git checkout).
  • scripts/workloads.sh and scripts/redeploy-workload.sh: deleted as unused.

.github/actions/relaunch-agent/action.yml gains a blocking Verify agent registered with CP step (polls /api/agents for 5 min); deploy-cp.yml's Relaunch step drops continue-on-error: true.
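The blocking verify step described above is, in essence, a poll-until-deadline loop. A minimal sketch follows, assuming the /api/agents response can be read as a list of records with a `name` and a `registered_at` timestamp; those field names, the function name `wait_for_agent`, and the 10-second interval are assumptions for illustration, not taken from the action itself. The fetch function is injected so the loop does not depend on any particular HTTP client.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_for_agent(
    fetch_agents: Callable[[], Iterable[Mapping]],
    vm_name: str,
    relaunched_at: float,
    budget_s: float = 300.0,   # the 5-minute budget from the PR description
    interval_s: float = 10.0,  # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once vm_name shows a registration newer than relaunched_at.

    Returns False if the budget runs out first, which is what would fail
    the release instead of letting it go green.
    """
    deadline = clock() + budget_s
    while True:
        for agent in fetch_agents():
            # "name" and "registered_at" are hypothetical field names.
            fresh = agent.get("registered_at", 0) > relaunched_at
            if agent.get("name") == vm_name and fresh:
                return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Comparing against the relaunch time matters: a stale entry left over from before the relaunch should not count as success.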

Net: -235 lines, three files gone, one signal that actually matches intent.

Test plan

  • Throwaway PR → Release runs build → deploy-preview → dd-local-preview relaunches and re-registers → PR goes green.
  • Merge to main → Release runs build → deploy-production → dd-local-prod relaunches and re-registers.
  • workflow_dispatch with release_tag: v0.2.0 → deploy-production deploys that tag; dd-local-prod re-registers.
  • Two rapid PR pushes → old Release run cancels, new one runs (PR concurrency).
  • Two rapid main pushes → both runs queue, second waits for first (main concurrency never cancels).
  • ls scripts/ fails; ls apps/_infra/ shows dd-relaunch.sh + local-agents.sh.

Stacked under #127 — retarget to main after #127 merges.

🤖 Generated with Claude Code

Release now owns the full lifecycle: build → deploy-preview (PR) OR
deploy-production (main / dispatch) → relaunch-agent (blocking) →
verify agent re-registered with CP. A release is "done" only when the
local dd-local-{kind} VM is back online talking to the freshly-deployed
CP — that's the signal that tells us a PR is safe to merge or a merge
actually shipped.

Deleted:
  .github/workflows/production-deploy.yml — folded into release.yml
    as a deploy-production job with same deploy-cp.yml body.
  .github/workflows/local-agents.yml — manual-dispatch path gone;
    push a commit to trigger a relaunch via the cascade.

Deleted scripts/:
  scripts/gcp-deploy.sh     — inlined into deploy-cp.yml.
  scripts/dd-relaunch.sh    → apps/_infra/dd-relaunch.sh (host-side).
  scripts/local-agents.sh   → apps/_infra/local-agents.sh (host-side).
  scripts/workloads.sh      — dead after inline; only gcp-deploy
                              sourced it and local-agents.sh built
                              workloads via inline jq anyway.
  scripts/redeploy-workload.sh — unused helper, removed.

deploy-cp.yml's Relaunch step drops `continue-on-error: true`; the
relaunch-agent composite gains a "Verify agent registered with CP"
step that polls /api/agents for a freshly-registered dd-local-{kind}
entry with a 5-min budget.

Concurrency on release.yml becomes expression-driven: PR pushes cancel
in-progress runs; main / tag / dispatch queue so an in-flight prod
deploy finishes cleanly.
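
The expression-driven concurrency rule can be modelled as a function from the triggering event to a (group, cancel-in-progress) pair. This is a sketch of the decision only; the group naming scheme and the function name `concurrency_for` are hypothetical, and the real workflow would express the same logic in its `concurrency:` block.

```python
from typing import Optional, Tuple


def concurrency_for(
    event_name: str, ref: str, pr_number: Optional[int] = None
) -> Tuple[str, bool]:
    """Return (concurrency group, cancel_in_progress) for a Release run."""
    if event_name == "pull_request":
        # A newer push to the same PR supersedes the in-progress run.
        return (f"release-pr-{pr_number}", True)
    # main / tag / dispatch: never cancel, so an in-flight prod deploy finishes.
    return (f"release-{ref}", False)
```

Under this model, two rapid pushes to PR 131 share the group `release-pr-131` and the older run is cancelled, while two pushes to main share a group and the second waits.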

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

DD preview ready

URL: https://pr-131.devopsdefender.com

Browser login: paste gh auth token output at https://pr-131.devopsdefender.com/auth/pat

CLI / curl: curl -H "Authorization: Bearer $(gh auth token)" https://pr-131.devopsdefender.com/

Register endpoint for a local agent: wss://pr-131.devopsdefender.com/register

posix4e added a commit that referenced this pull request Apr 18, 2026
Audit via gcloud revealed dd-pr-121-1776434711 running in staging
a day after PR #121 merged (branch not deleted → pr-teardown.yml
never fired; cleanup.yml only reaps TERMINATED, not RUNNING). Closes
that gap and tidies what audit found along the way.

Changes:
  - Delete .github/workflows/retire-staging.yml. One-shot to retire
    the legacy app-staging.{domain} env; has already been run, no
    dd_env=staging VMs exist. Header itself said "delete after use".
  - cleanup.yml: drop reap-staging job (filters dd_env=staging, dead).
  - cleanup.yml: drop "Production Deploy" from workflow_run trigger
    (that workflow is being removed in #131 / follow-ups). Release
    alone now covers both preview and prod post-deploy hooks.
  - cleanup.yml: add reap-merged-pr-previews. For each pr-N env with a
    RUNNING VM, resolve the PR's state via gh; if MERGED/CLOSED, tear
    down VM + CF tunnel + DNS CNAME like pr-teardown.yml would have.
    Scheduled sweep + workflow_run + dispatch.
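
The selection logic behind reap-merged-pr-previews can be sketched as follows. It assumes preview VM names follow the `dd-pr-<number>-<suffix>` shape of the one VM named above, and that PR state is resolved by a caller-supplied lookup (for instance one wrapping `gh pr view <number> --json state`). The names `pr_number` and `vms_to_reap` are hypothetical, and the actual teardown of VM, CF tunnel, and DNS CNAME is left out.

```python
import re
from typing import Callable, Iterable, List, Optional

# Assumed naming shape, based on dd-pr-121-1776434711.
PR_VM = re.compile(r"^dd-pr-(\d+)-\d+$")


def pr_number(vm_name: str) -> Optional[int]:
    """Extract the PR number from a preview VM name, or None if it is not one."""
    match = PR_VM.match(vm_name)
    return int(match.group(1)) if match else None


def vms_to_reap(
    running_vms: Iterable[str],
    pr_state: Callable[[int], Optional[str]],
) -> List[str]:
    """Return the RUNNING preview VMs whose PR is already MERGED or CLOSED."""
    doomed = []
    for name in running_vms:
        number = pr_number(name)
        if number is not None and pr_state(number) in {"MERGED", "CLOSED"}:
            doomed.append(name)
    return doomed
```

Keying the sweep on PR state rather than on branch deletion is what closes the gap: a merged PR whose branch was never deleted still gets reaped.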

Deferred (blocked on easyenclave): image-family split preview→staging /
prod→stable. easyenclave has no easyenclave-stable GCP family and no
qcow2 on v0.1.14 today, so there's nothing to point prod at. Ship the
split once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
posix4e added a commit that referenced this pull request Apr 18, 2026
Consolidates PRs #127, #131, #132 into one commit. Net -381 lines,
one "Release" workflow drives the whole fleet lifecycle.

Workload spec:
  apps/<name>/workload.{json,json.tmpl} becomes the single source of
  truth for every EE workload (cloudflared, dd-agent, dd-management,
  ollama, openclaw, nv, mount-models, podman-static, podman-bootstrap).
  Boot-time (config.iso / ee-config metadata) and runtime (/deploy)
  both bake from the same file.

Workflow topology:
  release.yml is the one entry point.
    pull_request      → build → deploy-preview      → dd-local-preview relaunch
    push main         → build → deploy-production   → dd-local-prod    relaunch
    push v*           → build only (versioned artifact, no deploy)
    workflow_dispatch → build → deploy-production   (rollback: release_tag input)
  .github/workflows/deploy-cp.yml is the reusable workflow both paths
  call, so preview CI exercises the exact code prod uses. It provisions
  the GCP CP VM, verifies health + /cp/attest MRTD + dashboard + STONITH,
  comments on PR, and cascades agent relaunch. The cascade is blocking:
  a release goes green only when the matching dd-local-{kind} VM
  re-registers with the freshly-deployed CP (proves "everything works").
  .github/actions/relaunch-agent/ is the composite action that SSHes
  into tdx2, runs dd-relaunch.sh, and polls /api/agents for the
  freshly-registered entry (5-min budget).

Deleted:
  .github/workflows/production-deploy.yml (folded into release.yml
    as deploy-production job).
  .github/workflows/local-agents.yml (manual-dispatch path gone;
    push a commit to trigger a relaunch).
  .github/workflows/retire-staging.yml (one-shot, already run,
    no dd_env=staging VMs exist).
  scripts/ entirely. gcp-deploy.sh inlined into deploy-cp.yml;
  dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side);
  ollama-deploy.sh, redeploy-workload.sh, workloads.sh deleted as
  unused after the refactor.

Cleanup audit:
  gcloud survey revealed dd-pr-121-1776434711 RUNNING in staging a
  day after PR #121 merged — branch not deleted, so pr-teardown.yml
  never fired and cleanup.yml only reaps TERMINATED. Added a
  reap-merged-pr-previews job to cleanup.yml that resolves each
  RUNNING pr-N VM's PR state via gh and tears down VM + CF tunnel
  + DNS CNAME when MERGED/CLOSED. Dropped reap-staging (dead),
  trimmed workflow_run trigger (Production Deploy is gone).

Deferred (blocked on easyenclave):
  easyenclave image family split — preview → easyenclave-staging,
  prod → easyenclave-stable. easyenclave has no stable GCP family
  and no qcow2 on v0.1.14 today, so nothing to point prod at. Plus
  tdx2 auto-refresh of the base qcow2 from easyenclave releases.
  Follow-up PR once easyenclave publishes stable images.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

posix4e commented Apr 18, 2026

Folded into #132 (single-commit consolidation).

posix4e closed this Apr 18, 2026
posix4e added a commit that referenced this pull request Apr 18, 2026
…ollama+openclaw

Consolidates PRs #127, #131, #132, #133 into one commit. Every fleet
path runs through one workflow, every agent VM boots the full
container stack, and audit-surfaced cleanup gaps are closed.

## Workload spec — single source of truth

apps/<name>/workload.{json,json.tmpl} is the one place a workload is
defined. Boot-time (config.iso) and runtime (/deploy) bake from the
same file. New apps/README.md documents the schema, the lifecycle
matrix (CP / preview agent / prod agent), ordering via `until` polling,
and a "deploying your own" walkthrough. Main README points at it.

The bake helper (envsubst + jq strip-empty-env-entries) is inlined in
two places — .github/workflows/deploy-cp.yml (CI, CP workloads) and
apps/_infra/local-agents.sh (tdx2, agent workloads) — restricted to
the uppercase ${VAR} refs each template declares so shell locals
($i, $((…))) in cmd strings aren't eaten.
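
To illustrate why the substitution is restricted to uppercase ${VAR} refs, here is a rough Python equivalent of the bake step. It is not the inline envsubst + jq pipeline itself: the function name `bake`, the `env` key, and the `KEY=value` entry shape are assumptions made for the example.

```python
import json
import re
from typing import Mapping

# Matches only braced, uppercase references such as ${TOKEN};
# shell locals like $i or $((i+1)) do not match and pass through untouched.
VAR_REF = re.compile(r"\$\{([A-Z][A-Z0-9_]*)\}")


def bake(template: str, env: Mapping[str, str]) -> dict:
    """Render a workload template and strip env entries with empty values."""
    # Unset variables render as empty strings, as envsubst would do.
    rendered = VAR_REF.sub(lambda m: env.get(m.group(1), ""), template)
    spec = json.loads(rendered)
    if "env" in spec:
        # Assumed entry shape "KEY=value"; drop entries whose value is empty.
        spec["env"] = [e for e in spec["env"] if e.partition("=")[2] != ""]
    return spec
```

Like envsubst, this sketch does no JSON escaping of substituted values, so values containing quotes or backslashes would need extra care.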

## Workflow topology — one entry point

release.yml drives everything:
  pull_request     → build → deploy-preview → dd-local-preview relaunch
  push main        → build → deploy-production → dd-local-prod relaunch
  push v*          → build only (versioned artifact, no deploy)
  workflow_dispatch → build → deploy-production (rollback: release_tag input)

.github/workflows/deploy-cp.yml is the reusable workflow both paths
call. Preview CI exercises the exact code prod uses. It provisions
the GCP CP VM, verifies /cp/attest MRTD + dashboard + STONITH,
comments on PR, and cascades the agent relaunch — blocking on the
agent re-registering with the CP (proves "everything works").

.github/actions/relaunch-agent/ is the composite that SSHes into
tdx2, runs apps/_infra/dd-relaunch.sh, polls /api/agents for the
freshly-registered entry (5-min budget).

Retired workflows:
  production-deploy.yml → folded into release.yml as deploy-production
  local-agents.yml      → no remaining purpose (cascade handles relaunch)
  retire-staging.yml    → one-shot, already run

scripts/ deleted entirely. gcp-deploy.sh inlined into deploy-cp.yml;
dd-relaunch.sh + local-agents.sh moved to apps/_infra/ (host-side);
ollama-deploy.sh, redeploy-workload.sh, workloads.sh removed as unused.

## Container stack on agent VMs

dd-local-{preview,prod} now boot the full chain:
  nv (prod only) → mount-models → podman-static → podman-bootstrap
    → ollama (prod.json with GPU devices / preview.json CPU only)
    → openclaw (qwen2.5:7b on prod, qwen2.5:0.5b on preview)
    → cloudflared → dd-agent

podman-bootstrap installs the wrapper as `podman` (not `dd-podman`),
so bare `podman ps` from a guest shell reaches the right storage
root instead of failing with `mkdir /var/lib/containers: read-only
file system`. Raw binary moves to .podman-raw; dd-podman becomes a
symlink for back-compat.

## Cleanup audit

gcloud survey found dd-pr-121 RUNNING in staging a day after PR #121
merged (branch not deleted → pr-teardown.yml never fired;
cleanup.yml only reaps TERMINATED). Added reap-merged-pr-previews:
for each RUNNING pr-N VM, gh-resolves the PR state and tears down
VM + CF tunnel + DNS CNAME when MERGED/CLOSED. Dropped reap-staging
(dead — no dd_env=staging VMs exist), trimmed the workflow_run
trigger (Production Deploy is gone).

## Deferred (blocked on easyenclave)

Preview → easyenclave-staging / prod → easyenclave-stable image-family
split + tdx2 auto-refresh of the base qcow2. easyenclave has no stable
GCP family and no qcow2 on v0.1.14 today. Follow-up PR once easyenclave
publishes stable images with qcow2 + a stable GCP image family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
posix4e added a commit that referenced this pull request Apr 18, 2026
…ollama+openclaw (#132)

posix4e deleted the feat/fold-into-release branch April 18, 2026 21:57