103 changes: 103 additions & 0 deletions .github/actions/relaunch-agent/action.yml
@@ -0,0 +1,103 @@
name: Relaunch local TDX agent
description: >-
SSH into the tdx2 host, recreate the matching dd-local-{kind} libvirt
domain against the given CP url (pulling apps/ from the given git ref),
then block until the agent re-registers with the CP. A release is "done"
only when this action succeeds end-to-end.

inputs:
kind:
description: 'prod | preview — which libvirt domain to relaunch'
required: true
url:
description: 'CP URL the agent should register against (e.g. https://app.devopsdefender.com)'
required: true
ref:
description: 'git ref whose scripts/apps tree dd-relaunch.sh should check out on the host'
required: true
ssh-key:
description: 'Private SSH key for tdx2@host'
required: true
host:
description: 'Public host address of the tdx2 node'
required: true
dd-pat:
description: 'GitHub PAT the agent uses to talk to the CP'
required: true
ita-api-key:
description: 'Intel Trust Authority API key for attestation'
required: true

runs:
using: composite
steps:
# CP must be reachable before we SSH — on PR pushes we race with
# Release's deploy-preview standing up the pr-N CP. /health is public.
- name: Wait for CP to be healthy
shell: bash
env:
URL: ${{ inputs.url }}
run: |
for i in $(seq 1 60); do
if curl -fsS --max-time 5 "$URL/health" >/dev/null 2>&1; then
echo "CP $URL healthy after ${i} attempts"
exit 0
fi
echo " waiting for $URL... (${i}/60)"
sleep 10
done
echo "::error::CP $URL never came up within 10 min"
exit 1

# SSH in and relaunch the VM (destroy + redefine + start). Finishes
# in ~10 s — the baked config.iso's EE_BOOT_WORKLOADS drives the rest.
- name: ssh + relaunch VM
shell: bash
env:
SSH_KEY: ${{ inputs.ssh-key }}
HOST: ${{ inputs.host }}
DD_PAT: ${{ inputs.dd-pat }}
DD_ITA_API_KEY: ${{ inputs.ita-api-key }}
KIND: ${{ inputs.kind }}
URL: ${{ inputs.url }}
REF: ${{ inputs.ref }}
run: |
mkdir -p ~/.ssh
printf '%s\n' "$SSH_KEY" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
ssh-keyscan -H "$HOST" >> ~/.ssh/known_hosts 2>/dev/null
ssh -o BatchMode=yes -o StrictHostKeyChecking=yes \
-i ~/.ssh/id_ed25519 "tdx2@$HOST" \
"DD_PAT='$DD_PAT' DD_ITA_API_KEY='$DD_ITA_API_KEY' /home/tdx2/src/dd/apps/_infra/dd-relaunch.sh '$KIND' '$URL' '$REF'"

# Block until the freshly-booted agent VM registers with the CP.
# This is the "I can see the local agent deployment worked" signal
# that gates the whole release. 5-min budget covers a cold VM boot
# (~60s) + cloudflared tunnel (~30s) + agent startup + register —
# plenty of headroom. Doesn't probe openclaw/ollama readiness —
# that first-boot pays a 30-min npm-install tax and isn't part
# of the release gate.
- name: Verify agent registered with CP
shell: bash
env:
URL: ${{ inputs.url }}
DD_PAT: ${{ inputs.dd-pat }}
KIND: ${{ inputs.kind }}
run: |
vm="dd-local-$KIND"
started_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)
AUTH=(-H "Authorization: Bearer $DD_PAT")
for i in $(seq 1 30); do
host=$(curl -fsS --max-time 10 "${AUTH[@]}" "$URL/api/agents" 2>/dev/null \
| jq -r --arg since "$started_at" --arg vm "$vm" '
[.[] | select(.vm_name==$vm and .status=="healthy" and .last_seen > $since)]
| sort_by(.last_seen) | reverse | .[0].hostname // empty' 2>/dev/null || true)
if [ -n "$host" ] && [ "$host" != "null" ]; then
echo "$vm registered at https://$host"
exit 0
fi
echo " waiting for $vm to register with $URL... (${i}/30)"
sleep 10
done
echo "::error::$vm never registered with $URL within 5 min"
exit 1
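
The two polling steps above (the CP health check and the agent registration check) share one shape: try a command, sleep, and give up after a fixed number of attempts. A minimal sketch of that pattern as a reusable bash function follows. `retry_until` is a hypothetical name that does not appear in this PR, and unlike the loops above it skips the sleep after the final failed attempt.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): retry a command until it
# succeeds or the attempt budget is spent.
# Usage: retry_until <max_attempts> <sleep_seconds> <command> [args...]
retry_until() {
  local max="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= max; i++)); do
    if "$@"; then
      echo "succeeded after ${i} attempt(s)"
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if ((i < max)); then
      sleep "$delay"
    fi
  done
  echo "gave up after ${max} attempt(s)" >&2
  return 1
}
```

Under these assumptions the health check could be written as `retry_until 60 10 curl -fsS --max-time 5 -o /dev/null "$URL/health"`, and the registration check as `retry_until 30 10 some_check_function`, where `some_check_function` (also hypothetical) wraps the `curl | jq` pipeline and returns non-zero while no matching agent is listed.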
152 changes: 110 additions & 42 deletions .github/workflows/cleanup.yml
@@ -1,41 +1,49 @@
name: Cleanup

# Reap TERMINATED dd-{env}-* VMs. STONITH self-poweroff leaves the VM
# in TERMINATED state — it uses no compute but clutters the inventory
# and a long enough chain of deploys turns into pages of dead VMs.
# Background safety net that reaps GCE VMs the primary cleanup paths
# missed. Primary paths today:
#
# Two jobs run in parallel, one per environment, so a regression in
# either auth/zone/project doesn't block the other. The cleanup is
# idempotent: skip if nothing to reap.
# - STONITH: dd-register deletes the old VM's CF tunnel on startup →
# old cloudflared exits → old dd-register poweroffs →
# TERMINATED. Happens on every deploy of the same env.
# - Teardown: pr-teardown.yml fires on branch-delete and deletes the
# VM + tunnel + DNS. Happens when a dev deletes the branch.
#
# Gaps this workflow covers:
# - TERMINATED VMs accumulate between STONITH and branch-delete.
# - A PR that's merged/closed but whose branch survives → the preview
# VM stays RUNNING forever, burning compute. reap-merged-pr-previews
# finds these and treats them like a branch-delete (VM + tunnel + DNS).
#
# Triggers:
# - workflow_dispatch (operator-initiated cleanup)
# - workflow_run completion of Release / Production Deploy (catch
# post-deploy zombies opportunistically)
# - workflow_run completion of Release (catch post-deploy zombies
# opportunistically; Release covers both preview and prod now)
# - schedule, every 6 hours (background safety net)

on:
workflow_dispatch:
workflow_run:
workflows: ["Release", "Production Deploy"]
workflows: ["Release"]
types: [completed]
schedule:
- cron: '0 */6 * * *'

# Don't pile up identical reaps when several deploys land in quick
# succession — one in-flight reap is enough.
concurrency:
group: dd-cleanup
cancel-in-progress: false

permissions:
contents: read

env:
GCP_ZONE: us-central1-c

jobs:
# PR preview envs (dd_env=pr-*) accumulate during active PRs — every
# push STONITHs the old VM into TERMINATED. PR close runs
# pr-teardown.yml which deletes them, but between pushes they stack
# up. This job reaps them in place.
# PR preview envs (dd_env=pr-*) accumulate TERMINATED VMs during
# active PRs — every push STONITHs the old VM. Branch-delete reaps
# the matching VMs; between pushes they stack up. This reaps them
# in place.
reap-pr-previews:
runs-on: ubuntu-latest
environment: staging
@@ -51,11 +59,7 @@ jobs:
- name: Reap TERMINATED dd-pr-* VMs
env:
GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
GCP_ZONE: us-central1-c
run: |
# gcloud filter regex: `~` matches against the value. Anchor
# to start so we don't accidentally match an env like
# "foo-pr-bar" in the future.
DEAD=$(gcloud compute instances list \
--project="$GCP_PROJECT_ID" \
--filter='labels.devopsdefender=managed AND labels.dd_env~"^pr-" AND status=TERMINATED' \
@@ -69,62 +73,126 @@ jobs:
gcloud compute instances delete $DEAD \
--project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet

reap-staging:
reap-production:
runs-on: ubuntu-latest
environment: staging
environment: production
permissions:
contents: read
id-token: write
steps:
- uses: google-github-actions/auth@v2
with:
workload_identity_provider: 'projects/654815109728/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider'
service_account: 'easyenclave-staging-ci@eestaging.iam.gserviceaccount.com'
workload_identity_provider: 'projects/779946350556/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider'
service_account: 'easyenclave-production-ci@easyenclave.iam.gserviceaccount.com'
- uses: google-github-actions/setup-gcloud@v2
- name: Reap TERMINATED dd-staging VMs
- name: Reap TERMINATED dd-production VMs
env:
GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
GCP_ZONE: us-central1-c
run: |
DEAD=$(gcloud compute instances list \
--project="$GCP_PROJECT_ID" \
--filter="labels.devopsdefender=managed AND labels.dd_env=staging AND status=TERMINATED" \
--filter="labels.devopsdefender=managed AND labels.dd_env=production AND status=TERMINATED" \
--format="value(name)")
if [ -z "$DEAD" ]; then
echo "No TERMINATED dd-staging VMs to reap."
echo "No TERMINATED dd-production VMs to reap."
exit 0
fi
echo "Reaping: $(echo "$DEAD" | tr '\n' ' ')"
# shellcheck disable=SC2086
gcloud compute instances delete $DEAD \
--project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet

reap-production:
# RUNNING pr-N VMs whose PR is merged or closed are leaked compute —
# neither STONITH (waits for a new deploy) nor pr-teardown.yml (waits
# for branch-delete) reaches them. This finds them and tears them
# down like a branch-delete would have: VM + CF tunnel + DNS CNAME.
reap-merged-pr-previews:
runs-on: ubuntu-latest
environment: production
environment: staging
permissions:
contents: read
id-token: write
pull-requests: read
steps:
- uses: google-github-actions/auth@v2
with:
workload_identity_provider: 'projects/779946350556/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider'
service_account: 'easyenclave-production-ci@easyenclave.iam.gserviceaccount.com'
workload_identity_provider: 'projects/654815109728/locations/global/workloadIdentityPools/github-actions-pool/providers/github-provider'
service_account: 'easyenclave-staging-ci@eestaging.iam.gserviceaccount.com'
- uses: google-github-actions/setup-gcloud@v2
- name: Reap TERMINATED dd-production VMs
- name: Reap RUNNING pr-N VMs whose PR is closed or merged
env:
GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
GCP_ZONE: us-central1-c
CF_API_TOKEN: ${{ secrets.DD_CP_CF_API_TOKEN }}
CF_ACCOUNT_ID: ${{ secrets.DD_CP_CF_ACCOUNT_ID }}
CF_ZONE_ID: ${{ secrets.DD_CP_CF_ZONE_ID }}
DD_DOMAIN: ${{ vars.DD_CF_DOMAIN || 'devopsdefender.com' }}
GH_TOKEN: ${{ github.token }}
run: |
DEAD=$(gcloud compute instances list \
# Unique set of pr-N envs currently RUNNING.
envs=$(gcloud compute instances list \
--project="$GCP_PROJECT_ID" \
--filter="labels.devopsdefender=managed AND labels.dd_env=production AND status=TERMINATED" \
--format="value(name)")
if [ -z "$DEAD" ]; then
echo "No TERMINATED dd-production VMs to reap."
--filter='labels.devopsdefender=managed AND labels.dd_env~"^pr-" AND status=RUNNING' \
--format='value(labels.dd_env)' | sort -u)
if [ -z "$envs" ]; then
echo "No RUNNING dd-pr-* VMs to consider."
exit 0
fi
echo "Reaping: $(echo "$DEAD" | tr '\n' ' ')"
# shellcheck disable=SC2086
gcloud compute instances delete $DEAD \
--project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet

for env in $envs; do
pr="${env#pr-}"
state=$(gh pr view "$pr" --repo "${{ github.repository }}" \
--json state --jq .state 2>/dev/null || echo "UNKNOWN")
if [ "$state" = "OPEN" ]; then
echo "pr-$pr still OPEN — leaving RUNNING VMs alone."
continue
fi
if [ "$state" = "UNKNOWN" ]; then
echo "::warning::could not resolve state for pr-$pr (gh pr view failed); leaving alone"
continue
fi
echo "pr-$pr is $state — tearing down preview env $env"

# VMs
vms=$(gcloud compute instances list \
--project="$GCP_PROJECT_ID" \
--filter="labels.devopsdefender=managed AND labels.dd_env=$env" \
--format='value(name)')
if [ -n "$vms" ]; then
echo " deleting VMs: $(echo "$vms" | tr '\n' ' ')"
# shellcheck disable=SC2086
gcloud compute instances delete $vms \
--project="$GCP_PROJECT_ID" --zone="$GCP_ZONE" --quiet
fi

# CF tunnels — named `dd-{env}-{uuid}`.
resp=$(curl -fsS \
-H "Authorization: Bearer $CF_API_TOKEN" \
"https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/cfd_tunnel?is_deleted=false&per_page=200")
ids=$(echo "$resp" | jq -r --arg prefix "dd-$env-" \
'.result[] | select(.name | startswith($prefix)) | .id')
for id in $ids; do
echo " deleting tunnel $id"
curl -fsS -X DELETE \
-H "Authorization: Bearer $CF_API_TOKEN" \
"https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/cfd_tunnel/$id/connections" \
>/dev/null || true
curl -fsS -X DELETE \
-H "Authorization: Bearer $CF_API_TOKEN" \
"https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/cfd_tunnel/$id" \
>/dev/null || echo "::warning::tunnel $id delete failed (may already be gone)"
done

# DNS CNAME for pr-N.{domain}
hostname="$env.$DD_DOMAIN"
record_id=$(curl -fsS \
-H "Authorization: Bearer $CF_API_TOKEN" \
"https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records?type=CNAME&name=$hostname" \
| jq -r '.result[0].id // empty')
if [ -n "$record_id" ]; then
echo " deleting CNAME $hostname ($record_id)"
curl -fsS -X DELETE \
-H "Authorization: Bearer $CF_API_TOKEN" \
"https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$record_id" \
>/dev/null
fi
done
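
The reap-merged-pr-previews job above makes two decisions per env: whether to tear down at all (based on the PR state reported by `gh pr view`), and which resource names to target (PR number, tunnel name prefix, DNS hostname). A minimal sketch of that logic as pure bash functions, testable without touching GCP or Cloudflare, might look like this. `teardown_decision` and `preview_targets` are hypothetical names introduced for illustration; the naming rules (`dd-{env}-` tunnel prefix, `{env}.{domain}` hostname) are taken from the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of this PR) mirroring the job's logic.

# Map a PR state to an action. OPEN previews are kept, an unresolved
# state is skipped with no action, and anything else (MERGED, CLOSED)
# is torn down.
teardown_decision() {
  local state="$1"
  case "$state" in
    OPEN) echo "keep" ;;
    UNKNOWN) echo "skip" ;;
    *) echo "teardown" ;;
  esac
}

# Derive the names the teardown targets from an env label like "pr-42".
preview_targets() {
  local env="$1" domain="$2"
  printf 'pr=%s\n' "${env#pr-}"               # PR number passed to gh
  printf 'tunnel_prefix=%s\n' "dd-${env}-"    # CF tunnels: dd-{env}-{uuid}
  printf 'hostname=%s\n' "${env}.${domain}"   # DNS CNAME to delete
}
```

For example, `preview_targets pr-42 devopsdefender.com` prints `pr=42`, `tunnel_prefix=dd-pr-42-`, and `hostname=pr-42.devopsdefender.com`, one per line. The trailing dash in the tunnel prefix matters: it keeps `pr-4` from also matching tunnels that belong to `pr-42`.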