Epic 7+8: VM smoke tests + supportability docs

SecAI-Hub · claude · SecAI-Hub · commit 71f6ec0f7bde · 2026-03-28T15:49:15.000-07:00
Epic 7 — Implement vm-boot-smoke.yml:
Replace echo placeholders with real SSH verification commands. All checks
run inside the guest via SSH: systemd state, auth/health endpoints,
disabled services (offline_private profile), active profile state, vault
API, quarantine dir, and rpm-ostree deployment. Adds SSH key generation,
cloud-init ISO prep, QCOW2 build, and QEMU/KVM boot. Enables the nightly
schedule (cron "17 3 * * *") and gates the job behind vars.HAS_KVM_RUNNER.
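
A minimal sketch of the readiness wait that precedes the in-guest checks, assuming a hypothetical helper named `wait_for` (it is not part of the workflow). It retries a command a fixed number of times and returns non-zero on timeout, so a guest that never comes up fails the step instead of passing silently:

```shell
# wait_for ATTEMPTS DELAY CMD...: hypothetical helper, for illustration only.
# Runs CMD until it succeeds; returns 1 if every attempt fails.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "FAIL: not ready after $attempts attempts" >&2
  return 1
}

# Possible use against the smoke-test guest (60 attempts, 5 s apart):
#   wait_for 60 5 ssh -o ConnectTimeout=5 -i /tmp/vm-smoke-key \
#     -p 2222 root@localhost true
```

Returning a status rather than calling `exit` keeps the helper usable both inside a workflow step and in a local script.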

Epic 8 — Supportability and scope:
- Install path support matrix in support-lifecycle.md (production vs
  evaluation vs development for each install method)
- Production support statement: bare-metal is production, VM is evaluation.
  ISO path noted as newly added, production-ready after one stable cycle.
- Install artifacts section in release-policy.md: documents ISO (always),
  QCOW2/OVA (when KVM runner available), all built from same OCI image
- Telemetry statement in README linking to docs/telemetry-policy.md
  (created in Epic 6)
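
The artifact rule from the release-policy bullet can be sketched as a small shell function. The name `expected_artifacts`, the lowercase artifact labels, and the comparison against the string `true` are illustrative assumptions; the authoritative list is release-artifacts.json, whose schema is not shown in this commit:

```shell
# expected_artifacts HAS_KVM_RUNNER: hypothetical helper, for illustration only.
# Prints one expected install artifact per line.
expected_artifacts() {
  # The OCI image and the ISO are produced for every release.
  echo "oci-image"
  echo "iso"
  # QCOW2 and OVA are only built when a self-hosted KVM runner is available.
  if [ "$1" = "true" ]; then
    echo "qcow2"
    echo "ova"
  fi
}
```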

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/.github/workflows/vm-boot-smoke.yml b/.github/workflows/vm-boot-smoke.yml
@@ -3,13 +3,10 @@ name: VM Boot Smoke Test (Tier 2)
 # Tier 2: Real VM boot test on self-hosted KVM runner.
 # All checks run inside the guest via SSH (not host-side).
 #
-# STATUS: DRAFT — verification commands are scaffolding only (echo placeholders).
-#   This workflow will pass without testing anything. Automatic triggers are
-#   disabled until the self-hosted KVM runner is provisioned and the SSH
-#   verification commands are implemented.
+# REQUIRES: self-hosted runner with KVM, QEMU, cloud-utils, and ssh-keygen.
+# Gate: only runs when vars.HAS_KVM_RUNNER == 'true'.
 #
 # SECURITY: Never triggered by pull_request (fork safety for self-hosted runners).
-# To test manually: Actions → VM Boot Smoke Test → Run workflow
 
 on:
   workflow_dispatch:
@@ -18,15 +15,8 @@ on:
         description: "Exact image digest (sha256:...) to test. Required for manual runs."
         required: true
         type: string
-  # Nightly and push triggers disabled until verification commands are real.
-  # Uncomment when ready:
-  # schedule:
-  #   - cron: "17 3 * * *"
-  # push:
-  #   branches: [main, stable, 'release/**']
-    paths-ignore:
-      - "**.md"
-      - "docs/**"
+  schedule:
+    - cron: "17 3 * * *"
 
 concurrency:
   group: vm-boot-smoke-${{ github.ref }}
@@ -39,6 +29,7 @@ permissions:
 jobs:
   vm-boot-smoke:
     name: Boot VM & Verify Services
+    if: vars.HAS_KVM_RUNNER == 'true'
     runs-on: [self-hosted, linux, x64, kvm]
     timeout-minutes: 30
 
@@ -51,36 +42,24 @@ jobs:
           GH_TOKEN: ${{ github.token }}
         run: |
           if [ -n "${{ inputs.image_digest }}" ]; then
-            # Manual dispatch: use the exact digest provided by the operator.
             DIGEST="${{ inputs.image_digest }}"
           else
-            # Schedule/push: download the exact IMAGE_DIGEST artifact from the
-            # most recent successful build.yml run on this ref.
-            # NO skopeo, NO tag heuristics, NO registry-state dependencies.
-            # The test always runs against exactly the image produced by the
-            # corresponding build pipeline run.
             echo "Fetching IMAGE_DIGEST artifact from latest build workflow run..."
             RUN_ID=$(gh api "repos/${{ github.repository }}/actions/workflows/build.yml/runs?branch=${{ github.ref_name }}&status=success&per_page=1" \
               --jq '.workflow_runs[0].id' 2>/dev/null || echo "")
             if [ -z "$RUN_ID" ] || [ "$RUN_ID" = "null" ]; then
               echo "ERROR: No successful build.yml run found for ref ${{ github.ref_name }}"
-              echo "Run the build workflow first, or provide an exact digest via workflow_dispatch."
               exit 1
             fi
             echo "Using build run: ${RUN_ID}"
-
-            # Download the image-digest artifact published by the bluebuild job
             gh run download "$RUN_ID" -n image-digest -D /tmp/image-digest || {
               echo "ERROR: Could not download image-digest artifact from build run ${RUN_ID}"
-              echo "The build workflow must publish an IMAGE_DIGEST artifact."
               exit 1
             }
-
             if [ ! -f /tmp/image-digest/IMAGE_DIGEST ]; then
               echo "ERROR: IMAGE_DIGEST file not found in downloaded artifact"
               exit 1
             fi
-
             DIGEST=$(cat /tmp/image-digest/IMAGE_DIGEST | tr -d '[:space:]')
             if [ -z "$DIGEST" ] || [ "$DIGEST" = "unknown" ]; then
               echo "ERROR: IMAGE_DIGEST artifact contains invalid digest: '${DIGEST}'"
@@ -90,102 +69,137 @@ jobs:
           echo "digest=${DIGEST}" >> "$GITHUB_OUTPUT"
           echo "Image under test: ${DIGEST}"
 
-      - name: Prepare cloud-init
+      - name: Generate SSH key pair
+        run: |
+          ssh-keygen -t ed25519 -f /tmp/vm-smoke-key -N "" -q
+          echo "SSH_KEY=$(cat /tmp/vm-smoke-key.pub)" >> "$GITHUB_ENV"
+
+      - name: Prepare cloud-init ISO
         run: |
           mkdir -p /tmp/vm-smoke
-          cat > /tmp/vm-smoke/user-data <<'EOF'
+          cat > /tmp/vm-smoke/user-data <<EOF
           #cloud-config
           ssh_authorized_keys:
-            - ssh-ed25519 SMOKE_TEST_KEY_PLACEHOLDER
+            - ${SSH_KEY}
           runcmd:
             - systemctl is-system-running --wait || true
           EOF
-          cat > /tmp/vm-smoke/meta-data <<'EOF'
+          cat > /tmp/vm-smoke/meta-data <<'METAEOF'
           instance-id: secai-smoke-test
           local-hostname: secai-smoke
-          EOF
+          METAEOF
+          cloud-localds /tmp/vm-smoke/cloud-init.iso /tmp/vm-smoke/user-data /tmp/vm-smoke/meta-data
 
-      # The actual QEMU boot + SSH verification steps are environment-specific.
-      # This template shows the verification checks that run inside the guest.
+      - name: Build QCOW2 from image
+        run: |
+          bash scripts/vm/build-qcow2.sh --ci \
+            --image-ref "ghcr.io/secai-hub/secai_os@${{ steps.image.outputs.digest }}" \
+            /tmp/vm-smoke
 
       - name: Boot VM in QEMU/KVM
         run: |
-          echo "=== Boot VM with image digest: ${{ steps.image.outputs.digest }} ==="
-          echo "NOTE: Actual QEMU invocation is environment-specific."
-          echo "The self-hosted runner must have:"
-          echo "  - QEMU/KVM installed with nested virt or bare-metal KVM"
-          echo "  - The SecAI OS image pulled and converted to qcow2"
-          echo "  - SSH key pair for guest access"
-          echo ""
-          echo "Template QEMU command:"
-          echo "  qemu-system-x86_64 -enable-kvm -m 4G -smp 2 \\"
-          echo "    -drive file=secai-os.qcow2,if=virtio \\"
-          echo "    -cdrom cloud-init.iso \\"
-          echo "    -netdev user,id=net0,hostfwd=tcp::2222-:22 \\"
-          echo "    -device virtio-net-pci,netdev=net0 \\"
-          echo "    -nographic &"
+          qemu-system-x86_64 -enable-kvm -m 4G -smp 2 \
+            -drive file=/tmp/vm-smoke/secai-os.qcow2,if=virtio,format=qcow2 \
+            -cdrom /tmp/vm-smoke/cloud-init.iso \
+            -netdev user,id=net0,hostfwd=tcp::2222-:22 \
+            -device virtio-net-pci,netdev=net0 \
+            -nographic &
+          echo $! > /tmp/vm-smoke/qemu.pid
+          echo "QEMU started (PID: $(cat /tmp/vm-smoke/qemu.pid))"
 
       - name: Wait for SSH readiness
         run: |
           echo "Waiting for guest SSH (up to 5 minutes)..."
-          # for i in $(seq 1 60); do
-          #   ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
-          #     -p 2222 root@localhost echo "SSH ready" && break
-          #   sleep 5
-          # done
+          SSH_OPTS="-o ConnectTimeout=5 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/vm-smoke-key"
+          for i in $(seq 1 60); do
+            if ssh $SSH_OPTS -p 2222 root@localhost echo "SSH ready" 2>/dev/null; then
+              echo "Guest SSH is up (attempt $i of 60)"; exit 0
+            fi
+            sleep 5
+          done
+          echo "FAIL: guest SSH not reachable after 60 attempts"; exit 1
 
-      - name: "Check: first-boot flow completed"
+      - name: "Check: systemd system state"
         run: |
-          echo "ssh guest: systemctl is-system-running"
-          # ssh -p 2222 root@localhost "systemctl is-system-running"
+          SSH="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/vm-smoke-key -p 2222 root@localhost"
+          STATE=$($SSH "systemctl is-system-running" 2>/dev/null || true)
+          echo "System state: $STATE"
+          if [ "$STATE" != "running" ] && [ "$STATE" != "degraded" ]; then
+            echo "FAIL: expected running or degraded, got $STATE"
+            $SSH "systemctl --failed" || true
+            exit 1
+          fi
 
-      - name: "Check: auth/login endpoint responds"
+      - name: "Check: auth endpoint responds"
         run: |
-          echo "ssh guest: curl -sf http://127.0.0.1:8480/api/auth/status"
-          # ssh -p 2222 root@localhost "curl -sf http://127.0.0.1:8480/api/auth/status"
+          SSH="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/vm-smoke-key -p 2222 root@localhost"
+          $SSH "curl -sf http://127.0.0.1:8480/api/auth/status" || {
+            echo "FAIL: auth endpoint did not respond"
+            exit 1
+          }
 
       - name: "Check: health endpoint"
         run: |
-          echo "ssh guest: curl -sf http://127.0.0.1:8480/health"
-          # ssh -p 2222 root@localhost "curl -sf http://127.0.0.1:8480/health"
+          SSH="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/vm-smoke-key -p 2222 root@localhost"
+          $SSH "curl -sf http://127.0.0.1:8480/health" || {
+            echo "FAIL: health endpoint did not respond"
+            exit 1
+          }
 
-      - name: "Check: disabled services stay inactive"
+      - name: "Check: disabled services stay inactive (offline_private profile)"
         run: |
-          echo "ssh guest: verify disabled services"
-          # DISABLED=(
-          #   secure-ai-diffusion.service
-          #   secure-ai-airlock.service
-          #   secure-ai-tor.service
-          #   secure-ai-searxng.service
-          #   secure-ai-search-mediator.service
-          # )
-          # for svc in "${DISABLED[@]}"; do
-          #   ssh -p 2222 root@localhost "
-          #     if systemctl is-active --quiet $svc 2>/dev/null; then
-          #       echo 'FAIL: $svc is active (should be disabled)'
-          #       exit 1
-          #     fi
-          #     echo 'OK: $svc is inactive'
-          #   "
-          # done
-
-      - name: "Check: vault lock/unlock"
+          SSH="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/vm-smoke-key -p 2222 root@localhost"
+          DISABLED=(
+            secure-ai-diffusion.service
+            secure-ai-airlock.service
+            secure-ai-tor.service
+            secure-ai-searxng.service
+            secure-ai-search-mediator.service
+            secure-ai-enable-diffusion.path
+          )
+          for svc in "${DISABLED[@]}"; do
+            if $SSH "systemctl is-active --quiet $svc" 2>/dev/null; then
+              echo "FAIL: $svc is active (should be disabled in offline_private)"
+              exit 1
+            fi
+            echo "OK: $svc is inactive"
+          done
+
+      - name: "Check: default profile is offline_private"
         run: |
-          echo "ssh guest: curl -sf http://127.0.0.1:8480/api/vault/status"
-          # ssh -p 2222 root@localhost "curl -sf http://127.0.0.1:8480/api/vault/status"
+          SSH="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/vm-smoke-key -p 2222 root@localhost"
+          PROFILE=$($SSH "cat /var/lib/secure-ai/state/profile.json 2>/dev/null || echo '{}'")
+          echo "Profile state: $PROFILE"
+          # Informational only, no assertion: on first boot without the wizard, profile.json may not exist yet and the fallback is offline_private.
 
-      - name: "Check: model import/quarantine path"
+      - name: "Check: vault API responds"
         run: |
-          echo "ssh guest: test quarantine directory exists"
-          # ssh -p 2222 root@localhost "test -d /var/lib/secure-ai/quarantine"
+          SSH="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/vm-smoke-key -p 2222 root@localhost"
+          $SSH "curl -sf http://127.0.0.1:8480/api/vault/status" || {
+            echo "FAIL: vault status endpoint did not respond"
+            exit 1
+          }
 
-      - name: "Check: update/rollback mechanism"
+      - name: "Check: quarantine directory exists"
         run: |
-          echo "ssh guest: rpm-ostree status"
-          # ssh -p 2222 root@localhost "rpm-ostree status"
+          SSH="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/vm-smoke-key -p 2222 root@localhost"
+          $SSH "test -d /var/lib/secure-ai/quarantine" || {
+            echo "FAIL: quarantine directory does not exist"
+            exit 1
+          }
+
+      - name: "Check: rpm-ostree deployment"
+        run: |
+          SSH="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/vm-smoke-key -p 2222 root@localhost"
+          $SSH "rpm-ostree status" || {
+            echo "FAIL: rpm-ostree status failed"
+            exit 1
+          }
 
       - name: Cleanup VM
         if: always()
         run: |
-          echo "Shutting down VM..."
-          # kill %1 2>/dev/null || true
+          if [ -f /tmp/vm-smoke/qemu.pid ]; then
+            kill "$(cat /tmp/vm-smoke/qemu.pid)" 2>/dev/null || true
+          fi
+          rm -rf /tmp/vm-smoke /tmp/vm-smoke-key /tmp/vm-smoke-key.pub
diff --git a/README.md b/README.md
@@ -549,6 +549,10 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for local dev setup, coding standards, an
 
 See [SECURITY.md](SECURITY.md) for vulnerability reporting and threat boundaries.
 
+## Telemetry
+
+SecAI OS does not collect telemetry: no usage analytics, no crash reports, and no phone-home traffic. See [docs/telemetry-policy.md](docs/telemetry-policy.md).
+
 ## License
 
 [Apache License 2.0](LICENSE)
diff --git a/docs/release-policy.md b/docs/release-policy.md
@@ -144,6 +144,25 @@ git push origin v1.2.3
 
 ---
 
+## Install Artifacts
+
+Each tagged release may include bootable install artifacts in addition to the OCI image and Go binaries:
+
+| Artifact | Format | Produced By | Required |
+|----------|--------|-------------|----------|
+| OCI image | Container | BlueBuild (build.yml) | Always |
+| ISO | Bootable installer | isogenerator (release.yml) | Always |
+| QCOW2 | KVM/QEMU disk image | build-qcow2.sh on KVM runner | When `vars.HAS_KVM_RUNNER` is set |
+| OVA | VirtualBox/VMware appliance | build-ova.sh on KVM runner | When `vars.HAS_KVM_RUNNER` is set |
+
+All install artifacts are built from the same OCI image. After installation, the upgrade path is identical regardless of install method: `rpm-ostree upgrade`.
+
+QCOW2 and OVA may be absent in releases if the repository does not have a self-hosted KVM runner configured. The ISO is always produced on standard GitHub runners.
+
+See [release-artifacts.json](release-artifacts.json) for the machine-readable specification of expected artifacts.
+
+---
+
 ## Security Patch Policy
 
 | Severity | Target Response Time | Target Release Time |
diff --git a/docs/support-lifecycle.md b/docs/support-lifecycle.md
@@ -54,6 +54,35 @@ Last updated: 2026-03-14
 
 ---
 
+## Install Path Support Matrix
+
+| Install Path | Support Level | Security Features | Notes |
+|-------------|-------------|-------------------|-------|
+| **Bare metal (ISO)** | Production | Full: TPM2, Secure Boot, hardware isolation, fs-verity | Recommended for sensitive workloads |
+| **Bare metal (rebase)** | Production | Full | For existing Fedora Silverblue operators |
+| **VM (OVA/QCOW2)** | Evaluation | Limited: no TPM2 sealing, host visibility of VM memory | Not for sensitive model data |
+| **VM (manual)** | Community | Limited | Self-configured |
+| **Container (dev)** | Development | Minimal: no systemd hardening, no firewall, no vault | Service development only |
+
+### Production Support Statement
+
+Bare-metal installations using the **stable** release channel with digest-pinned images are the intended production path. This means:
+
+- Security patches within 72 hours of disclosure
+- Automated rollback via Greenboot
+- Supply-chain verification (cosign + SLSA3 provenance)
+- Documented recovery procedures
+
+VM installations are supported for **evaluation and development** only. Known limitations:
+- No TPM2 vault key sealing (secrets held in VM memory visible to host)
+- No Secure Boot chain verification (depends on hypervisor configuration)
+- Reduced GPU performance (passthrough required for full acceleration)
+- Host hypervisor has full visibility of guest memory
+
+> **Note:** The ISO install path is newly added and should be considered production-ready only after it has been exercised in at least one stable release cycle. Until then, the bare-metal rebase path remains the most validated production install method.
+
+---
+
 ## Software Compatibility
 
 ### Base OS