Run Windows-only tests on a GitHub Actions runner (on demand) #1245
Windows-only tests ([WindowsTest]) compile on macOS/Apple-Silicon but skip at runtime. This skill runs them on a local UTM Windows 11 ARM VM over loopback SSH and returns results synchronously, so they're drivable from the sandbox.
- setup.sh: one-time orchestrator (installs UTM, fetches ISO, serves provision.ps1 to the guest, configures the loopback SSH port forward, verifies the toolchain). Pauses only at the two GUI steps UTM has no CLI for.
- run.sh: per-run entry point. Resolves VM state (absent -> exit 3, stopped -> utmctl start + wait, running), rsyncs source (no build artifacts), runs dotnet test --filter, streams output back.
- provision.ps1: in-guest setup (OpenSSH, .NET 8 SDK, rsync, authorized key).
- SKILL.md: triggers on Windows-only/skipped tests; documents the exit-code contract.

First-run-validate quality: the VM-dependent paths haven't been exercised against a real VM yet.
Verified: exit-code branches, syntax, exec bits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop UTM (GUI-locked creation) for qemu-system-aarch64 directly, so VM creation and Windows install are fully scripted and re-runnable:
- lib.sh: shared config + QEMU arg assembly (hvf, edk2 pflash, NVMe disk, virtio-net with loopback hostfwd, swtpm TPM 2.0, headless).
- setup.sh: installs qemu/swtpm, builds firmware vars + disk, fetches virtio-win, builds an autounattend answer ISO, boots the unattended installer, waits for SSH.
- autounattend.xml: hands-free Win11 ARM64 install (TPM via swtpm), creates dev user with autologon, runs provision.ps1 from the answer CD on first logon.
- provision.ps1: installs virtio-net driver, OpenSSH + authorized key, .NET 8 SDK, rsync.
- start-vm.sh: idempotent headless boot + SSH wait.
- run.sh: ensure VM up, rsync source (no build artifacts), dotnet test --filter.

Only non-automated step: acquiring the MS-gated Windows ARM64 ISO.

Verified: bash syntax (all), autounattend.xml well-formed, run.sh exit-2/exit-3 branches, firmware path resolves.
Unverified (needs ~30-min install on real hardware): the install/provision path end-to-end.
Likely first-pass fixes: autounattend edition name, virtio-net ARM64 driver path, rsync cygwin dest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Removes the last manual step: when no ISO is present, setup.sh now calls fetch-iso.sh, which resolves the latest non-Insider Windows 11 ARM64 build from the UUP dump API, downloads the UUP payload from Microsoft's update servers, and converts it to an ISO locally (aria2 + wimlib + cdrtools). Setup is now end-to-end clickless.

Verified: bash syntax (all), the UUP build-selection jq filter (picks newest non-Insider arm64).
Unverified (needs network + ~5 GB + real run): the get.php package params and the converter invocation; both are flagged first-run-validate in fetch-iso.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First live run got past the UUP API resolve (build id resolved fine) but the converter aborted: it hard-requires chntpw, which can't build on Apple Silicon (the sidneys tap's openssl@1.0 fails its test suite, a known EC-curve bug).

Fix: do the API resolve + package download on the Mac, then run uup_download_linux.sh inside debian:bookworm where aria2/cabextract/wimtools/chntpw/genisoimage install via apt. The converter's host arch is irrelevant to the ARM64 Windows payload. Requires Docker (checked up front).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The raw-QEMU local VM was abandoned: brew's edk2 firmware on Apple Silicon only enumerates USB storage and won't boot the Windows installer (NVMe/virtio block never appear; drops to UEFI shell). Documented in project memory.

Instead, run Windows-only tests on a windows-latest runner:
- .github/workflows/windows-test.yml: setup-dotnet (global.json 8.0.413), then dotnet test Octopus.Tentacle.Tests.Integration --framework net8.0 --filter <filter>. Triggered by push to this branch, or workflow_dispatch (filter input) once on main.
- skills/run-windows-tests: run.sh dispatches the workflow and `gh run watch`es it to a pass/fail exit; SKILL.md documents the gh-driven loop.

Removed all QEMU scaffolding. github.com is reachable from the sandbox (gh needs sandbox disabled for TLS), so this loop is drivable end-to-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dPipes

The previous filter (CancelThenAbandon_...) matched nothing on main, where the test is named CancellationToken_WhenGrandchildHoldsRedirectedPipes_ShouldNotHang. The shared substring matches the Windows grandchild test on both main and the EFT-3295 branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the push trigger (CLI-triggered only) and the default filter (filter is now a required workflow_dispatch input, passed via env to avoid script injection; run.sh errors if no filter is given). Dispatch-only means the workflow must live on the default branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rhysparry
left a comment
Nothing blocking, but a few possible improvements. Action versions are probably worth fixing.
gh workflow run windows-test.yml -f filter="$FILTER" --ref "$REF"

# Give GitHub a moment to register the run, then find and watch it.
sleep 5
I'm curious whether this will be flaky in practice, but can be addressed if it proves to be an issue.
Yeah... Claude, is there a better way to do this than a sleep?
(A local QEMU VM was tried and abandoned: brew's edk2 firmware on Apple Silicon only
enumerates USB storage and won't boot the Windows installer. See the project memory.)
Not sure how helpful this is to the agent. I don't think memory is shared in the repo, right?
## How to run

```bash
.claude/skills/run-windows-tests/run.sh "Name~WhenGrandchildHoldsRedirectedPipes"
```
Is it clear that this is an example?
Claude, make it clear this is an example test name; you don't need to pass this one specifically. Tell the user (which is you) how to construct this parameter.
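A minimal sketch of how such a filter argument might be constructed, assuming the usual `dotnet test --filter` expression syntax (`Name~Substring` for a contains match, `|` to join alternatives). The helper name `build_name_filter` is hypothetical and is not part of run.sh.

```shell
# build_name_filter SUBSTRING...
# Hypothetical helper: turns one or more test-name substrings into a
# `dotnet test --filter` expression that matches any of them.
build_name_filter() {
  local filter="" part
  for part in "$@"; do
    # Prefix with "<existing>|" only when something has already been added.
    filter="${filter:+$filter|}Name~$part"
  done
  printf '%s\n' "$filter"
}

# Example usage (commented out because it dispatches a real workflow run):
# .claude/skills/run-windows-tests/run.sh "$(build_name_filter WhenGrandchildHoldsRedirectedPipes)"
```

The expression has to stay quoted when passed on, because an unquoted `|` would be read by the shell as a pipe.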
gb-8
left a comment
Claude found some things (see below).
I'm not sure how problematic the hanging test issue is. Is there a cost implication? Or a throughput implication?
Other than that, they don't seem crucial, so I'm ✅ and leaving it up to you.
PR Review: #1245 — Run Windows-only tests on a GitHub Actions runner (on demand)
Author: Jim Pelletier | Base: main | Files: 3 | +110 / -0
Overview
A neat piece of dev tooling that closes a real pain point: [WindowsTest] tests skip entirely on macOS/Apple Silicon, leaving a slow push-to-CI feedback loop as the only option. This PR adds a manual workflow_dispatch GitHub Actions job and a Claude skill (run.sh + SKILL.md) to dispatch it and stream results.
The approach is well-scoped — dispatch-only, no accidental push/pull_request triggers, filter is mandatory.
What works well
- Security: passing the filter through env: rather than inlining ${{ inputs.filter }} into the run: block is exactly right — prevents script injection.
- set -euo pipefail and ${1:?...} in run.sh are good defensive defaults.
- SKILL.md is unusually thorough — the "Common mistakes" section and the workflow_dispatch-on-default-branch gotcha are the kind of thing that would bite repeatedly without documentation.
- actions/setup-dotnet@v4 with global-json-file: global.json correctly pins the SDK version.
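The first two points above can be illustrated with a small sketch. It uses `sh` as a stand-in for whatever shell the runner's step uses, and the function names and payload are hypothetical. The point is only that text pasted into a script is parsed as code, while a value read from an environment variable stays data.

```shell
# Unsafe pattern: the untrusted value is pasted into the script text
# before the shell parses it (analogous to inlining ${{ inputs.filter }}).
run_inlined() {
  local script
  script="echo \"filter=$1\""
  sh -c "$script"
}

# Safe pattern: the script text is fixed; the value arrives via the
# environment and is only ever expanded inside double quotes.
run_via_env() {
  FILTER="$1" sh -c 'printf "filter=%s\n" "$FILTER"'
}

# Argument guard in the style of ${1:?...}: aborts with a message when
# no filter is supplied, otherwise echoes the filter back.
require_filter() {
  : "${1:?usage: run.sh <dotnet-test-filter>}"
  printf '%s\n' "$1"
}

# A hypothetical hostile input that closes the quote and adds a command.
payload='x"; echo INJECTED; echo "'
# run_inlined "$payload"   # runs the smuggled `echo INJECTED`
# run_via_env "$payload"   # prints the payload verbatim as one line
```

With the inlined form the smuggled command executes; with the env form the same input is printed back unchanged.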
Issues and suggestions
- Race condition in run.sh (medium risk)
sleep 5
RID="$(gh run list --workflow windows-test.yml --branch "$REF" --limit 1 --json databaseId --jq '.[0].databaseId')"
Sleeping for 5 seconds and then taking the most recent run ID is fragile in two ways:
- GitHub is occasionally slow to register a dispatch → the listed run may be a previous one on the same branch.
- A parallel dispatch (another dev, a retry) → you watch the wrong run.
The --json flag on gh workflow run reads workflow inputs from stdin rather than emitting the new run's URL, so the cleaner fix is to record the time before dispatch and filter by createdAt:
BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
gh workflow run windows-test.yml -f filter="$FILTER" --ref "$REF"
sleep 8
RID="$(gh run list --workflow windows-test.yml --branch "$REF" --limit 5 \
  --json databaseId,createdAt \
  --jq "[.[] | select(.createdAt > \"$BEFORE\")] | .[0].databaseId")"
- No timeout-minutes on the workflow job
A hanging test (or a filter that unexpectedly matches more tests than intended) can run until it hits the GitHub Actions default 6-hour job limit. Consider:
jobs:
  windows-test:
    runs-on: windows-latest
    timeout-minutes: 30
- No permissions block
The other workflow in this repo (approve-renovate-pull-request.yml) explicitly declares permissions. Even if only reads are needed, a restrictive permissions: block is good practice and communicates intent:
permissions:
  contents: read
- --ref "$REF" fails confusingly if the workflow isn't on main
SKILL.md documents this limitation well, but run.sh will fail with a confusing gh error if run before the PR is merged. A guard at the top of run.sh would make the failure message actionable:
# Verify the workflow exists on the default branch
DEFAULT=$(gh repo view --json defaultBranchRef --jq '.defaultBranchRef.name')
if ! gh api "repos/{owner}/{repo}/contents/.github/workflows/windows-test.yml?ref=$DEFAULT" &>/dev/null; then
  echo "ERROR: windows-test.yml is not yet on '$DEFAULT'; merge this PR first." >&2
  exit 1
fi
Minor nits
- run.sh line 22: # Give GitHub a moment to register the run, then find and watch it. — the comment implies this is expected to be reliable; worth softening ("may need a longer sleep on a busy runner").
- The SKILL.md references "project memory" for the QEMU/edk2 story — fine for internal use, but that memory file isn't in this PR. Not a blocker.
Summary
Solid PR. The core design (dispatch-only, mandatory filter, env-based injection guard) is correct and the documentation is genuinely good. The main things worth fixing before merge are the 4 items above.
Co-authored-by: Rhys Parry <rhys.parry@octopus.com>
Co-authored-by: Rhys Parry <rhys.parry@octopus.com>
- workflow: add `permissions: contents: read` and `timeout-minutes: 30` (caps a hanging test at 30 min instead of the 6-hour default; answers the cost/throughput question).
- run.sh: replace the blind `sleep 5` + take-latest with a bounded poll that matches the run we just dispatched by creation time (handles slow registration and parallel dispatches); add a guard that fails with an actionable message if the workflow isn't on the default branch yet, instead of a confusing gh error.
- SKILL.md: remove the project-memory reference (that file isn't in the repo); rewrite the "How to run" section so the filter is clearly an example and explain how to construct one.

(actions/checkout@v6 and actions/setup-dotnet@v5 already applied via review suggestions.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
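The bounded poll described in this commit might look roughly like the sketch below. This is not the actual run.sh: the function names, attempt count and delay are assumptions, and the lookup reuses the `gh run list` query from the review with `>=` so that a run created in the same second as the timestamp is not missed.

```shell
# poll_for_run_id MAX_ATTEMPTS DELAY_SECONDS LOOKUP_CMD [ARGS...]
# Calls LOOKUP_CMD until it prints a usable ID, or gives up after
# MAX_ATTEMPTS tries. Prints the ID and returns 0 on success, 1 otherwise.
poll_for_run_id() {
  local max_attempts="$1" delay="$2" attempt=1 id
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    id="$("$@" 2>/dev/null || true)"
    # jq prints "null" when no run matches yet, so treat that as "not found".
    if [ -n "$id" ] && [ "$id" != "null" ]; then
      printf '%s\n' "$id"
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical lookup: newest run on REF created at or after SINCE (UTC, ISO 8601).
lookup_dispatched_run() {
  local ref="$1" since="$2"
  gh run list --workflow windows-test.yml --branch "$ref" --limit 5 \
    --json databaseId,createdAt \
    --jq "[.[] | select(.createdAt >= \"$since\")] | .[0].databaseId"
}

# Example usage (commented out because it dispatches a real workflow run):
# BEFORE="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# gh workflow run windows-test.yml -f filter="$FILTER" --ref "$REF"
# RID="$(poll_for_run_id 12 5 lookup_dispatched_run "$REF" "$BEFORE")" || exit 1
# gh run watch "$RID" --exit-status
```

Passing the lookup as a command keeps the retry loop independent of `gh`, so it can be exercised with a stub. If two dispatches land on the same ref within the window, this sketch still picks the newest one, which may not be yours.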
if ! gh api "repos/{owner}/{repo}/contents/.github/workflows/windows-test.yml?ref=$DEFAULT" >/dev/null 2>&1; then
  echo "ERROR: windows-test.yml is not on the default branch ('$DEFAULT') yet, so workflow_dispatch can't see it. Merge this branch to '$DEFAULT' first." >&2
  exit 1
fi
It only guarded the one-time pre-merge state; once the workflow is on the default branch it always passes, so it's just an extra gh round-trip on every run. Removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Background
Some of our integration tests are Windows-only ([WindowsTest]), which Claude and I obviously can't run on my Mac. The only option currently is to push to a PR and let CI run, but that's a long feedback loop. This PR adds a new GitHub Actions-based mechanism to shorten the loop for agents.
Results
Adds .github/workflows/windows-test.yml: a workflow_dispatch-only job on windows-latest that takes a required filter and runs dotnet test against Octopus.Tentacle.Tests.Integration (--framework net8.0 --filter <filter>). No push or pull_request trigger, so it only runs when someone asks.
Also adds a run-windows-tests skill under .claude/skills/ that dispatches the workflow and watches it to a pass/fail.
I've run it end to end: CancellationToken_WhenGrandchildHoldsRedirectedPipes_ShouldNotHang passed on the runner in 543ms, about 5 minutes start to finish.
How to review this PR
Reducing risk
This is dev tooling only; I can't see any real risk.
[JIM_BOT.EXE v2.13]