ci: drop unreliable Android emulator snapshot caching #64

Merged
gmaclennan merged 10 commits into main from claude/ci-emulator-readiness
May 11, 2026

Conversation

gmaclennan (Member) commented May 7, 2026

Summary

Drop AVD snapshot caching from the Android instrumented-test workflow, and the readiness loop it was put in place to support.

This PR originally added a wait-for-emulator probe to handle a race after a snapshot restore: sys.boot_completed=1 (restored from the saved property store) was visible before system_server had bound the settings, package, and activity services, so the next adb shell settings put failed with Broken pipe.

Iterating with the probe instrumented showed that the actual failure mode is worse: when the snapshot is restored on a different runner machine, its system_server is dead on arrival. init.svc.zygote stayed in restarting for the full 120s probe window, and the emulator-runner action's own adb shell input keyevent 82 (which runs before the user script:) was already throwing android.os.DeadSystemException. The cached snapshot was healthy when saved but is not portable across runners, likely because Vulkan/RAM guest state is coupled to the host emulator's ICD; this is consistent with the "Please update the emulator to one that supports the feature(s): Vulkan" / "Increasing RAM size to 2048MB" warnings on every restore.

With snapshot caching gone the emulator cold-boots every run (~60–90s slower). On cold boot sys.boot_completed=1 is set by system_server only after the BOOT_COMPLETED broadcast, so core services are bound by the time the action's built-in wait returns — no custom probe needed.

Net change

  • Removed the "Cache AVD snapshot" and "Create AVD and generate snapshot" steps.
  • Removed the inline adb wait-for-device … getprop sys.boot_completed loop from the test step (the action already does this).

Test plan

  • Confirm the Android instrumented-test job is green on this PR.
  • Re-run a couple of times to confirm no cold-boot regressions.
  • Confirm the job stays within its 60-minute timeout.

Generated by Claude Code

claude and others added 10 commits May 7, 2026 22:57
`sys.boot_completed` flips as soon as zygote is alive, which on a
snapshot restore happens before `system_server` finishes binding the
`settings`, `package`, and `activity` services. The previous wait-for-device
loop returned almost instantly because the property was already set, and the
next `adb shell settings put` then crashed with `cmd: Failure calling service
settings: Broken pipe (32)`.

Replace it with a probe that polls each service we're about to call until
it actually responds, with a 120s ceiling. One retry on each `settings put`
covers a residual race where `list` succeeds but a write transaction loses
to first-time service initialisation.

`disable-animations: false` is unchanged — emulator-runner's own
`input keyevent 82` path crashes the emulator on this image, which is why
the workflow drives the settings directly.

reactivecircus/android-emulator-runner v2 invokes the script with
`/usr/bin/sh` (dash on Ubuntu), not bash. The previous version used
`[[ ]]`, `local`, and `$SECONDS` which dash refuses to parse with:

  /usr/bin/sh: 1: Syntax error: end of file unexpected (expecting "}")

Replace with POSIX-only constructs: `[ ]` tests, `$(date +%s)` for
elapsed-time tracking, an inline loop with a `ready` flag instead of a
function. Verified parses + runs under dash locally.
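
The POSIX-only constructs named above can be sketched as follows. This is a minimal illustration, not the script from this PR: `wait_ready`, its arguments, and the one-second poll interval are hypothetical, and the loop is wrapped in a function only so it can be exercised without an emulator (the commit describes an inline loop).

```shell
#!/bin/sh
# Hypothetical helper: poll a probe command until it succeeds or a ceiling
# (in seconds) is reached. Uses only constructs dash accepts: `[ ]` tests,
# `$(date +%s)` instead of bash's $SECONDS, and no `local`.
wait_ready() {
  timeout_s=$1
  shift
  start=$(date +%s)
  ready=0
  while [ "$ready" -eq 0 ]; do
    if "$@" >/dev/null 2>&1; then
      ready=1
    else
      now=$(date +%s)
      if [ $((now - start)) -ge "$timeout_s" ]; then
        break
      fi
      sleep 1
    fi
  done
  # Succeed only if the probe answered before the ceiling.
  [ "$ready" -eq 1 ]
}
```

In CI the call might look like `wait_ready 120 adb shell settings list global`; the exact probe command is an assumption here, chosen because the earlier commit mentions `settings` and `list`.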

reactivecircus/android-emulator-runner v2 invokes the workflow's
`script:` line by line through `sh -c`, not as a single multi-line
script — so any function definition, while loop, or other multi-line
construct is split across separate shell invocations and fails to
parse (the previous attempts hit "expecting }" / "expecting done").

Move the entire body to scripts/run-android-instrumented-tests.sh and
have the workflow invoke it as a single line. The script can use
bash freely (proper shebang, set -euo pipefail).
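
The per-line behaviour described above can be reproduced without the action. The sketch below is an illustration under that description, not the action's actual code: it hands the first line of a three-line loop to its own `sh -c`, then hands the whole loop to a single `sh -c`, and records which one parses. The variable names are hypothetical.

```shell
#!/bin/sh
# A multi-line loop, as it might appear under `script:` in a workflow:
#   for i in 1 2 3; do
#     echo "tick $i"
#   done

# Per-line execution: the first line alone is an incomplete construct,
# so the shell rejects it before running anything.
if sh -c 'for i in 1 2 3; do' >/dev/null 2>&1; then
  per_line=ok
else
  per_line=failed
fi

# Whole-script execution: the same text in one shell invocation parses fine.
if sh -c 'for i in 1 2 3; do
  echo "tick $i"
done' >/dev/null 2>&1; then
  whole=ok
else
  whole=failed
fi

echo "per-line: $per_line, whole: $whole"
```

This is why moving the multi-line body into a script file and calling it from a single workflow line sidesteps the problem: the script file is parsed as one unit by its own interpreter.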

Scope the script down to just the multi-line wait-for-emulator-services
loop (the only part that can't survive the action's per-line `sh -c`).
Animation settings and the gradle invocation move back inline as
single-line statements; the `||` retry uses a one-line brace group,
which dash parses fine.
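
The one-line brace-group retry mentioned above might look like the sketch below. `flaky_put` is a hypothetical stand-in for an `adb shell settings put …` call that fails on its first attempt; the one-second pause is also an assumption.

```shell
#!/bin/sh
# Hypothetical stand-in for a settings write that loses the first race:
# it fails on the first call and succeeds on every later one.
attempts=0
flaky_put() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 2 ]
}

# Single-line retry: if the first call fails, pause and try exactly once more.
# The whole statement fits on one line, so it survives per-line `sh -c`.
flaky_put || { sleep 1; flaky_put; }
rc=$?

echo "rc=$rc attempts=$attempts"   # prints: rc=0 attempts=2
```

A brace group runs in the current shell, so the exit status of the statement is that of the second attempt when the first one fails.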

Run each readiness check independently each tick so we capture pm
and settings exit codes separately, then log t=Xs bc='…' pm_rc=…
cmd_rc=… every ~10s and dump init.svc.* every ~30s. Drop set -e so
an intermittent adb hiccup can't silently kill the probe.
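
A rough sketch of that per-tick logging, assuming `bc` is the value read from `sys.boot_completed`. The helper names `probe_tick` and `every` are hypothetical, and the checks are passed in as strings so the sketch runs without adb.

```shell
#!/bin/sh
# No `set -e`: a failing check should be recorded, not abort the probe.

# Hypothetical helper: run two checks independently and report each exit code.
# $1 = elapsed seconds, $2 = boot_completed value,
# $3 = package-manager check, $4 = settings check.
probe_tick() {
  pm_rc=0
  sh -c "$3" >/dev/null 2>&1 || pm_rc=$?
  cmd_rc=0
  sh -c "$4" >/dev/null 2>&1 || cmd_rc=$?
  echo "t=${1}s bc='$2' pm_rc=$pm_rc cmd_rc=$cmd_rc"
}

# Hypothetical helper: succeeds when elapsed seconds ($1) is a multiple of
# the interval ($2), for throttling log lines and property dumps.
every() {
  [ $(( $1 % $2 )) -eq 0 ]
}

# Example tick at t=30s with one healthy and one failing check.
if every 30 10; then
  probe_tick 30 1 true false   # prints: t=30s bc='1' pm_rc=0 cmd_rc=1
fi
```

In a real run the two checks would be adb calls (for example `adb shell pm path android` and `adb shell settings list global`, both assumptions here), and a separate `every "$elapsed" 30` branch could dump the `init.svc.*` properties.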

Diagnostic step — next CI run should show whether the timeout is a
genuinely-broken emulator or a probe bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous CI runs showed a clear pattern: the cached snapshot booted
but zygote immediately fell into a restart loop (init.svc.zygote =
restarting for the full 120s, pm/cmd returning DEAD_OBJECT). The
warning "Please update the emulator to one that supports the
feature(s): Vulkan" plus a forced "Increasing RAM size to 2048MB"
pointed at incompatibility between the cached snapshot and the
emulator GitHub now installs.

Two changes:
- Cache key suffix `-v2` so the bad snapshot is abandoned.
- Run wait-for-emulator.sh in the snapshot-creation step so we only
  save a snapshot from a fully-healthy userland (the action saves
  on emulator shutdown, after the script returns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cross-runner snapshot restore produces a corpse: snapshot loads,
sys.boot_completed=1, but system_server is dead on first adb call
(DeadSystemException). Same snapshot restores fine within the job
that saved it, so the rot is portability — likely the captured
Vulkan/RAM guest state is coupled to the host emulator's ICD.

Cold-booting the emulator every run trades ~60-90s for reliability.
The wait-for-emulator probe still gates the test command.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The probe was added to work around a snapshot-restore race where
sys.boot_completed=1 came from the saved property store before live
services were bound. With snapshot caching dropped, the action's
own wait on sys.boot_completed=1 is sufficient — on cold boot that
property is only set after system_server posts BOOT_COMPLETED, so
core services are already bound.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gmaclennan changed the title from "ci: wait for emulator services to be ready post-snapshot" to "ci: drop unreliable Android emulator snapshot caching" May 11, 2026
gmaclennan merged commit 09be9ec into main May 11, 2026
7 checks passed
gmaclennan deleted the claude/ci-emulator-readiness branch May 11, 2026 21:16