ci: drop unreliable Android emulator snapshot caching #64
Merged
`sys.boot_completed` flips as soon as zygote is alive, which on a snapshot restore happens before `system_server` finishes binding the `settings`, `package`, and `activity` services. The previous wait-for-device loop returned almost instantly because the property was already set, and the next `adb shell settings put` then crashed with `cmd: Failure calling service settings: Broken pipe (32)`. Replace it with a probe that polls each service we're about to call until it actually responds, with a 120s ceiling. One retry on each `settings put` covers a residual race where `list` succeeds but a write transaction loses to first-time service initialisation. `disable-animations: false` is unchanged — emulator-runner's own `input keyevent 82` path crashes the emulator on this image, which is why the workflow drives the settings directly.
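A minimal sketch of the probe this commit describes, not the script from the PR. It assumes `settings list global` and `pm list packages` are reasonable stand-ins for "each service we're about to call" (the commit does not list the exact commands, and the `activity` check is omitted here), and that `adb shell` propagates the remote exit status, as current adb releases do. `PROBE_TIMEOUT`, `PROBE_INTERVAL`, and both function names are hypothetical.

```shell
# Hypothetical knobs; only the 120s ceiling comes from the commit message.
PROBE_TIMEOUT=120
PROBE_INTERVAL=2

# Succeeds only when each service answers a real request,
# rather than trusting the sys.boot_completed property.
services_ready() {
    adb shell settings list global >/dev/null 2>&1 &&
        adb shell pm list packages >/dev/null 2>&1
}

# Polls services_ready until it succeeds or PROBE_TIMEOUT seconds have elapsed.
wait_for_services() {
    probe_start=$(date +%s)
    while ! services_ready; do
        probe_now=$(date +%s)
        if [ $((probe_now - probe_start)) -ge "$PROBE_TIMEOUT" ]; then
            echo "services not ready after ${PROBE_TIMEOUT}s" >&2
            return 1
        fi
        sleep "$PROBE_INTERVAL"
    done
    return 0
}
```

Calling `wait_for_services` before the first `adb shell settings put` gates the write on live services instead of on a property that a snapshot can restore early.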
reactivecircus/android-emulator-runner v2 invokes the script with `/usr/bin/sh` (dash on Ubuntu), not bash. The previous version used `[[ ]]`, `local`, and `$SECONDS`, which dash refuses to parse with: `/usr/bin/sh: 1: Syntax error: end of file unexpected (expecting "}")`. Replace with POSIX-only constructs: `[ ]` tests, `$(date +%s)` for elapsed-time tracking, an inline loop with a `ready` flag instead of a function. Verified parses + runs under dash locally.
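The POSIX-only shape described above might look roughly like this. It is a sketch under assumptions: the variable names are illustrative, and the readiness check is replaced by a counter that succeeds on the third attempt so the loop can run without an emulator. In CI that test would be the real `adb` probe.

```shell
# POSIX sh only: no [[ ]], no `local`, no $SECONDS, no function definition.
deadline=120            # ceiling in seconds
start=$(date +%s)       # replaces bash's $SECONDS
attempts=0
ready=0

while [ "$ready" -eq 0 ]; do
    attempts=$((attempts + 1))
    # Stand-in check (assumption): succeeds on the third attempt.
    # In CI this would be something like:
    #   adb shell settings list global >/dev/null 2>&1
    if [ "$attempts" -ge 3 ]; then
        ready=1
    else
        elapsed=$(( $(date +%s) - start ))
        if [ "$elapsed" -ge "$deadline" ]; then
            break
        fi
        sleep 1
    fi
done

if [ "$ready" -eq 1 ]; then
    echo "ready after $attempts attempts"
else
    echo "timed out after ${deadline}s"
fi
```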
reactivecircus/android-emulator-runner v2 invokes the workflow's `script:` line by line through `sh -c`, not as a single multi-line script — so any function definition, while loop, or other multi-line construct is split across separate shell invocations and fails to parse (the previous attempts hit "expecting }" / "expecting done"). Move the entire body to scripts/run-android-instrumented-tests.sh and have the workflow invoke it as a single line. The script can use bash freely (proper shebang, set -euo pipefail).
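The per-line behaviour can be reproduced locally without the action. The sketch below is only an illustration of the failure mode, not the action's implementation: it runs a small loop once as a whole script and once with only its first line passed to `sh -c`, which is what a line-by-line runner would do.

```shell
# A multi-line loop, as it might appear in a workflow `script:` block.
multi_line='for i in 1 2; do
  echo "tick $i"
done'

# Run as one script: the loop parses and executes.
whole_rc=0
sh -c "$multi_line" || whole_rc=$?

# Run only the first line, as a per-line runner would: the loop is incomplete,
# so the shell reports a syntax error and exits non-zero.
first_line=$(printf '%s\n' "$multi_line" | head -n 1)
first_rc=0
sh -c "$first_line" 2>/dev/null || first_rc=$?

echo "whole script rc=$whole_rc; first line alone rc=$first_rc"
```

Moving the body into scripts/run-android-instrumented-tests.sh sidesteps this, because the workflow then hands the action a single line to execute (the exact invocation is not shown in the commit message).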
Scope the script down to just the multi-line wait-for-emulator-services loop (the only part that can't survive the action's per-line `sh -c`). Animation settings and the gradle invocation move back inline as single-line statements; the `||` retry uses a one-line brace group, which dash parses fine.
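A one-line brace-group retry of the kind mentioned above could be exercised like this. The sketch makes two assumptions: `window_animation_scale` is used as an example setting (the commit only says "animation settings"), and a throwaway stub named `adb` that fails on its first call stands in for the real binary so the line can run without an emulator.

```shell
# Stub `adb` (hypothetical, demonstration only): fails once, then succeeds.
stub_dir=$(mktemp -d)
cat > "$stub_dir/adb" <<'EOF'
#!/bin/sh
count_file="$(dirname "$0")/calls"
n=$(cat "$count_file" 2>/dev/null || echo 0)
n=$((n + 1))
echo "$n" > "$count_file"
[ "$n" -gt 1 ]
EOF
chmod +x "$stub_dir/adb"

# The retry as a single line, which is how a per-line `sh -c` runner receives it.
line='adb shell settings put global window_animation_scale 0 || { sleep 1; adb shell settings put global window_animation_scale 0; }'

retry_rc=0
PATH="$stub_dir:$PATH" sh -c "$line" || retry_rc=$?
calls=$(cat "$stub_dir/calls")
rm -rf "$stub_dir"

echo "rc=$retry_rc calls=$calls"   # prints "rc=0 calls=2"
```

Because the whole `a || { b; c; }` list sits on one line, each `sh -c` invocation sees a complete command, and a single retry happens only when the first write fails.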
Run each readiness check independently each tick so we capture pm and settings exit codes separately, then log t=Xs bc='…' pm_rc=… cmd_rc=… every ~10s and dump init.svc.* every ~30s. Drop set -e so an intermittent adb hiccup can't silently kill the probe. Diagnostic step — next CI run should show whether the timeout is a genuinely-broken emulator or a probe bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
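One tick of such an instrumented probe might be sketched as follows. The function name `probe_tick` and the exact `adb` commands behind `pm_rc` and `cmd_rc` are assumptions, since the commit names only the log fields. The caller is assumed to pass the elapsed seconds and to loop without `set -e`.

```shell
# One probe tick. $1 is the elapsed time in seconds (supplied by the caller).
probe_tick() {
    t=$1
    # Boot property as reported right now; strip the CR adb may append.
    bc=$(adb shell getprop sys.boot_completed 2>/dev/null | tr -d '\r')

    # Run each check independently so both exit codes are captured.
    pm_rc=0
    adb shell pm list packages >/dev/null 2>&1 || pm_rc=$?
    cmd_rc=0
    adb shell settings list global >/dev/null 2>&1 || cmd_rc=$?

    # Summary line roughly every 10s.
    if [ $((t % 10)) -eq 0 ]; then
        echo "t=${t}s bc='${bc}' pm_rc=${pm_rc} cmd_rc=${cmd_rc}"
    fi
    # init.svc.* dump roughly every 30s; a failed dump must not end the probe.
    if [ $((t % 30)) -eq 0 ]; then
        adb shell getprop 2>/dev/null | grep 'init\.svc\.' || true
    fi

    # The tick succeeds only when both services answered.
    [ "$pm_rc" -eq 0 ] && [ "$cmd_rc" -eq 0 ]
}
```

Keeping the two exit codes separate is what makes it possible to tell a service that never comes up from a probe that checks the wrong thing.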
Previous CI runs showed a clear pattern: the cached snapshot booted but zygote immediately fell into a restart loop (init.svc.zygote = restarting for the full 120s, pm/cmd returning DEAD_OBJECT). The warning "Please update the emulator to one that supports the feature(s): Vulkan" plus a forced "Increasing RAM size to 2048MB" pointed at incompatibility between the cached snapshot and the emulator GitHub now installs. Two changes:
- Cache key suffix `-v2` so the bad snapshot is abandoned.
- Run wait-for-emulator.sh in the snapshot-creation step so we only save a snapshot from a fully-healthy userland (the action saves on emulator shutdown, after the script returns).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-runner snapshot restore produces a corpse: snapshot loads, sys.boot_completed=1, but system_server is dead on first adb call (DeadSystemException). Same snapshot restores fine within the job that saved it, so the rot is portability — likely the captured Vulkan/RAM guest state is coupled to the host emulator's ICD. Cold-booting the emulator every run trades ~60-90s for reliability. The wait-for-emulator probe still gates the test command. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The probe was added to work around a snapshot-restore race where sys.boot_completed=1 came from the saved property store before live services were bound. With snapshot caching dropped, the action's own wait on sys.boot_completed=1 is sufficient — on cold boot that property is only set after system_server posts BOOT_COMPLETED, so core services are already bound. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Drop AVD snapshot caching from the Android instrumented-test workflow, along with the readiness loop that was added to support it.
The original PR added a `wait-for-emulator` probe to handle a race where, after a snapshot restore, `sys.boot_completed=1` (restored from the saved property store) arrived before `system_server` had bound `settings`/`package`/`activity`, causing the next `adb shell settings put` to fail with `Broken pipe`.
Iterating with the probe instrumented showed the actual failure mode is worse: the restored snapshot's `system_server` is dead on arrival on a different runner machine. `init.svc.zygote` stayed in `restarting` for the full 120s probe window, and the emulator-runner action's own `adb shell input keyevent 82` (which runs before the user `script:`) was already throwing `android.os.DeadSystemException`. The cached snapshot was healthy when saved but not portable across runners (likely Vulkan/RAM guest state coupled to the host emulator's ICD, as indicated by the "Please update the emulator to one that supports the feature(s): Vulkan" / "Increasing RAM size to 2048MB" warnings on every restore).
With snapshot caching gone, the emulator cold-boots every run (~60–90s slower). On cold boot `sys.boot_completed=1` is set by `system_server` only after the BOOT_COMPLETED broadcast, so core services are bound by the time the action's built-in wait returns; no custom probe is needed.
Net change
- Remove the `Cache AVD snapshot` and `Create AVD and generate snapshot` steps.
- Remove the `adb wait-for-device … getprop sys.boot_completed` loop from the test step (the action already does this).
Test plan
Generated by Claude Code