Clean up socket file after stopping running AppHost instance (#17587)#18296
Conversation
Fixes issue where aspire stop leaves socket files behind, causing subsequent aspire add/describe commands to fail with connection timeouts. Resolves #17587: aspire add fails after aspire run --detach and aspire stop The socket file was not being deleted after successfully stopping an AppHost instance. Subsequent commands would attempt to connect to the stale socket path, resulting in timeout errors. Now we explicitly delete the socket file after the instance has been stopped and monitored for process termination. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 18296Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 18296" |
There was a problem hiding this comment.
Pull request overview
This PR fixes issue #17587 where aspire add fails after running aspire run --detach followed by aspire stop. The root cause was that the backchannel socket file was left on disk after stopping an AppHost instance, causing subsequent CLI commands to discover the stale socket and attempt to connect to the now-dead process.
Changes:
- Added best-effort socket file cleanup in
RunningInstanceManager.StopRunningInstanceAsyncafter an instance has been confirmed stopped, preventing stale socket connection errors on subsequent commands.
Show a summary per file
| File | Description |
|---|---|
src/Aspire.Cli/Projects/RunningInstanceManager.cs |
Adds socket file deletion after successful instance stop, with appropriate exception handling for IOException/UnauthorizedAccessException consistent with existing orphan cleanup patterns in AppHostHelper.cs. |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 0
|
I think an E2E test is needed here. How are sockets usually cleaned up? Does What happens if the file can't be deleted? |
PR Testing Report — #18296PR Information
Artifact Version Verification
Changes Analyzed
Test Scenarios ExecutedScenario 1 — Socket file removed after
|
| Scenario | Status |
|---|---|
| Socket deleted after stop | ✅ Passed |
| No stale-socket hang on next command | ✅ Passed |
| Stop with nothing running | ✅ Passed |
Overall Result
✅ PR VERIFIED — the socket file is removed on stop, and follow-up commands no longer encounter a stale socket. Behavior matches the fix described for #17587.
Tested with the dogfood CLI built from the PR head. Empty single-file AppHost used so the scenario exercises socket lifecycle without external container dependencies.
Adds a deterministic, cross-platform unit test for the #17587 fix in RunningInstanceManager.StopRunningInstanceAsync. The test stands up a real Unix-domain-socket auxiliary backchannel server whose AppHost process has already exited, drives a successful stop, and asserts the socket file is deleted afterward. Without the fix the file lingers and the test fails. An E2E test cannot guard this on Linux: the AppHost self-deletes its socket on graceful shutdown and the CLI prunes dead-PID sockets, so both paths self-heal. The regression only surfaces on Windows via PID reuse, hence the unit-level guard. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Re: E2E test proposal for the socket-cleanup fixI tried the E2E route first and it turns out an E2E test cannot guard this fix on Linux, so I added a deterministic cross-platform unit test instead (pushed as Why an E2E test gives false confidence hereI wrote an E2E test ( Root cause: on Linux two independent paths self-heal the socket regardless of the fix:
The #17587 regression only actually bites on Windows, where the dead AppHost's PID gets reused, so the orphan-pruning heuristic thinks the process is still alive, the stale socket is treated as live, and the next The unit test (the real guard)
Verified both directions:
I removed the E2E test (kept only as a local experiment) to avoid false confidence; the unit test is committed to this branch. |
|
Retrying the failed CI jobs for this pull request from the CI run attempt. The rerun is being tracked in the rerun attempt. |
JamesNK
left a comment
There was a problem hiding this comment.
How are sockets usually cleaned up? Does
aspire run(non-detached) remove them? Is there code that could be reused, or even better they both use the same code path to clean up.What happens if the file can't be deleted?
Pending questions.
I want to understand if this is fixing the underlying problem or is a bandaid on top.
Move the leaked-socket cleanup to the aspire stop command itself (the command that leaks the socket), deleting by exact path once the AppHost process has been confirmed stopped. Consolidate the best-effort delete into AppHostHelper.TryDeleteSocketFile so the stop command and RunningInstanceManager share one code path. The existing orphan-pruning logic remains as the backstop for hard crashes where aspire stop never runs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@JamesNK Good push on "root cause vs bandaid" — you were right that the original change was treating the symptom. I dug into the cleanup paths and reworked it. Pushed in How sockets are cleaned up todayThere are three existing mechanisms, none of which were in the original PR's location:
Does non-detached Why it actually leaks (the real root cause)The socket filename encodes the AppHost PID, and the CLI orphan-pruner decides "is this stale?" via PID liveness. On Windows the dead AppHost's PID gets reused, so The kicker I found: What changed
What happens if the file can't be deleted?
Tests
On the E2E question from earlier: I left the deterministic unit/command tests as the guard rather than E2E, because on Linux two paths self-heal the socket regardless of the fix (so an E2E passes even with the fix removed) — the regression only truly bites on Windows via PID reuse, which the Linux E2E harness can't reproduce. |
…cket-cleanup-after-stop-2 # Conflicts: # src/Aspire.Cli/Projects/RunningInstanceManager.cs
This comment has been minimized.
This comment has been minimized.
Consolidate the remaining CLI-side socket delete (PruneOrphanedSockets) onto AppHostHelper.TryDeleteSocketFile so all CLI socket cleanup flows through one path. The helper now returns a bool so the pruner keeps an accurate deleted count, and the previously silent catch in the pruner now logs at Debug like the other cleanup sites. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
Replace the real-process-then-reuse-its-exited-PID approach with a guaranteed-nonexistent PID (int.MaxValue - 9), matching the sibling StopCommandTests. Reusing a real exited process's PID is flaky: the OS can recycle that PID for an unrelated live process before the monitor checks, causing it to loop the full timeout and fail the cleanup assertion. This also removes the StartAndWaitForProcessToExit helper and the System.Diagnostics dependency. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tests selector (audit mode)The full test matrix and all jobs still run in audit mode. The tests and jobs below are what selective CI would run under enforcement. 47 / 97 test projects · 6 jobs, from 23 changed files. Selected test projects (47 / 97)
Selected jobs (6)
How these were chosen — grouped by what changed
🧪 show 32
🔧 📦 affected project 📦 affected project 🧪 🧪 Job reasons
Selection computed for commit |
After aspire stop confirms the AppHost process has terminated, it now removes the backchannel socket file. This prevents stale-socket errors when running subsequent commands (aspire add, aspire describe, etc.) after the stop/detach workflow. Documents the fix from microsoft/aspire#18296 (fixes #17587). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Pull request created: #1283
|
|
📝 Documentation has been drafted in microsoft/aspire.dev#1283 targeting Updated Triggered signals: Note This draft PR needs human review before merging. |
Description
Fixes #17587 —
aspire addfails afteraspire run --detachfollowed byaspire stop.Root cause
After
aspire stopsuccessfully stopped an AppHost instance, the backchannel socket file was left on disk. Subsequent commands (aspire add,aspire describe, etc.) discovered the stale socket and attempted to connect to the now-dead process, producing connection timeouts.Fix
RunningInstanceManager.StopRunningInstanceAsyncnow explicitly deletes the socket file after the instance has been stopped and confirmed terminated. The delete is best-effort and swallowsIOException/UnauthorizedAccessException(a concurrent prune or permission quirk should not fail the stop).User impact
Users can run
aspire run --detach,aspire stop, and immediately runaspire add(or any backchannel command) without stale-socket connection errors.Fixes #17587