Free orphaned proxy port on stop and rm #5394

Open
gkatz2 wants to merge 1 commit into stacklok:main from gkatz2:fix/orphan-proxy-on-stop-rm-5393
Conversation

gkatz2 (Contributor) commented May 28, 2026

Summary

When a workload's status file is missing, thv stop and thv rm report success but leave the workload's proxy process running and holding its port. The proxy-stop path kills the proxy by the PID recorded in the status file, so with the file gone nothing is killed:

  • After thv stop, the surviving supervisor restarts the container, so the workload returns to running on its own.
  • After thv rm, the container is removed but the orphaned proxy keeps holding the port, so it cannot be reused without killing the process by hand.

This makes stop and rm fall back to the existing port-based cleanup when the PID-based stop finds no proxy to kill, so the proxy is terminated and the port freed even when the status file is missing. The fallback reuses freePortHolderIfNeeded (already used on the restart path), which only kills a process verified to be this workload's own proxy.

  • Make stopProcess / stopProxyIfNeeded report whether a tracked proxy was actually killed.
  • Thread the already-loaded runConfig into the container stop/delete paths so the fallback knows the proxy port.
  • When the PID-based stop fails for a non-auxiliary workload, run the port-based cleanup as a backstop.
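The gating described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/workloads/manager.go: the `workload` and `stopper` types, the `stopProxy` method, the returned path labels, and the "port 0 means unknown" rule are all hypothetical stand-ins. Only the overall shape (PID-based stop first, port-based cleanup as a backstop for non-auxiliary workloads) comes from the description.

```go
package main

import "fmt"

// workload is a hypothetical stand-in for the state the manager has on hand.
type workload struct {
	name      string
	auxiliary bool
	proxyPort int // taken from the run config; 0 means unknown (assumption)
}

// stopper bundles the two cleanup strategies so they can be faked in tests.
type stopper struct {
	// stopByPID reports whether a tracked proxy was actually killed.
	stopByPID func(name string) bool
	// freePort terminates the verified proxy holding the port, if any.
	freePort func(name string, port int) error
}

// stopProxy tries the PID-based stop first and runs the port-based cleanup
// only when nothing was killed. It returns a label for the path that ran.
func (s stopper) stopProxy(w workload) (string, error) {
	if s.stopByPID(w.name) {
		return "pid", nil // normal path: the fallback never runs
	}
	if w.auxiliary || w.proxyPort == 0 {
		return "none", nil // nothing safe to fall back to
	}
	if err := s.freePort(w.name, w.proxyPort); err != nil {
		return "port", fmt.Errorf("port-based cleanup for %s: %w", w.name, err)
	}
	return "port", nil
}

func main() {
	s := stopper{
		stopByPID: func(string) bool { return false }, // status file missing
		freePort:  func(string, int) error { return nil },
	}
	path, err := s.stopProxy(workload{name: "fetch", proxyPort: 8080})
	fmt.Println(path, err)
}
```

Returning a boolean from the PID-based stop keeps the fallback off the normal path, which matches the claim that the existing stop/rm behavior is unchanged when the status file is present.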

Fixes #5393

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Manual testing on macOS + OrbStack with a real container workload (fetch):

  • Reproduced the bug: with the status file moved aside, thv stop left the supervisor process alive (verified by PID) and it recreated the container — a new container ID and new StartedAt — returning the workload to running; thv rm left the orphaned proxy holding the port.
  • With the fix: both thv stop and thv rm terminate the proxy (PID gone) and free the port.
  • Confirmed the normal path (status file present) is unchanged: the proxy is stopped by PID and the port-based fallback does not run.
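One way to check the "port freed" result from the manual test is to try binding the port. The sketch below is an illustration only and is not part of the PR: `portIsFree` and `holdPort` are hypothetical helpers, and the check assumes the proxy listens on an address that conflicts with a loopback bind.

```go
package main

import (
	"fmt"
	"net"
)

// portIsFree reports whether a TCP listener can bind 127.0.0.1:port right now.
func portIsFree(port int) bool {
	ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	_ = ln.Close()
	return true
}

// holdPort binds an OS-chosen loopback port, standing in for an orphaned
// proxy. The returned function releases the port again.
func holdPort() (int, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, nil, err
	}
	port := ln.Addr().(*net.TCPAddr).Port
	return port, func() { _ = ln.Close() }, nil
}

func main() {
	port, release, err := holdPort()
	if err != nil {
		fmt.Println("could not bind:", err)
		return
	}
	fmt.Println("free while held:", portIsFree(port))
	release()
	fmt.Println("free after release:", portIsFree(port))
}
```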

The added unit tests fail without the fix and pass with it.

Does this introduce a user-facing change?

Yes. thv stop and thv rm now reliably stop the workload's proxy and free its port even when the workload's status file is missing, instead of leaving an orphaned proxy that holds the port (and, for stop, restarts the container).

Special notes for reviewers

  • The fallback is gated: it runs only when the PID-based stop returns false (no tracked PID, or the kill failed). The normal stop/rm path is unchanged — no added latency, no behavior change.
  • The kill is identity-verified: freePortHolderIfNeeded calls process.IsToolHiveProxyForWorkload to confirm that the process on the port is this workload's thv start <name> proxy before killing it, so it cannot touch an unrelated process or another workload's proxy.
  • Limitation: if runner.LoadState itself fails (the run config is gone, not just the status file), the proxy port cannot be recovered. In the reported scenario only the status file is missing, so LoadState succeeds and the port is recoverable.
  • The deeper question of why a status file goes missing is out of scope here; this fixes the resulting inability to stop/remove the workload.
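To make the identity check in the notes above concrete, here is a rough sketch of what matching a process's command line against thv start <name> could look like. It is not the implementation of process.IsToolHiveProxyForWorkload, which is not shown in this PR description. The function name `looksLikeProxyFor` and the exact argument layout it accepts are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// looksLikeProxyFor reports whether argv resembles "thv start ... <name>".
// Assumption: the binary is named "thv", the subcommand is argv[1], and the
// workload name appears as a standalone argument after it.
func looksLikeProxyFor(argv []string, name string) bool {
	if len(argv) < 3 || filepath.Base(argv[0]) != "thv" {
		return false
	}
	if argv[1] != "start" {
		return false
	}
	for _, arg := range argv[2:] {
		if arg == name {
			return true
		}
	}
	return false
}

func main() {
	own := []string{"/usr/local/bin/thv", "start", "fetch"}
	other := []string{"/usr/local/bin/thv", "start", "github"}
	unrelated := []string{"/usr/bin/python3", "-m", "http.server"}

	fmt.Println(looksLikeProxyFor(own, "fetch"))
	fmt.Println(looksLikeProxyFor(other, "fetch"))
	fmt.Println(looksLikeProxyFor(unrelated, "fetch"))
}
```

The point of a check like this is that holding the port is not enough to be killed: the process must also identify itself as this workload's proxy.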

Generated with Claude Code

When a workload's status file is missing, thv stop and thv rm left
the proxy process running and holding the workload's port. The
proxy-stop path terminates the proxy by the PID recorded in the
status file, so with the file gone nothing was killed. On stop the
surviving supervisor then restarted the container, so the workload
would not stay stopped; on rm the orphaned proxy kept the port, so
it could not be reused without killing the process by hand.

Fixes stacklok#5393

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
github-actions bot added the size/M (Medium PR: 300-599 lines changed) label May 28, 2026
codecov bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 93.10345% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.84%. Comparing base (374d452) to head (5c9ff56).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
pkg/workloads/manager.go 93.10% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5394      +/-   ##
==========================================
+ Coverage   68.83%   68.84%   +0.01%     
==========================================
  Files         628      628              
  Lines       63900    63911      +11     
==========================================
+ Hits        43985    44001      +16     
- Misses      16658    16665       +7     
+ Partials     3257     3245      -12     


Development

Successfully merging this pull request may close these issues.

thv stop/rm leave the port held when the status file is missing