fix: prevent orphan VIT model processes on crash or interrupt #1256

Open
sufubao wants to merge 2 commits into main from vit

Collaborator

sufubao commented Apr 3, 2026

Summary

  • Add start_parent_check_thread() to VIT model worker processes so they self-terminate if the visual server dies, matching the pattern used by router, audio, and other server components
  • Return process handles from start_model_process() and store them in VisualManager and VisualOnlyManager
  • Implement clean_up() to kill tracked VIT processes on shutdown, with race-condition-safe exception handling
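The body of start_parent_check_thread() is not shown in this PR, so the following is only a minimal sketch of the general technique, under stated assumptions. It polls the parent PID from a daemon thread and fires a callback once the PID changes, which is what happens on Linux when the parent exits and the child is reparented. The names `start_parent_watchdog`, `on_parent_death`, `interval_s`, and `get_ppid` are hypothetical and do not come from the codebase.

```python
import os
import threading
import time


def start_parent_watchdog(on_parent_death, interval_s=1.0, get_ppid=os.getppid):
    """Hypothetical stand-in for a parent check thread (not the PR's helper).

    Records the parent PID at startup, then polls it. When the PID changes,
    the original parent is gone and the callback runs exactly once.
    """
    original_ppid = get_ppid()

    def _watch():
        while True:
            if get_ppid() != original_ppid:
                on_parent_death()
                return
            time.sleep(interval_s)

    # daemon=True so the watchdog never keeps the worker alive on its own.
    thread = threading.Thread(target=_watch, daemon=True, name="parent-watchdog")
    thread.start()
    return thread
```

In a real worker the callback would typically end the process, for example `lambda: os._exit(1)`. The polling interval bounds how long an orphaned worker can survive its parent.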

Test plan

  • Deploy visual server with VIT model processes and verify they start correctly
  • Kill the visual server process and confirm VIT workers self-terminate within ~10 seconds
  • Trigger an initialization error and verify clean_up() terminates all spawned VIT processes
  • Verify no orphan VIT processes remain after repeated crash/restart cycles

sufubao added 2 commits April 2, 2026 12:56
VIT model inference processes spawned by the visual server had no
mechanism to detect parent death and self-terminate. The process
handles were also discarded immediately after spawning, making
explicit cleanup impossible.

- Add start_parent_check_thread() to VIT model worker processes so
  they monitor the visual server and self-terminate if it dies
- Return process handles from start_model_process() and store them
  in VisualManager and VisualOnlyManager
- Implement clean_up() to kill tracked VIT processes on shutdown
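As a rough illustration of why keeping the handles matters, the sketch below stores each handle as soon as the spawn call returns, so a failure on a later rank still leaves the earlier workers reachable from clean_up(). `SketchManager`, `spawn`, and `world_size` are hypothetical names, not the real VisualManager API. The `getattr` default mirrors the reviewed code and is assumed here to guard against clean_up() running before `model_procs` was ever assigned.

```python
class SketchManager:
    """Hypothetical manager that tracks worker handles (not the real VisualManager)."""

    def __init__(self, spawn, world_size):
        # `spawn` is any callable returning a process-like handle
        # exposing is_alive(), kill(), and join(timeout=...).
        self._spawn = spawn
        self._world_size = world_size

    def start_workers(self):
        self.model_procs = []
        for rank in range(self._world_size):
            # Append immediately: if a later rank raises, earlier handles
            # are already tracked and can still be cleaned up.
            self.model_procs.append(self._spawn(rank))

    def clean_up(self):
        # Safe even if start_workers() never ran.
        procs = getattr(self, "model_procs", [])
        for proc in procs:
            try:
                if proc.is_alive():
                    proc.kill()
            except OSError:  # ProcessLookupError is a subclass of OSError
                pass
        for proc in procs:
            try:
                proc.join(timeout=5)
            except OSError:
                pass
```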
Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces process tracking and cleanup for VIT model processes within the visual server. It updates the model startup sequence to return process handles, adds a parent process check thread, and implements a cleanup method. The review feedback suggests optimizing the cleanup logic in both VisualManager and VisualOnlyManager by signaling all processes to terminate before joining them, which prevents cumulative delays during shutdown.

Comment on lines +193 to +200
        for proc in getattr(self, "model_procs", []):
            try:
                if proc.is_alive():
                    logger.info(f"Killing VIT model process {proc.pid}")
                    proc.kill()
                    proc.join(timeout=5)
            except (ProcessLookupError, OSError):
                pass

medium

The current cleanup logic kills and joins processes sequentially. If multiple processes are unresponsive, the cumulative timeout (5 seconds per process) could significantly delay the shutdown of the visual server. It is more efficient to signal all processes to terminate first, and then join them.

Suggested change
-        for proc in getattr(self, "model_procs", []):
-            try:
-                if proc.is_alive():
-                    logger.info(f"Killing VIT model process {proc.pid}")
-                    proc.kill()
-                    proc.join(timeout=5)
-            except (ProcessLookupError, OSError):
-                pass
+        procs = getattr(self, "model_procs", [])
+        for proc in procs:
+            try:
+                if proc.is_alive():
+                    logger.info(f"Killing VIT model process {proc.pid}")
+                    proc.kill()
+            except (ProcessLookupError, OSError):
+                pass
+        for proc in procs:
+            try:
+                proc.join(timeout=5)
+            except (ProcessLookupError, OSError):
+                pass

Comment on lines +136 to +143
        for proc in getattr(self, "model_procs", []):
            try:
                if proc.is_alive():
                    logger.info(f"Killing VIT model process {proc.pid}")
                    proc.kill()
                    proc.join(timeout=5)
            except (ProcessLookupError, OSError):
                pass

medium

Similar to the VisualManager, the cleanup logic here is sequential. Signaling all processes to terminate before joining them would improve shutdown efficiency, especially when dealing with multiple model processes.

        procs = getattr(self, "model_procs", [])
        for proc in procs:
            try:
                if proc.is_alive():
                    logger.info(f"Killing VIT model process {proc.pid}")
                    proc.kill()
            except (ProcessLookupError, OSError):
                pass
        for proc in procs:
            try:
                proc.join(timeout=5)
            except (ProcessLookupError, OSError):
                pass
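To make the timing argument concrete, here is a small simulation, offered as a sketch rather than a benchmark of the real server. `FakeProc` is a hypothetical stand-in for a multiprocessing.Process that takes `exit_delay` seconds to disappear after kill(). In the sequential version each teardown delay is paid in turn. In the two-phase version the delays overlap, so total time tracks the slowest process rather than the sum. If a process never exits, each join still waits for its full timeout in both versions.

```python
import time


class FakeProc:
    """Hypothetical process stand-in: exits `exit_delay` seconds after kill()."""

    def __init__(self, pid, exit_delay):
        self.pid = pid
        self.exit_delay = exit_delay
        self._dead_at = None  # monotonic time at which the process is gone

    def is_alive(self):
        return self._dead_at is None or time.monotonic() < self._dead_at

    def kill(self):
        if self._dead_at is None:
            self._dead_at = time.monotonic() + self.exit_delay

    def join(self, timeout=None):
        if self._dead_at is None:
            wait = timeout or 0.0  # never killed: block for the whole timeout
        else:
            wait = max(0.0, self._dead_at - time.monotonic())
            if timeout is not None:
                wait = min(wait, timeout)
        time.sleep(wait)


def cleanup_sequential(procs, timeout=5):
    # Kill and wait one process at a time: delays add up.
    for proc in procs:
        if proc.is_alive():
            proc.kill()
            proc.join(timeout=timeout)


def cleanup_two_phase(procs, timeout=5):
    # Signal everything first, then wait: delays overlap.
    for proc in procs:
        if proc.is_alive():
            proc.kill()
    for proc in procs:
        proc.join(timeout=timeout)


def timed(cleanup, procs):
    start = time.monotonic()
    cleanup(procs)
    return time.monotonic() - start
```

With four fake processes that each take 0.1 s to exit, `timed(cleanup_sequential, ...)` should come out near 0.4 s and `timed(cleanup_two_phase, ...)` near 0.1 s.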
