fix(driver): force JVM exit to prevent hanging RayJobs #481

Draft
my-vegetable-has-exploded wants to merge 2 commits into ray-project:master from my-vegetable-has-exploded:jvm-exit-guard-fork

Conversation

@my-vegetable-has-exploded

Contributor

Motivation

In Spark on Ray scenarios, the driver logic may finish but the JVM cannot exit naturally because non-daemon threads (e.g. Hudi Embedded Timeline Server, connection pools, etc.) keep it alive. KubeRay RayJob relies on the entrypoint process exiting to determine job completion, so this causes RayJobs to stay in RUNNING state indefinitely.

Approach

  1. JvmExitGuard — a singleton that, when armed, starts a daemon countdown thread. If the JVM hasn't exited within the configured timeout, it forcibly calls System.exit(). If System.exit() throws IllegalStateException (JVM already in its shutdown sequence), it falls back to Runtime.getRuntime().halt() which bypasses all shutdown hooks and terminates immediately.

  2. Unified exit code constants: EXIT_SUCCESS=0, EXIT_APP_FAILED=1, EXIT_KILLED=143. All hardcoded exit codes in DriverExitState and ApplicationInfo are replaced with these constants.

  3. Arm points:

    • finalizeDriverTermination() — called on normal exit, failure, and exit function redirect
    • stop(KILLED) path in RayCoarseGrainedSchedulerBackend
  4. raydp-submit integration — forward RAYDP_JVM_EXIT_TIMEOUT environment variable as -Draydp.jvm.exit.timeout JVM property.
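
For reviewers who want a concrete picture of the approach, here is a minimal sketch of the guard described in items 1 and 2. It is not the code in this PR: the class name `ExitGuardSketch`, the injectable `exitAction`/`haltAction` hooks, the `parseTimeoutSeconds` helper, and the assumption that the timeout property is expressed in seconds are all hypothetical and chosen for illustration. Only the constant values, the 300s default, the daemon countdown thread, and the `IllegalStateException` fallback come from the description above.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

/** Hypothetical sketch of an exit guard; not the PR's actual JvmExitGuard. */
final class ExitGuardSketch {
    // Exit code constants as listed in the PR description.
    static final int EXIT_SUCCESS = 0;
    static final int EXIT_APP_FAILED = 1;
    static final int EXIT_KILLED = 143;

    // Default taken from the commit message (300s); the unit is assumed to be seconds.
    static final long DEFAULT_TIMEOUT_SECONDS = 300L;

    private final AtomicBoolean armed = new AtomicBoolean(false);
    private final IntConsumer exitAction; // normally System::exit
    private final IntConsumer haltAction; // normally Runtime.halt

    ExitGuardSketch() {
        this(System::exit, code -> Runtime.getRuntime().halt(code));
    }

    // The actions are injectable so the guard can be exercised without killing the JVM.
    ExitGuardSketch(IntConsumer exitAction, IntConsumer haltAction) {
        this.exitAction = exitAction;
        this.haltAction = haltAction;
    }

    /** Parses a timeout value, falling back to the default on missing or invalid input. */
    static long parseTimeoutSeconds(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_TIMEOUT_SECONDS;
        }
        try {
            long parsed = Long.parseLong(raw.trim());
            return parsed > 0 ? parsed : DEFAULT_TIMEOUT_SECONDS;
        } catch (NumberFormatException e) {
            return DEFAULT_TIMEOUT_SECONDS;
        }
    }

    /** Starts the countdown once; later calls are no-ops and return false. */
    boolean arm(int exitCode, long timeoutMillis) {
        if (!armed.compareAndSet(false, true)) {
            return false;
        }
        Thread countdown = new Thread(() -> {
            try {
                Thread.sleep(timeoutMillis);
            } catch (InterruptedException e) {
                return; // interrupted before the deadline: do nothing
            }
            try {
                exitAction.accept(exitCode);
            } catch (IllegalStateException e) {
                // Fallback described in the PR: halt skips shutdown hooks.
                haltAction.accept(exitCode);
            }
        }, "jvm-exit-guard");
        countdown.setDaemon(true); // the guard itself must not keep the JVM alive
        countdown.start();
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo with recording actions instead of a real System.exit.
        ExitGuardSketch guard = new ExitGuardSketch(
            code -> System.out.println("would exit with code " + code),
            code -> System.out.println("would halt with code " + code));
        guard.arm(EXIT_SUCCESS, 50L);
        Thread.sleep(200L);
    }
}
```

In a real driver the guard would be constructed with the default actions and armed from the arm points in item 3, with the timeout read from the `raydp.jvm.exit.timeout` property, for example `guard.arm(EXIT_KILLED, 1000L * ExitGuardSketch.parseTimeoutSeconds(System.getProperty("raydp.jvm.exit.timeout")))`. Making the countdown thread a daemon matters: if the JVM manages to exit on its own before the deadline, the guard does not delay it.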

epsilonwang added 2 commits May 15, 2026 10:49
- Add DriverExitState: thread-safe state machine tracking FINISHED/FAILED/KILLED with exit codes and diagnostics
- Add DriverAppMasterReporter: reports terminal state to RayAppMaster via FinishApplication RPC, idempotent with CAS-based once-only reporting
- Replace UnregisterApplication with FinishApplication containing state/exitCode/diagnostics
- Add ApplicationInfo.finish() method and exitCode/diagnostics fields
- Add RayAppMaster.finishApplication() and RayAppMasterUtils Java accessor
- Track driver exit in SparkSubmit: wrap main with try/catch, report state on exit
- Wire DriverAppMasterReporter.bind/bindMasterHandle in scheduler backend
- Handle KILLED state in RayCoarseGrainedSchedulerBackend.stop()
- Make shuffle service actors named and idempotent
- Remove obsolete TestRayCoarseGrainedSchedulerBackend tests

Signed-off-by: epsilonwang <epsilonwang@didiglobal.com>
- Add JvmExitGuard: daemon thread that forces System.exit() after a
  configurable timeout (default 300s) when the driver reaches terminal
  state, preventing RayJobs from hanging due to non-daemon threads
- Replace hardcoded exit codes in DriverExitState with JvmExitGuard
  constants (EXIT_SUCCESS=0, EXIT_APP_FAILED=1, EXIT_KILLED=143)
- Update ApplicationInfo to use JvmExitGuard.EXIT_SUCCESS
- Add JvmExitGuard.arm() call in finalizeDriverTermination and KILLED path
- Stop SparkContext in finally block for ray:// master URLs
- Add RAYDP_JVM_EXIT_TIMEOUT env var support in raydp-submit

Signed-off-by: epsilonwang <epsilonwang@didiglobal.com>
@my-vegetable-has-exploded my-vegetable-has-exploded changed the title from "Jvm exit guard fork" to "fix(driver): force JVM exit to prevent hanging RayJobs" May 18, 2026
@my-vegetable-has-exploded my-vegetable-has-exploded marked this pull request as draft May 18, 2026 02:33
@my-vegetable-has-exploded

Contributor Author

Waiting for #480.
