Skip to content

feat(driver): track and report driver exit state to AppMaster#480

Open
my-vegetable-has-exploded wants to merge 4 commits into
ray-project:masterfrom
my-vegetable-has-exploded:application-state-fork
Open

feat(driver): track and report driver exit state to AppMaster#480
my-vegetable-has-exploded wants to merge 4 commits into
ray-project:masterfrom
my-vegetable-has-exploded:application-state-fork

Conversation

@my-vegetable-has-exploded

Copy link
Copy Markdown
Contributor

Motivation

RayDP currently has no mechanism for the driver to communicate its exit state (FINISHED / FAILED / KILLED) to the RayAppMaster. The original UnregisterApplication message carries only an appId, so the AppMaster always records the application as FINISHED — even on failure or kill. This makes it impossible for upstream consumers (e.g. KubeRay RayJob status, monitoring dashboards) to determine the real outcome of a Spark application.

Approach

  • DriverExitState — a thread-safe state machine that tracks UNKNOWN → FINISHED / FAILED / KILLED, carrying exitCode and diagnostics. Once a terminal state is set it is immutable.

  • DriverAppMasterReporter — reports the terminal state to RayAppMaster via a new FinishApplication RPC. Uses AtomicBoolean.compareAndSet to guarantee exactly-once reporting (idempotent). After reporting, it also stops the AppMaster to ensure cleanup.

  • FinishApplication message — replaces the original UnregisterApplication, adding state, exitCode, and diagnostics fields. Routed via receiveAndReply so the AppMaster can return whether the finish was accepted.

  • SparkSubmit.main() refactor:

    • Wrap exitFn: non-zero exit code → trySetFailed, zero → trySetFinished
    • Wrap doSubmit() in try/catch: exception → trySetFailed
    • Unify cleanup by calling finalizeDriverTermination()tryReportAndCleanup() before exit
  • RayCoarseGrainedSchedulerBackend:

    • In onStopRequest(), set DriverExitState.trySetKilled(143, "Spark launcher requested application stop.")
    • In the stop(KILLED) path, call DriverAppMasterReporter.tryReportAndCleanup()
    • Remove the direct call to RayAppMasterUtils.stopAppMaster() in stop() — the Reporter now handles this
  • ApplicationInfo — add finish() method (synchronized, idempotent — returns false if already finished) and exitCode / diagnostics fields.

- Add DriverExitState: thread-safe state machine tracking FINISHED/FAILED/KILLED with exit codes and diagnostics
- Add DriverAppMasterReporter: reports terminal state to RayAppMaster via FinishApplication RPC, idempotent with CAS-based once-only reporting
- Replace UnregisterApplication with FinishApplication containing state/exitCode/diagnostics
- Add ApplicationInfo.finish() method and exitCode/diagnostics fields
- Add RayAppMaster.finishApplication() and RayAppMasterUtils Java accessor
- Track driver exit in SparkSubmit: wrap main with try/catch, report state on exit
- Wire DriverAppMasterReporter.bind/bindMasterHandle in scheduler backend
- Handle KILLED state in RayCoarseGrainedSchedulerBackend.stop()
- Make shuffle service actors named and idempotent
- Remove obsolete TestRayCoarseGrainedSchedulerBackend tests

Signed-off-by: epsilonwang <epsilonwang@didiglobal.com>
Revert ExternalShuffleServiceUtils.java and RayExternalShuffleService.scala
to their base (bde8b06) versions, removing changes that belong to the
idempotent-ess-fork branch:

- Remove named actor pattern (getShuffleServiceActorName, Ray.getActor,
  setName, setMaxRestarts, setMaxTaskRetries) from ExternalShuffleServiceUtils
- Restore startShuffleService() method in ExternalShuffleServiceUtils
- Remove auto-start call in RayExternalShuffleService constructor
- Restore semicolon in import statement

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds driver terminal-state reporting for RayDP applications so the RayAppMaster can distinguish FINISHED, FAILED, and KILLED outcomes instead of always treating unregister as successful completion.

Changes:

  • Adds driver exit-state tracking and an AppMaster reporting/cleanup helper.
  • Replaces UnregisterApplication with FinishApplication carrying state, exit code, and diagnostics.
  • Wires SparkSubmit and scheduler-backend lifecycle paths into the new reporting flow.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
core/raydp-main/src/main/scala/org/apache/spark/scheduler/cluster/raydp/RayCoarseGrainedSchedulerBackend.scala Binds AppMaster reporting handles and reports cleanup on KILLED stop paths.
core/raydp-main/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala Wraps submit lifecycle and exit handling to set/report driver terminal state.
core/raydp-main/src/main/scala/org/apache/spark/deploy/raydp/RayAppMaster.scala Adds AppMaster-side finish RPC handling.
core/raydp-main/src/main/scala/org/apache/spark/deploy/raydp/Messages.scala Replaces unregister message with finish message carrying terminal metadata.
core/raydp-main/src/main/scala/org/apache/spark/deploy/raydp/DriverExitState.scala Adds global synchronized driver exit-state snapshot/state machine.
core/raydp-main/src/main/scala/org/apache/spark/deploy/raydp/DriverAppMasterReporter.scala Adds exactly-once terminal state reporting and AppMaster cleanup helper.
core/raydp-main/src/main/scala/org/apache/spark/deploy/raydp/ApplicationInfo.scala Stores exit code/diagnostics and adds idempotent finish handling.
core/raydp-main/src/main/java/org/apache/spark/deploy/raydp/RayAppMasterUtils.java Adds Java helper for invoking AppMaster finish RPC.
Comments suppressed due to low confidence (1)

core/raydp-main/src/main/scala/org/apache/spark/scheduler/cluster/raydp/RayCoarseGrainedSchedulerBackend.scala:312

  • Reporting is now only triggered from the backend for the KILLED path. Since the previous stopAppMaster call was removed from stop(), in-process SparkSubmit and direct SparkContext users that do not go through SparkSubmit.main's new finalizer will stop the SparkContext without reporting FINISHED or cleaning up the AppMaster. This regresses the existing backend-owned cleanup path; normal stop should still report/cleanup or share the same finalization path.
        if (finalState == SparkAppHandle.State.KILLED) {
          DriverAppMasterReporter.tryReportAndCleanup()

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@pang-wu

pang-wu commented May 23, 2026

Copy link
Copy Markdown
Collaborator

can we add some tests?

epsilonwang added 2 commits June 2, 2026 03:20
Signed-off-by: epsilonwang <epsilonwang@didiglobal.com>
Signed-off-by: epsilonwang <epsilonwang@didiglobal.com>
@my-vegetable-has-exploded

Copy link
Copy Markdown
Contributor Author

@copilot review

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Comment on lines +59 to +62
val binding = synchronized {
if (reported.get()) None
else Some((appId, masterHandle))
}
Comment on lines +66 to +69
if (currentAppId == null || currentMasterHandle == null) {
logWarning("Skip reporting terminal application state because AppMaster binding " +
"is incomplete.")
false
Comment on lines 309 to +313
try {
super.stop() // this will stop all executors
appMasterRef.get.send(UnregisterApplication(appId.get))
if (finalState == SparkAppHandle.State.KILLED) {
DriverAppMasterReporter.tryReportAndCleanup()
}
Comment on lines +165 to +168
# The driver sleeps for 300s after the Spark action; waiting here keeps
# the test simple while still killing raydp-submit during driver runtime.
time.sleep(KILL_AFTER_SECONDS)
log = output_path.read_text(encoding="utf-8")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants