fix(shuffle): make shuffle service actors idempotent and enable fault tolerance #472
Prevent duplicate shuffle service actors on the same node by assigning unique names and reusing existing actors. Also enable fault tolerance with max restarts and task retries, and remove the now-unnecessary explicit start call. Signed-off-by: epsilonwang <epsilonwang@didiglobal.com>
Pull request overview
This PR makes the RayDP external shuffle service actor named and reusable per node to avoid duplicate shuffle service actors, and enables Ray-level fault recovery by configuring unlimited restarts and task retries.
Changes:
- Create/reuse a named shuffle service actor via `Ray.getActor(name)` before creating a new actor.
- Auto-start the shuffle service inside `RayExternalShuffleService` construction and remove the explicit start call path from `RayAppMaster`.
- Configure the shuffle service actor with `setMaxRestarts(-1)` and `setMaxTaskRetries(-1)` for fault tolerance.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| core/raydp-main/src/main/scala/org/apache/spark/deploy/raydp/RayExternalShuffleService.scala | Auto-start shuffle service upon actor construction. |
| core/raydp-main/src/main/scala/org/apache/spark/deploy/raydp/RayAppMaster.scala | Simplifies shuffle-service startup logic during executor registration (no explicit start task). |
| core/raydp-main/src/main/java/org/apache/spark/deploy/raydp/ExternalShuffleServiceUtils.java | Introduces named-actor reuse and configures actor restart/task retry behavior. |
- Mark `RayExternalShuffleService.start()` as final to prevent the pitfall of calling an overridable method from a constructor
- Change log message from 'Starting shuffle service' to 'Ensuring shuffle service' to accurately reflect that the actor may already exist
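The constructor pitfall mentioned in this commit can be reproduced in a few lines. The following is a minimal sketch with hypothetical classes (`OverridableService`, `CustomService`, `SafeService`), not the RayDP source: when a constructor calls an overridable `start()`, a subclass override runs before the subclass's own fields are initialized.

```java
// Hypothetical classes for illustration only; these are not taken from RayDP.
public class ConstructorStartDemo {

    /** Risky pattern: the constructor calls a method that subclasses may override. */
    public static class OverridableService {
        public final StringBuilder log = new StringBuilder();

        public OverridableService() {
            start(); // dynamically dispatched to the subclass override, if one exists
        }

        public void start() {
            log.append("base-start");
        }
    }

    /** The override reads a field whose initializer has not run yet. */
    public static class CustomService extends OverridableService {
        // String.valueOf(...) is not a compile-time constant, so the value is not inlined.
        private final String port = String.valueOf(7337);

        @Override
        public void start() {
            // Called from the superclass constructor, before 'port' is assigned.
            log.append("custom-start:port=").append(port);
        }
    }

    /** Safer pattern: start() is final, so the constructor always runs this body. */
    public static class SafeService {
        public final StringBuilder log = new StringBuilder();

        public SafeService() {
            start();
        }

        public final void start() {
            log.append("base-start");
        }
    }

    public static void main(String[] args) {
        System.out.println(new CustomService().log); // custom-start:port=null
        System.out.println(new SafeService().log);   // base-start
    }
}
```

Declaring `start()` final removes the dynamic dispatch, so the constructor can only ever run the base implementation.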
Need to retrigger CI.
@my-vegetable-has-exploded Thanks for the proposal. Can you elaborate on what problem we are trying to solve? We do want to support two Spark cluster instances running on the same Ray cluster, and in that case we do want two actors.
Thanks for the review, @pang-wu.
The main problem is that the ESS can't restart automatically if a pod restarts. So we add set
I think it would be great if we could support two or more Spark cluster instances running on the same Ray cluster. In my opinion, it still makes sense for one pod to have only one ESS. Ess use To support two or more Spark clusters, I think we need to separate shuffle block cleanup (once an application finishes) from ESS shutdown (when the Ray cluster goes down). Btw, I am interested in the proposal for supporting more Spark cluster instances on the same Ray cluster; it would be great if I could participate in the related discussion.
Motivation
Prevent duplicate shuffle service actors on the same node by assigning unique names and reusing existing actors. Also enable fault tolerance with max restarts and task retries, and remove the now-unnecessary explicit start call.
Approach
- `ExternalShuffleServiceUtils.createShuffleService()`: generate a unique name `raydp-shuffle-service-<ip>` based on the node IP
- Set `setMaxRestarts(-1)` / `setMaxTaskRetries(-1)` to enable automatic fault recovery at the Ray level
- Remove the `startShuffleService()` method; instead, call `start()` inside the `RayExternalShuffleService` constructor so the service auto-starts on creation
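
The get-or-create idea behind the approach can be sketched without a Ray cluster. The example below is an illustration only: `NamedServiceRegistry`, `ShuffleService`, `serviceName`, and `getOrCreate` are hypothetical stand-ins backed by an in-memory map, whereas the PR itself relies on Ray's named-actor lookup. Only the `raydp-shuffle-service-<ip>` naming scheme and the auto-start in the constructor are taken from the description above.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a named-actor registry; this is not the Ray API.
public class NamedServiceRegistry {

    /** Stand-in for the shuffle service: it starts itself when constructed. */
    public static final class ShuffleService {
        public final String name;
        private boolean started;

        public ShuffleService(String name) {
            this.name = name;
            start(); // auto-start on creation, so callers need no separate start call
        }

        public final void start() {
            started = true;
        }

        public boolean isStarted() {
            return started;
        }
    }

    private final Map<String, ShuffleService> byName = new ConcurrentHashMap<>();

    /** Derives the per-node name, following the raydp-shuffle-service-<ip> scheme. */
    public static String serviceName(String nodeIp) {
        return "raydp-shuffle-service-" + nodeIp;
    }

    /** Looks up an existing service by its full name. */
    public Optional<ShuffleService> get(String name) {
        return Optional.ofNullable(byName.get(name));
    }

    /** Returns the service registered for this node, creating it only if absent. */
    public ShuffleService getOrCreate(String nodeIp) {
        return byName.computeIfAbsent(serviceName(nodeIp), ShuffleService::new);
    }

    public static void main(String[] args) {
        NamedServiceRegistry registry = new NamedServiceRegistry();
        ShuffleService first = registry.getOrCreate("10.0.0.1");
        ShuffleService second = registry.getOrCreate("10.0.0.1");
        System.out.println(first == second); // true: the existing service is reused
        System.out.println(first.name);      // raydp-shuffle-service-10.0.0.1
    }
}
```

Because the name depends only on the node IP, repeated calls for the same node are idempotent. It also means two Spark clusters on the same node would resolve to the same service, which is the trade-off raised in the review discussion.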