Fix/idempotent spark submit#2840
Use RetryOnConflict to handle concurrent updates when setting the application status to Invalidating. Fetch a fresh copy of the SparkApplication to avoid modifying the cached informer object, and skip the update if the application is already in a terminal or transitioning state. Also return true on error to ensure the update event reaches the reconciler, preventing spec changes from being silently dropped when there are conflicts.
Signed-off-by: Adityakuchekar <adityakuchekar0077@gmail.com>
…river pods Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
c6781ba to c6d957b
Hi @ChenYi015,
What exactly is the issue this is trying to solve? What do you mean by conflict? How do you reproduce it? What are you expecting and what do you see instead?
Thanks for the direct feedback. I understand the concern. The issue happens during reconciliation when the operator updates the SparkApplication status. If Kubernetes returns a conflict error (for example, due to a resourceVersion change), the controller retries reconciliation. Expected: Actual:
Purpose of this PR
This PR fixes an issue where a Spark application could be submitted more than once during reconciliation, leading to incorrect failures and orphaned driver pods.
Previously, when a status update failed due to a Kubernetes conflict, the controller retried the whole submission. This caused spark-submit to run again even though the driver pod was already created. The second submission failed because the driver pod already existed, and the application was then incorrectly marked as failed. At the same time, events from the original running driver pod were ignored because a new SubmissionID was generated.
This change makes the submission process idempotent and ensures retries only update status rather than resubmitting the application.
Proposed changes:
- submitSparkApplication(): reuse an existing driver pod instead of running spark-submit again.
- RetryOnConflict loops: retries only affect status updates.
Change Category
Rationale
Spark application submission should happen only once. Retrying submission during status conflicts caused duplicate submission attempts, incorrect failure states, and orphaned driver pods. By making submission idempotent and retry-safe, the operator behaves correctly under conflict retries and prevents silent event loss.
Checklist
Additional Notes
Verification performed:
- go build ./... passes successfully.
Branch pushed: fix/idempotent-spark-submit.
Test case