
Fix/idempotent spark submit #2840

Open
aditya-systems-hub wants to merge 2 commits into kubeflow:master from aditya-systems-hub:fix/idempotent-spark-submit

Conversation

@aditya-systems-hub

Purpose of this PR

This PR fixes an issue where a Spark application could be submitted more than once during reconciliation, leading to incorrect failures and orphaned driver pods.

Previously, when a status update failed due to a Kubernetes conflict, the controller retried the whole submission. This caused spark-submit to run again even though the driver pod was already created. The second submission failed because the driver pod already existed, and the application was then incorrectly marked as failed. At the same time, events from the original running driver pod were ignored because a new SubmissionID was generated.

This change makes the submission process idempotent and ensures retries only update status rather than resubmitting the application.

Proposed changes:

  • Added an idempotency check in submitSparkApplication() to reuse an existing driver pod instead of running spark-submit again.
  • Moved submission logic outside RetryOnConflict loops so retries only affect status updates.
  • Applied the same fix to both new application reconciliation and resuming applications.
  • Added a new test case verifying recovery when a driver pod already exists.

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

Spark application submission should happen only once. Retrying submission during status conflicts caused duplicate submission attempts, incorrect failure states, and orphaned driver pods. By making submission idempotent and retry-safe, the operator behaves correctly under conflict retries and prevents silent event loss.


Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

Verification performed:

  • go build ./... passes successfully.
  • All unit tests pass.
  • Integration tests pass (25/25 with one expected skip).
  • Added idempotency test confirming recovery from an existing driver pod.
  • The AfterSuite warning is related to Windows envtest process handling and not caused by this change.

Branch pushed: fix/idempotent-spark-submit.


Test case

Screenshots: test-case verification output (six screenshots, 2026-02-10).

Use RetryOnConflict to handle concurrent updates when setting application
status to Invalidating. Fetch a fresh copy of the SparkApplication to avoid
modifying the cached informer object, and skip update if already in a
terminal or transitioning state.

Also return true on error to ensure the update event reaches the reconciler,
preventing silent spec change drops when there are conflicts.

Signed-off-by: Adityakuchekar <adityakuchekar0077@gmail.com>
@github-actions

github-actions bot commented Feb 9, 2026

🎉 Welcome to the Kubeflow Spark Operator! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow bot requested review from ImpSy and nabuskey on February 9, 2026 23:34
…river pods

Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
@aditya-systems-hub force-pushed the fix/idempotent-spark-submit branch from c6781ba to c6d957b (February 9, 2026 23:36)
@aditya-systems-hub
Author

Hi @ChenYi015,
I’ve addressed the earlier feedback and updated the PR with the fixes. All tests are passing locally, and the changes are now limited to the required logic only. Could you please review the PR when you have time? Your feedback would be greatly appreciated.

@nabuskey
Contributor

What exactly is the issue this is trying to solve? What do you mean by conflict? How do you reproduce it? What are you expecting and what do you see instead?
This reads like it was completely generated by an LLM. It's fine to use LLMs for coding, I use one too. But you need to be in control and tell us what you are doing.

@aditya-systems-hub
Author

Thanks for the direct feedback — I understand the concern.

The issue happens during reconciliation when the operator updates the SparkApplication status. If Kubernetes returns a conflict error (for example, due to a resourceVersion change), the controller retries reconciliation.
In the current behavior, that retry can trigger another spark-submit call even though the driver pod was already created in the previous attempt.

Expected:
If the driver pod already exists, the operator should detect it and continue managing it without submitting again.

Actual:
On retry, a second submission may occur. This can cause a failed submission, inconsistent application state, or lost events from the original driver pod.
You can reproduce it by forcing a status update conflict during reconciliation (for example, modifying the SparkApplication while it is being processed) and observing that the retry path may invoke submission again.
This PR makes the submission step idempotent by checking for an existing driver pod before calling spark-submit. That way, retries remain safe and do not resubmit the application. I'm happy to expand the reproduction steps further if needed.
