
Fix/idempotent spark submit #2840

Open
aditya-systems-hub wants to merge 2 commits into kubeflow:master from aditya-systems-hub:fix/idempotent-spark-submit

Conversation

@aditya-systems-hub

Purpose of this PR

This PR fixes an issue where a Spark application could be submitted more than once during reconciliation, leading to incorrect failures and orphaned driver pods.

Previously, when a status update failed due to a Kubernetes conflict, the controller retried the whole submission. This caused spark-submit to run again even though the driver pod was already created. The second submission failed because the driver pod already existed, and the application was then incorrectly marked as failed. At the same time, events from the original running driver pod were ignored because a new SubmissionID was generated.

This change makes the submission process idempotent and ensures retries only update status rather than resubmitting the application.

Proposed changes:

  • Added an idempotency check in submitSparkApplication() to reuse an existing driver pod instead of running spark-submit again.
  • Moved submission logic outside RetryOnConflict loops so retries only affect status updates.
  • Applied the same fix to both new application reconciliation and resuming applications.
  • Added a new test case verifying recovery when a driver pod already exists.

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

Spark application submission should happen only once. Retrying submission during status conflicts caused duplicate submission attempts, incorrect failure states, and orphaned driver pods. By making submission idempotent and retry-safe, the operator behaves correctly under conflict retries and prevents silent event loss.


Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

Verification performed:

  • go build ./... passes successfully.
  • All unit tests pass.
  • Integration tests pass (25/25 with one expected skip).
  • Added idempotency test confirming recovery from an existing driver pod.
  • The AfterSuite warning is related to Windows envtest process handling and not caused by this change.

Branch pushed: fix/idempotent-spark-submit.


Test case

Screenshots: test-case verification output (six screenshots, 2026-02-10).

Use RetryOnConflict to handle concurrent updates when setting application
status to Invalidating. Fetch a fresh copy of the SparkApplication to avoid
modifying the cached informer object, and skip update if already in a
terminal or transitioning state.

Also return true on error to ensure the update event reaches the reconciler,
preventing silent spec change drops when there are conflicts.

Signed-off-by: Adityakuchekar <adityakuchekar0077@gmail.com>
@github-actions

github-actions bot commented Feb 9, 2026

🎉 Welcome to the Kubeflow Spark Operator! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow bot requested review from ImpSy and nabuskey on February 9, 2026 23:34
…river pods

Signed-off-by: aditya-systems-hub <adityakuchekar0077@gmail.com>
@aditya-systems-hub force-pushed the fix/idempotent-spark-submit branch from c6781ba to c6d957b (February 9, 2026 23:36)
@aditya-systems-hub
Author

Hi @ChenYi015,
I’ve addressed the earlier feedback and updated the PR with the fixes. All tests are passing locally, and the changes are now limited to the required logic only. Could you please review the PR when you have time? Your feedback would be greatly appreciated.

@nabuskey
Contributor

What exactly is the issue this is trying to solve? What do you mean by conflict? How do you reproduce it? What are you expecting and what do you see instead?
This reads like it was completely generated by an LLM. It's fine to use LLMs for coding, I use one too. But you need to be in control and tell us what you are doing.

@aditya-systems-hub
Author

Thanks for the direct feedback — I understand the concern.

The issue happens during reconciliation when the operator updates the SparkApplication status. If Kubernetes returns a conflict error (for example, due to a resourceVersion change), the controller retries reconciliation.
In the current behavior, that retry can trigger another spark-submit call even though the driver pod was already created in the previous attempt.

Expected:
If the driver pod already exists, the operator should detect it and continue managing it without submitting again.

Actual:
On retry, a second submission may occur. This can cause a failed submission, inconsistent application state, or lost events from the original driver pod.
You can reproduce it by forcing a status update conflict during reconciliation (for example, modifying the SparkApplication while it is being processed) and observing that the retry path may invoke submission again.
This PR makes the submission step idempotent by checking for an existing driver pod before calling spark-submit. That way, retries remain safe and do not resubmit the application. I'm happy to expand the reproduction steps further if needed.
