[SPARK-56505][SQL][TESTS] Add SparkSessionBinder to replace SharedSparkSession #56190

Open
fwc wants to merge 12 commits into apache:master from fwc:sharedsparksession-refactor-mostly-nonbreaking

Conversation

@fwc fwc commented May 28, 2026

What changes were proposed in this pull request?

  • Introduces sql.SparkSessionBinder and classic.SparkSessionBinder as 'implementor' counterparts to the 'declarators' sql.SparkSessionProvider and classic.SparkSessionProvider.
  • Deprecates SharedSparkSession, with a hint to use SparkSessionBinder and QueryTest instead.

Why are the changes needed?

Currently, most tests use SharedSparkSession to obtain the spark object. Because SharedSparkSession provides a classic.SparkSession, spark cannot be overridden with a Connect session, which prevents specializing these tests in sql/connect.

This PR deprecates SharedSparkSession and instead introduces sql.SparkSessionBinder and classic.SparkSessionBinder. While both create a classic.SparkSession, sql.SparkSessionBinder has an abstract def spark: sql.SparkSession declaration, so it can be overridden by a trait that provides a connect.SparkSession.

Suppose some FooSuite now uses the sql.SparkSessionBinder trait, e.g.:

class FooSuite extends SparkSessionBinder with QueryTest {
  // Run the check inside a test so it executes after the session is set up.
  test("select one") {
    checkAnswer(sql("SELECT 1"), Row(1))
  }
}

We can now add a connect variant of that suite as follows:

class FooWithConnectSuite extends FooSuite
  with connect.SparkSessionBinder
  with connect.QueryTest
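
To make the overriding argument concrete, here is a minimal, Spark-free sketch. Session, ClassicSession, ConnectSession, SessionProvider, SessionBinder and ConnectSessionBinder are hypothetical stand-ins invented for illustration; they are not the traits added in this PR and model only how spark is typed.

```scala
// Toy model only: these types stand in for Spark's classes and share none of their API.
trait Session { def kind: String }                      // plays the role of sql.SparkSession
final class ClassicSession extends Session { val kind = "classic" }
final class ConnectSession extends Session { val kind = "connect" }

// "Declarator": only promises that some session exists.
trait SessionProvider { def spark: Session }

// "Implementor": creates a classic session but exposes it under the general type.
trait SessionBinder extends SessionProvider {
  private lazy val classicSession = new ClassicSession
  def spark: Session = classicSession
}

// A Connect flavour can override `spark` because the declared type is still `Session`.
// Had SessionBinder declared `def spark: ClassicSession`, this override would not compile.
trait ConnectSessionBinder extends SessionBinder {
  private lazy val connectSession = new ConnectSession
  override def spark: Session = connectSession
}

class FooSuite extends SessionBinder
class FooWithConnectSuite extends FooSuite with ConnectSessionBinder
```

In this sketch, FooSuite sees the classic stand-in, while FooWithConnectSuite picks up the Connect stand-in purely through trait linearization, without touching the body of FooSuite.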

Does this PR introduce any user-facing change?

This PR extends the beforeAll/afterAll of SharedSparkSessionBase to include the thread audit check, which was previously only present in SharedSparkSession.
AFAICS, SparkSessionBase is used neither in Delta Lake nor in Apache Iceberg.

How was this patch tested?

This patch is test-only.

Was this patch authored or co-authored using generative AI tooling?

Parts of this patch were authored by Claude Code.

@fwc fwc force-pushed the sharedsparksession-refactor-mostly-nonbreaking branch from b7ba3f5 to 4c35b22 Compare May 28, 2026 20:54
Contributor

@cloud-fan cloud-fan left a comment

1 blocking, 2 non-blocking, 3 nits.
Right direction — decoupling the session type so suites can run on classic or Connect. My main feedback is on the author-facing shape: I'd push for a binder-free base + per-env concrete suites, with the bare SparkSessionBinder kept internal.

Design / architecture (1)

  • sql/core/.../sql/QueryTest.scala:1214: push the binder-free-base + classic/connect-concrete pattern; treat bare SparkSessionBinder as internal — see inline

Suggestions (2)

  • sql/connect/.../connect/SparkSessionBinder.scala:89: redundant afterEach override with an inaccurate comment — see inline
  • sql/connect/.../connect/QueryTest.scala:30: only one checkAnswer overload overridden — see inline

Nits: 3 minor items (see inline comments).

}

- class QueryTestSuite extends test.SharedSparkSession {
+ class QueryTestSuite extends QueryTest with SparkSessionBinder {
Contributor

This migration — mixing the bare sql.SparkSessionBinder into a concrete suite — is the shape I'd push back on. sql.SparkSessionBinder binds a classic session but exposes spark only as the abstract sql.SparkSession, so it's really internal plumbing, not what a test author should reach for.

The end-state I'd recommend documenting and demonstrating is a binder-free base + per-env concrete suites:

abstract class FooSuiteBase extends QueryTest {          // no binder; spark abstract
  test("shared") { checkAnswer(sql("SELECT 1"), Row(1)) }
}
class FooSuite extends FooSuiteBase with classic.SparkSessionBinder {
  test("classic only") { ... }
}
class FooConnectSuite extends FooSuiteBase
  with connect.SparkSessionBinder with connect.QueryTest {
  test("connect only") { ... }
}

QueryTest already mixes in SparkSessionProvider (via SQLTestData) and leaves spark abstract, so it works as the env-agnostic base directly. Concretely: (1) steer the migration and the @deprecated message at classic.SparkSessionBinder / connect.SparkSessionBinder + this base pattern, not the bare binder; (2) QueryTestWithConnectSuite currently demonstrates the retrofit path (extending an already-classic-bound QueryTestSuite and overriding the binding) — a binder-free base would demonstrate the cleaner pattern and double as the template authors copy.

Author

I want to nudge test authors towards writing (somewhat) connect-compatible tests by default, which is why I want them to write tests with a sql.SparkSession in hand.

My fear is that the 'clean' way is not the 'easiest' way. Most current tests do not use an abstract base class, and I fear that most test authors will default to starting a new suite with classic.SparkSessionBinder, as they might not think about Connect in that moment:

// hypothetical antipattern, but path of least resistance:
class FooSuite extends QueryTest with classic.SparkSessionBinder {
  test("all tests, both shared and classic only") { ... }
}

I reworked the PR so that SparkSessionBinder now implements QueryTest. Now classic.SparkSessionBinder is a drop-in replacement for SharedSparkSession and sql.SparkSessionBinder provides the new, 'fixed' default.

What do you think of this approach?

}
}

// The base SharedSparkSessionBase.afterEach calls spark.sharedState which is not supported
Contributor

This comment is inaccurate after the refactor: connect.SparkSessionBinder extends sql.SparkSessionBinder directly (not SharedSparkSessionBase), and that parent's afterEach clears the cache via the private _spark — the classic session, which is exactly what's used on Connect (createSparkSession isn't overridden). So the parent's afterEach already works here and this override is redundant. If you do keep it, note that skipping super.afterEach() drops the BeforeAndAfterEach chain. Simplest fix is to remove the override entirely.
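
The point about dropping the chain can be illustrated with a small, Spark-free sketch. Lifecycle, CacheClearingBinder and ChainDroppingBinder are hypothetical names that only mimic the stacking behaviour of a BeforeAndAfterEach-style hook; they are not the traits in this PR.

```scala
import scala.collection.mutable.ListBuffer

// Toy model only: plain traits standing in for a stack of afterEach hooks.
trait Lifecycle {
  val log: ListBuffer[String] = ListBuffer.empty
  def afterEach(): Unit = { log += "base cleanup" }
}

// Parent binder: clears the cache, then continues the chain.
trait CacheClearingBinder extends Lifecycle {
  override def afterEach(): Unit = {
    log += "clear cache"
    super.afterEach()
  }
}

// Override that does not call super.afterEach(): every hook below it is skipped.
trait ChainDroppingBinder extends CacheClearingBinder {
  override def afterEach(): Unit = { log += "connect cleanup" }
}

class KeepsChain extends CacheClearingBinder
class DropsChain extends ChainDroppingBinder
```

Calling afterEach() on KeepsChain records both the cache clearing and the base cleanup, whereas on DropsChain only the overriding step runs. That is the behaviour to watch for if the override is kept.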

*/
trait QueryTest extends sqlApi.QueryTest with SparkSessionProvider {

override protected def checkAnswer(
Contributor

This overrides only the checkAnswer(df, Seq[Row]) variant, which is enough for QueryTestSuite. But the stated goal is re-running arbitrary sql/core suites over Connect, and the other QueryTest helpers (other checkAnswer overloads, checkDataset, ...) still reach classic-only paths like queryExecution/logicalPlan. Worth a line in the trait doc noting that broader reuse will need more overrides.
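
As a rough, Spark-free illustration of why one override does not cover every helper: in the sketch below, BaseQueryTest and ConnectQueryTest are hypothetical traits, and the returned strings merely record which code path ran. Which real helpers delegate to the overridden overload is an assumption that would need checking per helper.

```scala
// Toy model only: String results record which comparison path was taken.
trait BaseQueryTest {
  def checkAnswer(query: String, expected: Seq[Int]): String = "classic path"
  // Delegates to the Seq overload, so it follows any override of that overload.
  def checkAnswer(query: String, expected: Int): String = checkAnswer(query, Seq(expected))
  // Does not delegate, so it keeps using the base implementation.
  def checkDataset(query: String, expected: Int*): String = "classic path"
}

trait ConnectQueryTest extends BaseQueryTest {
  override def checkAnswer(query: String, expected: Seq[Int]): String = "connect path"
}

object ConnectSuiteSketch extends ConnectQueryTest
```

Helpers that funnel into the overridden overload are covered for free; any helper with its own implementation, like checkDataset in this sketch, still takes the base path until it gets its own override.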

/**
* Runs [[QueryTestSuite]] tests through a Connect session.
*
* This validates the `FooSuite with connect.SharedSparkSession` pattern: the existing
Contributor

There's no connect.SharedSparkSession trait; the pattern this suite actually uses (and that the sibling connect/SparkSessionBinder.scala doc shows) is connect.SparkSessionBinder with connect.QueryTest.

Suggested change
- * This validates the `FooSuite with connect.SharedSparkSession` pattern: the existing
+ * This validates the `FooSuite with connect.SparkSessionBinder with connect.QueryTest` pattern: the existing

}

/**
* Suites extending [[SharedSparkSession]] are sharing resources (e.g. SparkSession) in their
Contributor

This doc moved out of SharedSparkSession; the snapshot-before-init logic now lives in this trait, so referring to SharedSparkSession is stale.

Suggested change
- * Suites extending [[SharedSparkSession]] are sharing resources (e.g. SparkSession) in their
+ * Suites extending this trait are sharing resources (e.g. SparkSession) in their

import org.apache.spark.sql
import org.apache.spark.sql.{classic, QueryTest, QueryTestBase}

@deprecated("Use SparkSessionBinder and QueryTest instead")
Contributor

@deprecated takes a since version as its second argument; adding it documents when the deprecation started and matches the convention elsewhere in the codebase. Same on line 59.
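
A minimal sketch of the two-argument form, using Scala's standard scala.deprecated annotation. The trait names are hypothetical stand-ins, and "x.y.z" is a placeholder rather than the actual release in which the deprecation would land.

```scala
// Hypothetical stand-in for the replacement trait.
trait SparkSessionBinderSketch

// First argument: migration hint. Second argument: version since which it is deprecated.
// "x.y.z" is a placeholder, not a real Spark version.
@deprecated("Use SparkSessionBinder and QueryTest instead", "x.y.z")
trait SharedSparkSessionSketch extends SparkSessionBinderSketch
```

With the since value filled in, the compiler's deprecation warning reports both the hint and the version at each use site.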

Author

fwc commented May 29, 2026

Hi @cloud-fan, I changed the PR so that SharedSparkSession is now an empty alias of classic.SparkSessionBinder with a deprecation note that recommends using sql.SparkSessionBinder if possible.

I am unsure about the SparkSessionBinder name:

AFAICS SharedSparkSession is/was the testing trait (~500 extends/implements usages compared to ~150 usages of QueryTest, both according to IntelliJ's "Find Usages").
If it weren't a breaking change, I'd want to rename QueryTest to QueryTestHelpers and SparkSessionBinder to QueryTest. What do you think? Maybe QuerySuite? Maybe SparkSessionSuite?

@fwc fwc requested a review from cloud-fan May 29, 2026 22:43