[SPARK-56505][SQL][TESTS] Add SparkSessionBinder to replace SharedSparkSession #56190
fwc wants to merge 12 commits into
…parkSession This is technically an 'api change' as it moves the thread audit stuff from `test.SharedSparkSession` to `test.SharedSparkSessionBase`. This breaks code that implements `SharedSparkSessionBase` to circumvent the thread audit stuff.
b7ba3f5 to
4c35b22
cloud-fan left a comment
1 blocking, 2 non-blocking, 3 nits.
Right direction — decoupling the session type so suites can run on classic or Connect. My main feedback is on the author-facing shape: I'd push for a binder-free base + per-env concrete suites, with the bare SparkSessionBinder kept internal.
Design / architecture (1)
sql/core/.../sql/QueryTest.scala:1214: push the binder-free-base + classic/connect-concrete pattern; treat bare SparkSessionBinder as internal (see inline)
Suggestions (2)
sql/connect/.../connect/SparkSessionBinder.scala:89: redundant afterEach override with an inaccurate comment (see inline)
sql/connect/.../connect/QueryTest.scala:30: only one checkAnswer overload overridden (see inline)
Nits: 3 minor items (see inline comments).
}

- class QueryTestSuite extends test.SharedSparkSession {
+ class QueryTestSuite extends QueryTest with SparkSessionBinder {
This migration — mixing the bare sql.SparkSessionBinder into a concrete suite — is the shape I'd push back on. sql.SparkSessionBinder binds a classic session but exposes spark only as the abstract sql.SparkSession, so it's really internal plumbing, not what a test author should reach for.
The end-state I'd recommend documenting and demonstrating is a binder-free base + per-env concrete suites:
abstract class FooSuiteBase extends QueryTest { // no binder; spark abstract
test("shared") { checkAnswer(sql("SELECT 1"), Row(1)) }
}
class FooSuite extends FooSuiteBase with classic.SparkSessionBinder {
test("classic only") { ... }
}
class FooConnectSuite extends FooSuiteBase
with connect.SparkSessionBinder with connect.QueryTest {
test("connect only") { ... }
}
QueryTest already mixes in SparkSessionProvider (via SQLTestData) and leaves spark abstract, so it works as the env-agnostic base directly. Concretely: (1) steer the migration and the @deprecated message at classic.SparkSessionBinder / connect.SparkSessionBinder + this base pattern, not the bare binder; (2) QueryTestWithConnectSuite currently demonstrates the retrofit path (extending an already-classic-bound QueryTestSuite and overriding the binding); a binder-free base would demonstrate the cleaner pattern and double as the template authors copy.
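For readers who want to try the shape in isolation, here is a minimal, self-contained sketch of the binder-free base + per-env concrete suites idea. Everything in it is a stand-in: `Session`, `ClassicSession`, `ConnectSession`, `SessionProvider`, `ClassicBinder` and `ConnectBinder` are hypothetical names that only model the roles discussed above, not the real Spark traits.

```scala
// Hypothetical stand-ins for sql.SparkSession and its two implementations.
trait Session { def sql(query: String): String }
final class ClassicSession extends Session {
  def sql(query: String): String = s"classic:$query"
  // Classic-only API, visible only when `spark` is typed as ClassicSession.
  def plan(query: String): String = s"plan($query)"
}
final class ConnectSession extends Session {
  def sql(query: String): String = s"connect:$query"
}

// Declares the session but does not create one.
trait SessionProvider { protected def spark: Session }

// Env-agnostic base: no binder mixed in, `spark` stays abstract.
abstract class FooSuiteBase extends SessionProvider {
  def sharedTest(): String = spark.sql("SELECT 1")
}

// Binders supply a concrete session and narrow its static type.
trait ClassicBinder extends SessionProvider {
  override protected lazy val spark: ClassicSession = new ClassicSession
}
trait ConnectBinder extends SessionProvider {
  override protected lazy val spark: ConnectSession = new ConnectSession
}

// Per-env concrete suites.
class FooSuite extends FooSuiteBase with ClassicBinder {
  def classicOnly(): String = spark.plan("SELECT 1")
}
class FooConnectSuite extends FooSuiteBase with ConnectBinder

object BinderFreeBaseDemo {
  def main(args: Array[String]): Unit = {
    println(new FooSuite().sharedTest())        // classic:SELECT 1
    println(new FooConnectSuite().sharedTest()) // connect:SELECT 1
    println(new FooSuite().classicOnly())       // plan(SELECT 1)
  }
}
```

The shared test compiles against the abstract `Session` only, so it cannot accidentally depend on classic-only API; the classic-only test can, because the binder narrows the type of `spark` in `FooSuite`.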
I want to nudge test authors towards writing (somewhat) connect-compatible tests by default, which is why I want them to write tests with a sql.SparkSession in hand.
My fear is that the 'clean' way is not the 'easiest' way. Most current tests do not use an abstract base class, and I fear that most test authors will default to just starting a new suite with classic.SparkSessionBinder, as they might not think about connect in that moment:
// hypothetical antipattern, but path of least resistance:
class FooSuite extends QueryTest with classic.SparkSessionBinder {
test("all tests, both shared and classic only") { ... }
}
I reworked the PR so that SparkSessionBinder now implements QueryTest. Now classic.SparkSessionBinder is a drop-in replacement for SharedSparkSession and sql.SparkSessionBinder provides the new, 'fixed' default.
What do you think of this approach?
}
}

// The base SharedSparkSessionBase.afterEach calls spark.sharedState which is not supported
This comment is inaccurate after the refactor: connect.SparkSessionBinder extends sql.SparkSessionBinder directly (not SharedSparkSessionBase), and that parent's afterEach clears the cache via the private _spark — the classic session, which is exactly what's used on Connect (createSparkSession isn't overridden). So the parent's afterEach already works here and this override is redundant. If you do keep it, note that skipping super.afterEach() drops the BeforeAndAfterEach chain. Simplest fix is to remove the override entirely.
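To make the point about the dropped chain concrete, here is a small self-contained sketch of stacked `afterEach` overrides. The traits (`Lifecycle`, `ClearsCache`, `SkipsSuper`, `CallsSuper`) are hypothetical and only mimic how stackable traits behave; they are not ScalaTest's `BeforeAndAfterEach` nor the binder in this PR.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical root of the chain; records which hooks actually ran.
trait Lifecycle {
  val log: ListBuffer[String] = ListBuffer.empty
  def afterEach(): Unit = { log += "base"; () }
}

// A parent hook that does real cleanup and keeps the chain going.
trait ClearsCache extends Lifecycle {
  override def afterEach(): Unit = { log += "clear cache"; super.afterEach() }
}

// An override that forgets super: everything earlier in the chain is skipped.
trait SkipsSuper extends Lifecycle {
  override def afterEach(): Unit = { log += "custom cleanup"; () }
}

// The same override, but chaining to super.
trait CallsSuper extends Lifecycle {
  override def afterEach(): Unit = { log += "custom cleanup"; super.afterEach() }
}

class Broken extends ClearsCache with SkipsSuper
class Chained extends ClearsCache with CallsSuper

object AfterEachChainDemo {
  def run(suite: Lifecycle): List[String] = { suite.afterEach(); suite.log.toList }

  def main(args: Array[String]): Unit = {
    println(run(new Broken))  // List(custom cleanup)
    println(run(new Chained)) // List(custom cleanup, clear cache, base)
  }
}
```

In `Broken`, the cache-clearing hook never runs, which is the failure mode to watch for if the override is kept.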
*/
trait QueryTest extends sqlApi.QueryTest with SparkSessionProvider {

  override protected def checkAnswer(
This overrides only the checkAnswer(df, Seq[Row]) variant, which is enough for QueryTestSuite. But the stated goal is re-running arbitrary sql/core suites over Connect, and the other QueryTest helpers (other checkAnswer overloads, checkDataset, ...) still reach classic-only paths like queryExecution/logicalPlan. Worth a line in the trait doc noting that broader reuse will need more overrides.
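As a generic illustration of why one override is not enough, the sketch below uses two hypothetical, independent overloads of a `check` helper. It is deliberately not modeled on the real `checkAnswer` signatures, and it assumes the overloads do not delegate to each other; whether the real ones do is not shown in this thread.

```scala
// Hypothetical base helper with two independent overloads.
class BaseChecks {
  def check(query: String, expected: Seq[Int]): String = s"base(seq): $query"
  def check(query: String, expected: Int): String = s"base(single): $query"
}

// Only the Seq overload is redirected; the other one is inherited unchanged.
class ConnectChecks extends BaseChecks {
  override def check(query: String, expected: Seq[Int]): String = s"connect(seq): $query"
}

object OverloadDemo {
  def main(args: Array[String]): Unit = {
    val checks = new ConnectChecks
    println(checks.check("SELECT 1", Seq(1))) // connect(seq): SELECT 1
    println(checks.check("SELECT 1", 1))      // base(single): SELECT 1
  }
}
```

Any helper that is not overridden, and does not route through an overridden one, keeps running the base implementation.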
/**
 * Runs [[QueryTestSuite]] tests through a Connect session.
 *
 * This validates the `FooSuite with connect.SharedSparkSession` pattern: the existing
There's no connect.SharedSparkSession trait; the pattern this suite actually uses (and that the sibling connect/SparkSessionBinder.scala doc shows) is connect.SparkSessionBinder with connect.QueryTest.
-  * This validates the `FooSuite with connect.SharedSparkSession` pattern: the existing
+  * This validates the `FooSuite with connect.SparkSessionBinder with connect.QueryTest` pattern: the existing
}

/**
 * Suites extending [[SharedSparkSession]] are sharing resources (e.g. SparkSession) in their
This doc moved out of SharedSparkSession; the snapshot-before-init logic now lives in this trait, so referring to SharedSparkSession is stale.
-  * Suites extending [[SharedSparkSession]] are sharing resources (e.g. SparkSession) in their
+  * Suites extending this trait are sharing resources (e.g. SparkSession) in their
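The snapshot-before-init idea behind a thread audit can be sketched in a few lines. This is an assumption-laden toy, not Spark's actual audit code: `liveThreadNames`, `leakedSince` and the `leaky-worker` thread are hypothetical names chosen for illustration.

```scala
import java.util.concurrent.CountDownLatch
import scala.jdk.CollectionConverters._

object ThreadAuditSketch {
  // Names of all threads that are alive right now.
  def liveThreadNames(): Set[String] =
    Thread.getAllStackTraces.keySet.asScala.map(_.getName).toSet

  // Threads that exist now but were absent from the earlier snapshot.
  def leakedSince(before: Set[String]): Set[String] = liveThreadNames() -- before

  // Take the snapshot BEFORE starting shared resources, compare afterwards.
  def demo(): Set[String] = {
    val before = liveThreadNames()
    val release = new CountDownLatch(1)
    val worker = new Thread(new Runnable {
      def run(): Unit = release.await() // stays alive until released
    }, "leaky-worker")
    worker.setDaemon(true)
    worker.start()
    val leaked = leakedSince(before)
    release.countDown() // clean up the demo thread
    worker.join()
    leaked
  }

  def main(args: Array[String]): Unit =
    println(demo().contains("leaky-worker")) // true
}
```

The ordering matters: if the snapshot were taken after the shared session is initialized, threads started by that initialization would not be reported.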
import org.apache.spark.sql
import org.apache.spark.sql.{classic, QueryTest, QueryTestBase}

@deprecated("Use SparkSessionBinder and QueryTest instead")
@deprecated takes a since version as its second argument; adding it documents when the deprecation started and matches the convention elsewhere in the codebase. Same on line 59.
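For reference, Scala's `@deprecated` annotation accepts `(message, since)`. The sketch below shows the two-argument form on a hypothetical trait; `OldBinder`, `NewBinder` and the version string `"x.y.z"` are placeholders, since the actual release is not stated in this thread.

```scala
// Placeholder names and version; only the annotation shape matters here.
@deprecated("Use NewBinder instead", "x.y.z")
trait OldBinder { def name: String = "old" }

trait NewBinder { def name: String = "new" }

// Extending the deprecated trait still compiles, but emits a deprecation
// warning that includes both the message and the `since` value.
class LegacySuite extends OldBinder
class MigratedSuite extends NewBinder

object DeprecatedSinceDemo {
  def main(args: Array[String]): Unit = {
    println(new LegacySuite().name)   // old
    println(new MigratedSuite().name) // new
  }
}
```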
Hi @cloud-fan, I changed the PR so that I am unsure with regards to the AFAICS
What changes were proposed in this pull request?
- Add `sql.SparkSessionBinder` and `classic.SparkSessionBinder` as 'implementor' counterparts to the 'declarators' `sql.SparkSessionProvider` and `classic.SparkSessionProvider`.
- Deprecate `SharedSparkSession` with the hint that `SparkSessionBinder` and `QueryTest` shall be used instead.

Why are the changes needed?
Currently, most tests use `SharedSparkSession` to obtain the `spark` object. This prevents specializing these tests in `sql/connect`, as `SharedSparkSession` provides a `classic.SparkSession`, thus preventing overriding.
This PR deprecates `SharedSparkSession` and instead introduces `sql.SparkSessionBinder` and `classic.SparkSessionBinder`. While both create a `classic.SparkSession`, the `sql.SparkSessionBinder` has an abstract `def spark: sql.SparkSession` declaration, so it can be overridden with some trait that provides a `connect.SparkSession`.
If some `FooSuite` now uses the `sql.SparkSessionBinder` trait like e.g.
We can now add a connect variant of that suite as follows:
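As a rough, self-contained illustration of the override mechanics described in this section (not the PR's actual snippets), the sketch below uses stand-in types. `Session`, `ClassicSession`, `ConnectSession`, `SqlBinder` and `ConnectBinder` are all hypothetical; the only point is that a `spark` member typed as the common supertype can be rebound by a later trait, while the classic session keeps backing it.

```scala
// Hypothetical stand-ins for sql.SparkSession and its implementations.
trait Session { def kind: String }
final class ClassicSession extends Session { val kind: String = "classic" }
final class ConnectSession(val backing: ClassicSession) extends Session {
  val kind: String = "connect"
}

// Creates a classic session but exposes it only as the general type.
trait SqlBinder {
  private val _spark: ClassicSession = new ClassicSession
  protected def classicSession: ClassicSession = _spark
  def spark: Session = _spark
}

// Because `spark` is typed as Session, it can be rebound to a connect session.
// Had it been declared as ClassicSession, this override would not type-check.
trait ConnectBinder extends SqlBinder {
  private lazy val connect = new ConnectSession(classicSession)
  override def spark: Session = connect
}

class FooSuite extends SqlBinder
class FooConnectSuite extends FooSuite with ConnectBinder

object RebindDemo {
  def main(args: Array[String]): Unit = {
    println(new FooSuite().spark.kind)        // classic
    println(new FooConnectSuite().spark.kind) // connect
  }
}
```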
Does this PR introduce any user-facing change?
This PR extends the `beforeAll`/`afterAll` of `SharedSparkSessionBase` to include the thread audit check, which was previously only present in `SharedSparkSession`.
AFAICS, `SparkSessionBase` is neither used in Delta Lake nor in Apache Iceberg.
How was this patch tested?
This patch is test-only.
Was this patch authored or co-authored using generative AI tooling?
Parts of this patch were authored by Claude Code.