[SPARK-52812][CONNECT] Preserve spark.sql.sources.default for eager createTable(tableName, path)#56211
Conversation
…reateTable(tableName, path) ### What changes were proposed in this pull request? SPARK-52812 (apache#56064) made Spark Connect `Catalog.createTable` eager by re-routing the two-argument `createTable(tableName, path)` overload through `createTable(tableName, path, "parquet")`. That hardcodes the parquet provider and drops the `spark.sql.sources.default` fallback that the overload previously relied on. This PR restores the original behavior: the two-argument overload again leaves the source unset so the server resolves `spark.sql.sources.default`, while keeping the eager execution introduced by SPARK-52812. A regression test is added to `CatalogSuite`. ### Why are the changes needed? The two-argument `createTable(tableName, path)` overload is documented as "It will use the default data source configured by spark.sql.sources.default." After SPARK-52812 it always used parquet regardless of that configuration, contradicting its own contract and the classic Catalog behavior. ### Does this PR introduce _any_ user-facing change? Yes, within the unreleased master branch. `spark.catalog.createTable(tableName, path)` on Spark Connect once again honors `spark.sql.sources.default` instead of always creating a parquet table. The eager-execution behavior from SPARK-52812 is preserved. ### How was this patch tested? Added a regression test in `CatalogSuite` that sets `spark.sql.sources.default` to `json`, writes JSON data, creates the table via the two-argument overload, and asserts the resulting table uses the json provider and is readable. The test fails on the previous hardcoded-parquet behavior. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.8)
cloud-fan
left a comment
There was a problem hiding this comment.
0 blocking, 1 non-blocking, 0 nits.
Clean, correct, well-tested follow-up restoring the spark.sql.sources.default fallback while keeping eager execution.
Design / architecture (1)
Catalog.scala:398: 2-arg overload inlines the proto build, bypassing the delegation chain all other overloads use and duplicating the 5-argexecute{}+table()block. Correct and thesetSourceomission is necessary, but a privatesource: Option[String]impl shared by both would remove the duplication. Non-blocking. — see inline
| // Leave the source unset so the server resolves spark.sql.sources.default, as documented | ||
| // above. Routing through createTable(tableName, path, "parquet") would hardcode the provider | ||
| // and ignore that configuration. | ||
| sparkSession.execute { builder => |
There was a problem hiding this comment.
The two-argument overload inlines the CreateTable proto build here, bypassing the delegation chain every other createTable overload uses (each delegates down to the single 5-arg general overload) and duplicating that method's execute{...} + table() block, minus setSource.
The behavior is correct and omitting setSource is necessary — source is optional in the proto, so the server only resolves spark.sql.sources.default when hasSource() is false, and delegating to the 5-arg overload would call setSource("") and pin the source to a literal empty string. So this is non-blocking.
That said, the duplication can be removed cleanly by extracting a private impl taking source: Option[String] that both the 2-arg (None) and 5-arg (Some(source)) overloads route through, calling setSource only when defined. That keeps a single place building the proto while preserving the unset-source semantics. Your call.
What changes were proposed in this pull request?
SPARK-52812 (#56064) made Spark Connect
Catalog.createTableeager by re-routing the two-argumentcreateTable(tableName, path)overload throughcreateTable(tableName, path, "parquet"). That hardcodes the parquet provider and drops thespark.sql.sources.defaultfallback that the overload previously relied on.This PR restores the original behavior: the two-argument overload again leaves the source unset so the server resolves
spark.sql.sources.default, while keeping the eager execution introduced by SPARK-52812. A regression test is added toCatalogSuite.Why are the changes needed?
The two-argument
createTable(tableName, path)overload is documented as "It will use the default data source configured by spark.sql.sources.default." After SPARK-52812 it always used parquet regardless of that configuration, contradicting its own contract and the classic Catalog behavior.Does this PR introduce any user-facing change?
Yes, within the unreleased master branch.
spark.catalog.createTable(tableName, path)on Spark Connect once again honorsspark.sql.sources.defaultinstead of always creating a parquet table. The eager-execution behavior from SPARK-52812 is preserved.How was this patch tested?
Added a regression test in
CatalogSuitethat setsspark.sql.sources.defaulttojson, writes JSON data, creates the table via the two-argument overload, and asserts the resulting table uses the json provider and is readable. The test fails on the previous hardcoded-parquet behavior.Was this patch authored or co-authored using generative AI tooling?
Co-authored with Claude Code.