Skip to content

[SPARK-56273][SQL] Simplify extracting fields from DataSourceV2ScanRelation#55070

Closed
yyanyy wants to merge 5 commits intoapache:masterfrom
yyanyy:extract-v2-scan-relation-extractors
Closed

[SPARK-56273][SQL] Simplify extracting fields from DataSourceV2ScanRelation#55070
yyanyy wants to merge 5 commits intoapache:masterfrom
yyanyy:extract-v2-scan-relation-extractors

Conversation

@yyanyy
Copy link
Copy Markdown
Contributor

@yyanyy yyanyy commented Mar 28, 2026

What changes were proposed in this pull request?

DataSourceV2ScanRelation is a case class with 5 fields. Many pattern match sites only need the scan field or a subset of (relation, scan, output), using wildcards for the rest. This couples every match site to the constructor arity, so adding or removing fields requires updating all of them.

This PR introduces two extractors following the ExtractV2Table precedent (SPARK-53720):

  • ExtractV2Scan: returns Scan
  • ExtractV2ScanRelation: returns (DataSourceV2Relation, Scan, Seq[AttributeReference])

Updated 14 pattern match sites across 10 files. Type-based matches and constructor calls are unchanged.

Why are the changes needed?

Code simplification. Similar to 6bae835

Does this PR introduce any user-facing change?

no

How was this patch tested?

This PR relies on existing tests.

Was this patch authored or co-authored using generative AI tooling?

Yes - Opus 4.6 

Copy link
Copy Markdown
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice idea, just a few comments

* An extractor that returns a 3-tuple of ([[DataSourceV2Relation]], [[Scan]],
* output attributes) from a [[DataSourceV2ScanRelation]].
*
* This is useful when a pattern match site needs the relation, scan, and output
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, should we remove?

* This is useful when a pattern match site only needs the scan and should not be
* coupled to the full constructor signature of [[DataSourceV2ScanRelation]].
*/
object ExtractV2Scan {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we move it to the other Extractor for DSV2Relation below (like ExtractV2Table)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

/**
* An extractor that returns the [[Scan]] from a [[DataSourceV2ScanRelation]].
*
* This is useful when a pattern match site only needs the scan and should not be
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we remove this comment? (thinking, else why not put it on all the extractors?)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

*/
object ExtractV2ScanRelation {
def unapply(scanRelation: DataSourceV2ScanRelation)
: Option[(DataSourceV2Relation, Scan, Seq[AttributeReference])] =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExtractV2Scan returns v2 scan, this returns DataSourceV2Relation, shall we name it ExtractV2Relation?

@yyanyy
Copy link
Copy Markdown
Contributor Author

yyanyy commented Mar 30, 2026

Thank you both for the quick review! I have updated the PR per comment, please take a look again when you have a chance

Some(scanRelation.scan)
}

object ExtractV2Relation {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still think a bit more about the name here. It does extract a relation but it does this in a scan context and the name doesn't say anything about that. I agree ExtractV2ScanRelation is also somewhat not accurate as we are not extracting it. I wonder if things like ExtractV2ScanInfo would be better.

Let me think. Name alternatives are welcome.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to change the name to indicate it extract more than relation (sorry didnt reply earlier to #55070 (comment)) ExtractV2ScanComponents is another one, but maybe Info is better as its shorter

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we end up not extracting output, it could be something like: ExtractV2RelationAndScan.

The key ask: the name has to indicate it applies to scanning.

Copy link
Copy Markdown
Contributor

@anton5798 anton5798 Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we have to include Extract in the name? might be cleaner w/o it (it's already assumed it is an extractor..). V2RelationAndScan / V2Relation / V2ScanContext

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anton5798, I had a similar question when we did extractors for DataSourceV2Relation but the consensus was that it is beneficial to have ExtractXXX explicitly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure ExtractV2ScanRelationInfo has crossed some of your mind but it's probably too long?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like ExtractV2ScanInfo the most and still include output, let me update the PR with this


object ExtractV2Relation {
def unapply(scanRelation: DataSourceV2ScanRelation)
: Option[(DataSourceV2Relation, Scan, Seq[AttributeReference])] =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it common that we need output as well? Technically, you could do like this:

r @ ExtractV2Relation(_, scan: LocalScan) => r.output

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 of the 6 instances that requires ExtractV2Relation would need output but all usages are currently in DataSourceV2Strategy so I guess we can consider it either way...

@@ -258,7 +258,7 @@ class CollatedFilterPushDownToParquetV2Suite extends CollatedFilterPushDownToPar
override def getPushedDownFilters(query: DataFrame): Seq[Filter] = {
query.queryExecution.optimizedPlan.collectFirst {
case PhysicalOperation(_, _,
DataSourceV2ScanRelation(_, scan: ParquetScan, _, _, _)) =>
ExtractV2Scan(scan: ParquetScan)) =>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this now be moved to the same line?

@aokolnychyi
Copy link
Copy Markdown
Contributor

I support this idea as well.

yyanyy added 4 commits April 2, 2026 21:40
…lation

`DataSourceV2ScanRelation` is a case class with 5 fields. Many pattern
match sites only need the `scan` field or a subset of `(relation, scan,
output)`, using wildcards for the rest. This couples every match site to
the constructor arity, so adding or removing fields requires updating
all of them.

This PR introduces two extractors following the `ExtractV2Table`
precedent (SPARK-53720):
- `ExtractV2Scan`: returns `Scan`
- `ExtractV2ScanRelation`: returns `(DataSourceV2Relation, Scan,
  Seq[AttributeReference])`

Updated 14 pattern match sites across 10 files. Type-based matches
and constructor calls are unchanged.
…ractV2ScanRelation

- Remove scaladoc comments from ExtractV2Scan and ExtractV2ScanRelation
  for consistency with other extractors (ExtractV2Table, etc.)
- Move both extractors next to ExtractV2Table
- Rename ExtractV2ScanRelation -> ExtractV2Relation per cloud-fan's suggestion
…ntifier

Group extractors by the type they operate on:
- DataSourceV2Relation extractors: ExtractV2Table, ExtractV2CatalogAndIdentifier
- DataSourceV2ScanRelation extractors: ExtractV2Scan, ExtractV2Relation
@yyanyy yyanyy force-pushed the extract-v2-scan-relation-extractors branch from 5937f3b to d57ee7c Compare April 2, 2026 21:50
@cloud-fan
Copy link
Copy Markdown
Contributor

the failed test is unrelated, thanks, merging to master!

@cloud-fan cloud-fan closed this in af9c8b3 Apr 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants