[SPARK-56273][SQL] Simplify extracting fields from DataSourceV2ScanRelation#55070
[SPARK-56273][SQL] Simplify extracting fields from DataSourceV2ScanRelation#55070yyanyy wants to merge 5 commits intoapache:masterfrom
Conversation
szehon-ho
left a comment
There was a problem hiding this comment.
nice idea, just a few comments
| * An extractor that returns a 3-tuple of ([[DataSourceV2Relation]], [[Scan]], | ||
| * output attributes) from a [[DataSourceV2ScanRelation]]. | ||
| * | ||
| * This is useful when a pattern match site needs the relation, scan, and output |
| * This is useful when a pattern match site only needs the scan and should not be | ||
| * coupled to the full constructor signature of [[DataSourceV2ScanRelation]]. | ||
| */ | ||
| object ExtractV2Scan { |
There was a problem hiding this comment.
nit: should we move it to the other Extractor for DSV2Relation below (like ExtractV2Table)
| /** | ||
| * An extractor that returns the [[Scan]] from a [[DataSourceV2ScanRelation]]. | ||
| * | ||
| * This is useful when a pattern match site only needs the scan and should not be |
There was a problem hiding this comment.
nit: should we remove this comment? (thinking, else why not put it on all the extractors?)
| */ | ||
| object ExtractV2ScanRelation { | ||
| def unapply(scanRelation: DataSourceV2ScanRelation) | ||
| : Option[(DataSourceV2Relation, Scan, Seq[AttributeReference])] = |
There was a problem hiding this comment.
ExtractV2Scan returns v2 scan, this returns DataSourceV2Relation, shall we name it ExtractV2Relation?
|
Thank you both for the quick review! I have updated the PR per comment, please take a look again when you have a chance |
| Some(scanRelation.scan) | ||
| } | ||
|
|
||
| object ExtractV2Relation { |
There was a problem hiding this comment.
We should still think a bit more about the name here. It does extract a relation but it does this in a scan context and the name doesn't say anything about that. I agree ExtractV2ScanRelation is also somewhat not accurate as we are not extracting it. I wonder if things like ExtractV2ScanInfo would be better.
Let me think. Name alternatives are welcome.
There was a problem hiding this comment.
+1 to change the name to indicate it extract more than relation (sorry didnt reply earlier to #55070 (comment)) ExtractV2ScanComponents is another one, but maybe Info is better as its shorter
There was a problem hiding this comment.
If we end up not extracting output, it could be something like: ExtractV2RelationAndScan.
The key ask: the name has to indicate it applies to scanning.
There was a problem hiding this comment.
nit: do we have to include Extract in the name? might be cleaner w/o it (it's already assumed it is an extractor..). V2RelationAndScan / V2Relation / V2ScanContext
There was a problem hiding this comment.
@anton5798, I had a similar question when we did extractors for DataSourceV2Relation but the consensus was that it is beneficial to have ExtractXXX explicitly.
There was a problem hiding this comment.
I'm sure ExtractV2ScanRelationInfo has crossed some of your mind but it's probably too long?
There was a problem hiding this comment.
I think I like ExtractV2ScanInfo the most and still include output, let me update the PR with this
|
|
||
| object ExtractV2Relation { | ||
| def unapply(scanRelation: DataSourceV2ScanRelation) | ||
| : Option[(DataSourceV2Relation, Scan, Seq[AttributeReference])] = |
There was a problem hiding this comment.
Is it common that we need output as well? Technically, you could do like this:
r @ ExtractV2Relation(_, scan: LocalScan) => r.output
There was a problem hiding this comment.
3 of the 6 instances that requires ExtractV2Relation would need output but all usages are currently in DataSourceV2Strategy so I guess we can consider it either way...
| @@ -258,7 +258,7 @@ class CollatedFilterPushDownToParquetV2Suite extends CollatedFilterPushDownToPar | |||
| override def getPushedDownFilters(query: DataFrame): Seq[Filter] = { | |||
| query.queryExecution.optimizedPlan.collectFirst { | |||
| case PhysicalOperation(_, _, | |||
| DataSourceV2ScanRelation(_, scan: ParquetScan, _, _, _)) => | |||
| ExtractV2Scan(scan: ParquetScan)) => | |||
There was a problem hiding this comment.
Can this now be moved to the same line?
|
I support this idea as well. |
…lation `DataSourceV2ScanRelation` is a case class with 5 fields. Many pattern match sites only need the `scan` field or a subset of `(relation, scan, output)`, using wildcards for the rest. This couples every match site to the constructor arity, so adding or removing fields requires updating all of them. This PR introduces two extractors following the `ExtractV2Table` precedent (SPARK-53720): - `ExtractV2Scan`: returns `Scan` - `ExtractV2ScanRelation`: returns `(DataSourceV2Relation, Scan, Seq[AttributeReference])` Updated 14 pattern match sites across 10 files. Type-based matches and constructor calls are unchanged.
…ractV2ScanRelation - Remove scaladoc comments from ExtractV2Scan and ExtractV2ScanRelation for consistency with other extractors (ExtractV2Table, etc.) - Move both extractors next to ExtractV2Table - Rename ExtractV2ScanRelation -> ExtractV2Relation per cloud-fan's suggestion
…ntifier Group extractors by the type they operate on: - DataSourceV2Relation extractors: ExtractV2Table, ExtractV2CatalogAndIdentifier - DataSourceV2ScanRelation extractors: ExtractV2Scan, ExtractV2Relation
…o, collapse single-line match
5937f3b to
d57ee7c
Compare
…+ OracleIntegrationSuite failures
|
the failed test is unrelated, thanks, merging to master! |
What changes were proposed in this pull request?
DataSourceV2ScanRelationis a case class with 5 fields. Many pattern match sites only need thescanfield or a subset of(relation, scan, output), using wildcards for the rest. This couples every match site to the constructor arity, so adding or removing fields requires updating all of them.This PR introduces two extractors following the
ExtractV2Tableprecedent (SPARK-53720):ExtractV2Scan: returnsScanExtractV2ScanRelation: returns(DataSourceV2Relation, Scan, Seq[AttributeReference])Updated 14 pattern match sites across 10 files. Type-based matches and constructor calls are unchanged.
Why are the changes needed?
Code simplification. Similar to 6bae835
Does this PR introduce any user-facing change?
no
How was this patch tested?
This PR relies on existing tests.
Was this patch authored or co-authored using generative AI tooling?
Yes - Opus 4.6