feat(cudf): Add CudfEnforceSingleRow GPU operator #16920
perlitz wants to merge 36 commits into facebookincubator:main from
Conversation
Implements GPU version of EnforceSingleRow to maintain GPU pipeline continuity for scalar subqueries. Validates row count using GPU metadata without host↔device data transfer.
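The check itself only needs batch-level row counts, never the column data. A minimal host-side sketch of the accumulate-and-fail logic (hypothetical class and names, not the actual Velox operator):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical sketch: accumulate the row count reported by each input
// batch's metadata and fail as soon as more than one row has been seen.
// No column data is touched, mirroring the metadata-only GPU check.
class SingleRowValidator {
 public:
  void addBatch(int64_t numRows) {
    totalRows_ += numRows;
    if (totalRows_ > 1) {
      throw std::runtime_error(
          "Expected single row of input. Received " +
          std::to_string(totalRows_) + " rows.");
    }
  }

  int64_t totalRows() const {
    return totalRows_;
  }

 private:
  int64_t totalRows_{0};
};
```

In the real operator the count comes from the GPU-resident batch's metadata, so the validation adds no host↔device traffic.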
Hi @perlitz! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
✅ Deploy Preview for meta-velox canceled.
```cpp
    const core::PlanNodePtr& planNode,
    exec::DriverCtx* /*ctx*/) const override {
  // Check if GPU EnforceSingleRow is enabled in config
  if (!CudfConfig::getInstance().enableEnforceSingleRow) {
```
You can just choose not to register the EnforceSingleRowAdapter based on the config, instead of checking here.
Sure, I've moved the config check to registration time — if the feature is disabled, the adapter is simply never registered, and canRunOnGPU() no longer needs to know about config at all.
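The registration-time gating could be sketched roughly like this (all names here are hypothetical; the real code registers a Velox driver adapter, not a string factory):

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical config flag, standing in for CudfConfig.
struct Config {
  bool enableEnforceSingleRow{true};
};

using AdapterFactory = std::function<std::string()>;

// Hypothetical global adapter registry.
std::unordered_map<std::string, AdapterFactory>& registry() {
  static std::unordered_map<std::string, AdapterFactory> r;
  return r;
}

// Gate at registration time: when the flag is off, the adapter is simply
// never registered, so the operator itself never consults the config.
void registerAdapters(const Config& config) {
  if (config.enableEnforceSingleRow) {
    registry()["EnforceSingleRow"] = [] { return std::string("gpu"); };
  }
}
```

With this shape, canRunOnGPU() stays config-free; a disabled feature is indistinguishable from an unimplemented one.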
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
karthikeyann left a comment:
Nice work! Well implemented, and good test coverage.
```cpp
  assertQueryFails(
      PlanBuilder().values({largeRows}).enforceSingleRow().planNode(),
      "Expected single row of input. Received 1000 rows.");
}
```
Would adding a unit test case for 'zero columns in output type' make sense?

Sure! Added TEST_F(CudfEnforceSingleRowTest, zeroColumns).
cuDF tables with zero columns report num_rows() == 0 regardless of the actual row count, so the GPU pipeline cannot represent a 0-column vector with rows. This edge case does not occur in real queries.
Where is the error thrown when trying zero-column rows? CudfFromVelox?
No error is thrown; the operator falls back to CPU for zero-column inputs. CudfFromVelox converts the input to a cuDF table that reports num_rows() == 0, so the GPU path can't handle it correctly. IIUC, cuDF tables with zero columns have no way to track the row count (there is no column data to hold the rows), so I've also removed the test for the GPU operator, as the expected behavior for this case is fallback to CPU.
The cudf table is wrapped in CudfVector, which has a row count, so it is OK for cudf to represent a table with no columns and a row count. In Velox, however, no-column tables are usually not handled; in Gluten, we handle this early in the Java code.
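The mismatch is easy to model without cudf: when the row count lives only in the columns, a zero-column table has nowhere to store it, which is why a wrapper carrying an explicit size (as CudfVector does) can still represent such a table. A minimal illustration with made-up types:

```cpp
#include <cstddef>
#include <vector>

// Modeled the way cudf::table_view behaves: the row count is derived
// from the columns, so a zero-column table can only report 0 rows.
struct TableModel {
  std::vector<std::vector<int>> columns;
  std::size_t numRows() const {
    return columns.empty() ? 0 : columns.front().size();
  }
};

// A wrapper carrying an explicit size, analogous to CudfVector: it can
// still represent "5 rows, 0 columns" even though the table cannot.
struct SizedTable {
  TableModel table;
  std::size_t size{0};
};
```

This is only a host-side model of the behavior described above, not the cudf API itself.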
```cpp
 * This is a pass-through operator that performs validation on GPU metadata
 * (row count) without transferring data between host and device.
 */
class CudfEnforceSingleRow : public exec::Operator, public CudfOperator {
```
Question for @devavret: should we inherit all cudf operators from CudfOperator or NvtxHelper? Old operators still inherit from NvtxHelper, while recent new operators inherit from CudfOperator.
We've been discussing this in this issue: #16885
```cpp
  // We have not seen any data. Return a single row of all nulls.
  // Create a CPU-side null row and convert to GPU
  auto nullRow = BaseVector::create<RowVector>(outputType_, 1, pool());
  for (auto& child : nullRow->children()) {
```
Can we return the input_ here? For Velox, the children size may be bigger than the RowVector size after some operations such as slice; does the cudf table have the same special case? If not, we don't need the conversion. Besides, even if we need to resize the children, I assume the cudf table also has an API to resize.
CC @karthikeyann @devavret
> Can we return the input_ here?

I think the operator is expected to return a single null row instead of input_, which is nullptr.
I didn't fully understand this before: is it possible to create a cudf table with 1 null row directly?
Yes, we can create a cudf table with 1 null row directly on the GPU. Updated to use cudf::make_default_constructed_scalar + cudf::make_column_from_scalar per output column; this avoids the CPU→GPU conversion entirely. It's the same pattern used in CudfHashJoin.cpp. Should I do that?
I think both of these are too small to worry about. You could also make empty cudf columns using column factories. But considering how little runtime impact this operator has, building from a Velox RowVector is fine too.
I looked into creating the null row directly on the GPU, but the factory approach (make_numeric_column etc.) would need a switch per type category. The scalar approach (make_default_constructed_scalar + make_column_from_scalar) works for all types but is two steps per column. Building from a Velox RowVector is the simplest, so I'll keep that.
@jinchengchenghh, are you on board with that?
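For reference, the two-step scalar pattern being discussed can be modeled on the host, with std::optional standing in for a cudf scalar. This is only a sketch of the shape of the approach under discussion; all names below are illustrative, not the cudf API:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Host-side model of the two-step null-row pattern:
// step 1: make a "default-constructed" (null) scalar per column,
// step 2: broadcast it into a column of the requested length.
using Scalar = std::optional<int>;  // nullopt stands in for a null scalar
using Column = std::vector<Scalar>;

Scalar makeDefaultConstructedScalar() {
  return std::nullopt;  // a null scalar
}

Column makeColumnFromScalar(const Scalar& s, std::size_t numRows) {
  return Column(numRows, s);  // broadcast the scalar to numRows rows
}

// Build a one-row, all-null "table" with the given number of columns.
std::vector<Column> makeNullRow(std::size_t numColumns) {
  std::vector<Column> table;
  for (std::size_t i = 0; i < numColumns; ++i) {
    table.push_back(
        makeColumnFromScalar(makeDefaultConstructedScalar(), 1));
  }
  return table;
}
```

The cudf version follows the same two steps per output column, but allocates directly on the device.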
OK, this is a small issue; let us land this PR first and let others optimize it if they are interested.
This will make the logic cleaner, but actually without performance gain.
```cpp
        planNode->id(),
        "CudfEnforceSingleRow"),
    CudfOperator(operatorId, planNode->id()) {
  isIdentityProjection_ = true;
```
You don't need to set isIdentityProjection_; cudf does not use this member.
Cudf operators still set it (e.g. CudfLimit, CudfFilterProject); it's good to follow the conventions Velox already follows. Though isIdentityProjection_ is not used elsewhere, other similar variables could be useful; for example, this will be useful when doing dynamic filter pushdowns.
Build Impact Analysis: Full build recommended. Files outside the dependency graph changed. These directories are not fully covered by the dependency graph; a full build is the safest option.
Please fix the code style in CI.
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this in D100439821.
@xiaoxmeng merged this pull request in cc5af43. |
…16920)

Summary:
Implements GPU version of EnforceSingleRow to maintain GPU pipeline continuity for scalar subqueries. Validates row count using GPU metadata without host↔device data transfer.

related to: facebookincubator#15772
closing: facebookincubator#16888

### Performance Benchmarks (SF100, 5 iterations)

All queries show **no significant performance difference** between GPU and CPU implementations, which is expected for this lightweight operator. The benefit is maintaining GPU pipeline continuity (avoiding GPU↔CPU transfers), not faster execution of the check itself.

| Query | GPU mean±std | CPU mean±std | Diff | t-stat | 95% CI | Significant? |
|-------|--------------|--------------|------|--------|--------|--------------|
| Q6 (1 occ) | 1.738±0.048s | 1.736±0.034s | +0.1% | 0.076 | [-0.051, +0.055]s | NO |
| Q14 (3 occ) | 11.490±0.309s | 11.190±0.165s | +2.7% | 1.914 | [-0.013, +0.613]s | NO |
| Q44 (2 occ) | 7.294±0.358s | 7.102±0.135s | +2.7% | 1.121 | [-0.151, +0.535]s | NO |
| Q54 (2 occ) | 4.008±0.443s | 3.818±0.036s | +5.0% | 0.956 | [-0.208, +0.588]s | NO |
| Q58 (3 occ) | 3.806±0.123s | 3.750±0.053s | +1.5% | 0.936 | [-0.064, +0.176]s | NO |

**Methodology**: Welch t-test with 95% confidence intervals. "Not significant" means |t| < 2.0 (p ≥ 0.05), indicating performance differences are within statistical noise.

**Test environment**: SF100 INT32 data (~43GB), local NVMe storage, NVIDIA RTX PRO 6000 Blackwell, 5 independent runs per mode.

Pull Request resolved: facebookincubator#16920
Reviewed By: tanjialiang
Differential Revision: D100439821
Pulled By: xiaoxmeng
fbshipit-source-id: 1a982e43a4aca83bb025e2f181ddbc3544346aff
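As a sanity check, the Welch t-statistics reported above can be reproduced from the stated means and standard deviations with n = 5 per mode:

```cpp
#include <cmath>

// Welch's t-statistic for two samples summarized by mean, stddev, and
// sample size: t = (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2).
double welchT(
    double m1, double s1, double n1,
    double m2, double s2, double n2) {
  return (m1 - m2) / std::sqrt(s1 * s1 / n1 + s2 * s2 / n2);
}
```

For Q6, welchT(1.738, 0.048, 5, 1.736, 0.034, 5) comes out to about 0.076, matching the table.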