feat(cudf): Add CudfEnforceSingleRow GPU operator #16920

Closed
perlitz wants to merge 36 commits into facebookincubator:main from perlitz:pr-enforce-single-row

Contributor

@perlitz perlitz commented Mar 25, 2026

Implements GPU version of EnforceSingleRow to maintain GPU pipeline continuity for scalar subqueries. Validates row count using GPU metadata without host↔device data transfer.

related to: #15772
closing: #16888

Performance Benchmarks (SF100, 5 iterations)

All queries show no significant performance difference between GPU and CPU implementations, which is expected for
this lightweight operator. The benefit is maintaining GPU pipeline continuity (avoiding GPU↔CPU transfers), not faster
execution of the check itself.

| Query | GPU mean±std | CPU mean±std | Diff | t-stat | 95% CI | Significant? |
|-------|--------------|--------------|------|--------|--------|--------------|
| Q6 (1 occ) | 1.738±0.048s | 1.736±0.034s | +0.1% | 0.076 | [-0.051, +0.055]s | NO |
| Q14 (3 occ) | 11.490±0.309s | 11.190±0.165s | +2.7% | 1.914 | [-0.013, +0.613]s | NO |
| Q44 (2 occ) | 7.294±0.358s | 7.102±0.135s | +2.7% | 1.121 | [-0.151, +0.535]s | NO |
| Q54 (2 occ) | 4.008±0.443s | 3.818±0.036s | +5.0% | 0.956 | [-0.208, +0.588]s | NO |
| Q58 (3 occ) | 3.806±0.123s | 3.750±0.053s | +1.5% | 0.936 | [-0.064, +0.176]s | NO |

Methodology: Welch t-test with 95% confidence intervals. "Not significant" means |t| < 2.0 (p ≥ 0.05), indicating
performance differences are within statistical noise.

Test environment: SF100 INT32 data (~43GB), local NVMe storage, NVIDIA RTX PRO 6000 Blackwell, 5 independent runs
per mode.
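
For readers who want to re-derive the t-stat column, here is a minimal C++ sketch of Welch's t statistic computed from the summary numbers in the table. The `Sample` type and the function names are illustrative assumptions, not part of this PR, and the sketch assumes the reported ± values are sample standard deviations over 5 runs.

```cpp
#include <cassert>
#include <cmath>

// Summary statistics for one benchmark mode (hypothetical helper type).
struct Sample {
  double mean;   // mean wall time in seconds
  double stddev; // sample standard deviation in seconds
  int n;         // number of independent runs
};

// Welch's t statistic for two samples with possibly unequal variances.
double welchT(const Sample& a, const Sample& b) {
  const double va = a.stddev * a.stddev / a.n;
  const double vb = b.stddev * b.stddev / b.n;
  return (a.mean - b.mean) / std::sqrt(va + vb);
}

// Welch-Satterthwaite approximation of the degrees of freedom.
double welchDof(const Sample& a, const Sample& b) {
  assert(a.n > 1 && b.n > 1);
  const double va = a.stddev * a.stddev / a.n;
  const double vb = b.stddev * b.stddev / b.n;
  return (va + vb) * (va + vb) /
      (va * va / (a.n - 1) + vb * vb / (b.n - 1));
}
```

For the Q6 row, `welchT({1.738, 0.048, 5}, {1.736, 0.034, 5})` evaluates to about 0.076, which agrees with the table.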

Implements GPU version of EnforceSingleRow to maintain GPU pipeline
continuity for scalar subqueries. Validates row count using GPU
metadata without host↔device data transfer.

meta-cla Bot commented Mar 25, 2026

Hi @perlitz!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!


netlify Bot commented Mar 25, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 2757c76
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69d52a9d0b3c2a0008266957

Collaborator

@devavret devavret left a comment

LGTM

const core::PlanNodePtr& planNode,
exec::DriverCtx* /*ctx*/) const override {
// Check if GPU EnforceSingleRow is enabled in config
if (!CudfConfig::getInstance().enableEnforceSingleRow) {
Collaborator

@devavret devavret Mar 25, 2026

You can just choose to not register the EnforceSingleRowAdapter based on the config instead of checking here

Contributor Author

Sure, I've moved the config check to registration time — if the feature is disabled, the adapter is simply never registered, and canRunOnGPU() no longer needs to know about config at all.
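
The idea of gating at registration time can be illustrated with a small, self-contained C++ sketch. Everything here (`ToyAdapter`, `ToyAdapterRegistry`, `registerToyAdapters`) is a hypothetical stand-in and not the actual Velox or cuDF registration API; only the flag name and `canRunOnGPU` come from the thread above.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for an operator adapter; not the Velox interface.
struct ToyAdapter {
  std::function<bool()> canRunOnGPU;
};

// Hypothetical registry keyed by plan node kind.
class ToyAdapterRegistry {
 public:
  void add(const std::string& nodeKind, ToyAdapter adapter) {
    adapters_[nodeKind] = std::move(adapter);
  }

  bool has(const std::string& nodeKind) const {
    return adapters_.count(nodeKind) > 0;
  }

 private:
  std::map<std::string, ToyAdapter> adapters_;
};

// The config flag is consulted once, when adapters are registered.
ToyAdapterRegistry registerToyAdapters(bool enableEnforceSingleRow) {
  ToyAdapterRegistry registry;
  if (enableEnforceSingleRow) {
    // The adapter itself carries no knowledge of the config.
    registry.add("EnforceSingleRow", ToyAdapter{[] { return true; }});
  }
  return registry;
}
```

With the flag off, the registry simply has no entry for the node kind, so the per-node check never has to look at the config.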

@meta-cla meta-cla Bot added the CLA Signed label Mar 25, 2026

meta-cla Bot commented Mar 25, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Collaborator

@karthikeyann karthikeyann left a comment

Nice work!
Well implemented, with good test coverage.

Comment thread velox/experimental/cudf/exec/CudfEnforceSingleRow.cpp Outdated
assertQueryFails(
PlanBuilder().values({largeRows}).enforceSingleRow().planNode(),
"Expected single row of input. Received 1000 rows.");
}
Collaborator

@karthikeyann karthikeyann Mar 26, 2026

Would adding a unit test case for 'zero columns in output type' make sense?

Contributor Author

Sure!
Added TEST_F(CudfEnforceSingleRowTest, zeroColumns).

Collaborator

cuDF tables with zero columns report num_rows()==0 regardless of the
actual row count, so the GPU pipeline cannot represent a 0-column
vector with rows. This edge case does not occur in real queries.

Where is the error thrown when zero-column rows are passed in? In CudfFromVelox?

Contributor Author

No error is thrown; the operator falls back to CPU for zero-column inputs.
CudfFromVelox converts the input to a cuDF table that reports num_rows() == 0, so the GPU path can't handle it correctly.
IIUC, cuDF tables with zero columns have no way to track row count (there is no column data to hold the rows),
so I've also removed the GPU operator test for this case, since the expected behavior is a fallback to CPU.

Collaborator

The cudf table is wrapped in a CudfVector, which carries its own row count, so the cudf side can represent a table with rows but no columns. Velox, however, usually does not handle zero-column tables; in Gluten we handle this case early, in the Java code.
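
The distinction discussed in this thread can be shown with a toy C++ model, under the assumption (taken from the comments above) that a table derives its row count from its columns while the wrapping vector stores the count separately. `ToyColumn`, `ToyTable`, and `ToyVector` are hypothetical names and do not mirror the real cuDF or CudfVector classes.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical column: just a list of values.
using ToyColumn = std::vector<int>;

// Hypothetical table that derives its row count from its first column,
// so a table with no columns can only report zero rows.
struct ToyTable {
  std::vector<ToyColumn> columns;

  std::size_t numRows() const {
    return columns.empty() ? 0 : columns.front().size();
  }
};

// Hypothetical wrapper that stores the row count next to the table,
// so the count survives even when there are no columns.
class ToyVector {
 public:
  ToyVector(ToyTable table, std::size_t size)
      : table_(std::move(table)), size_(size) {}

  std::size_t size() const { return size_; }
  const ToyTable& table() const { return table_; }

 private:
  ToyTable table_;
  std::size_t size_;
};
```

In this model, a `ToyVector` built from an empty `ToyTable` with size 1 reports one row through `size()` and zero rows through `table().numRows()`, which is why reading the count from the wrapper is the safer choice for a row-count check.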

Comment thread velox/experimental/cudf/exec/CudfEnforceSingleRow.h Outdated
Comment thread velox/experimental/cudf/exec/CudfEnforceSingleRow.cpp Outdated
Comment thread velox/experimental/cudf/tests/EnforceSingleRowTest.cpp Outdated
* This is a pass-through operator that performs validation on GPU metadata
* (row count) without transferring data between host and device.
*/
class CudfEnforceSingleRow : public exec::Operator, public CudfOperator {
Collaborator

Question for @devavret :
Should we inherit all cudf operators from CudfOperator or NvtxHelper?
Old operators still inherit from NvtxHelper, while recent new operators inherit from CudfOperator

Collaborator

We've been discussing this in this issue: #16885

Comment thread velox/experimental/cudf/tests/EnforceSingleRowTest.cpp Outdated
@karthikeyann karthikeyann added the cudf label Mar 26, 2026
perlitz and others added 15 commits March 26, 2026 09:59
…based on config

Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
…ence in noMoreInput

Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
…ling

Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
…SingleRow tests

Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
… validation

Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
…rity

Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
Comment thread velox/experimental/cudf/exec/ToCudf.cpp Outdated
Comment thread velox/experimental/cudf/exec/CudfEnforceSingleRow.cpp Outdated
Comment thread velox/experimental/cudf/exec/CudfEnforceSingleRow.cpp Outdated
Comment thread velox/experimental/cudf/exec/CudfEnforceSingleRow.cpp Outdated
// We have not seen any data. Return a single row of all nulls.
// Create a CPU-side null row and convert to GPU
auto nullRow = BaseVector::create<RowVector>(outputType_, 1, pool());
for (auto& child : nullRow->children()) {
Collaborator

Can we return the input_ here? In Velox, the children may be larger than the RowVector after operations such as slice; does a cudf table have a similar special case? If not, we don't need the conversion.

Besides, even if we need to resize the children, I assume the cudf table also has an API to resize.
CC @karthikeyann @devavret

Collaborator

Can we return the input_ here?

I think the operator is expected to return a single null row instead of input_ which is nullptr.
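
The expected behaviour can be summarised with a small CPU-only C++ model: more than one row is an error, and no input at all yields a single all-null row. This is a toy sketch, not the operator's code; `ToyRow`, `ToyEnforceSingleRow`, and `finish` are hypothetical names, and only the error text is taken from the test quoted earlier in this PR.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// One output row; std::nullopt models a null value in a column.
using ToyRow = std::vector<std::optional<int>>;

class ToyEnforceSingleRow {
 public:
  explicit ToyEnforceSingleRow(std::size_t numColumns)
      : numColumns_(numColumns) {}

  // Accepts a batch; throws as soon as more than one row has been seen.
  void addInput(const std::vector<ToyRow>& batch) {
    rowsSeen_ += batch.size();
    if (rowsSeen_ > 1) {
      throw std::runtime_error(
          "Expected single row of input. Received " +
          std::to_string(rowsSeen_) + " rows.");
    }
    if (!batch.empty()) {
      row_ = batch.front();
    }
  }

  // Called after the last batch: returns the single row that was seen,
  // or one row of all nulls if no input arrived.
  ToyRow finish() const {
    return row_.value_or(ToyRow(numColumns_, std::nullopt));
  }

 private:
  std::size_t numColumns_;
  std::size_t rowsSeen_ = 0;
  std::optional<ToyRow> row_;
};
```

The model only needs the row count of each batch to decide whether to fail, which matches the PR's point that the check can run on metadata alone.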

Collaborator

I didn't fully understand this before. Is it possible to create a cudf table with 1 null row directly?

Contributor Author

@perlitz perlitz Mar 31, 2026

Yes, we can create a cudf table with 1 null row directly on the GPU. Updated to use cudf::make_default_constructed_scalar + cudf::make_column_from_scalar per output column — this avoids the CPU→GPU conversion entirely.

It's the same pattern used in CudfHashJoin.cpp.

Should I do that?

Collaborator

I think both of these are too small to worry about. You could also make empty cudf columns using column factories. But considering how little runtime impact this operator has, building from a Velox RowVector is fine too.

Contributor Author

I looked into creating the null row directly on GPU, but the factory approach (make_numeric_column etc.) would need a switch per type category. The scalar approach (make_default_constructed_scalar + make_column_from_scalar) works for all types but is two steps per column.
Building from a Velox RowVector is the simplest — I'll keep that.

@jinchengchenghh are you on board with that?

Collaborator

OK, this is a small issue. Let us land this PR first and let others optimize it if they are interested.

Collaborator

This would make the logic cleaner, but without any actual performance gain.

Contributor Author

Great

perlitz added 3 commits March 31, 2026 13:17
The operator is always enabled — no need for a separate config flag.
Simple operator does not need detailed per-method log statements.
…ount

Call cudfInput->size() instead of cudfInput->getTableView().num_rows().
planNode->id(),
"CudfEnforceSingleRow"),
CudfOperator(operatorId, planNode->id()) {
isIdentityProjection_ = true;
Collaborator

You don't need to set isIdentityProjection_, cudf does not use this member

Contributor Author

Done

Collaborator

cuDF operators still set it (e.g. CudfLimit, CudfFilterProject); it's good to follow the conventions that Velox already follows.
Although isIdentityProjection_ is not used elsewhere, other similar variables could be useful. For example, this will be useful when doing dynamic filter pushdowns.

…ingleRow

cudf operators do not use this member.
@perlitz perlitz force-pushed the pr-enforce-single-row branch from 5455fe9 to 12eabe5 Compare March 31, 2026 15:29
Comment thread velox/experimental/cudf/exec/CudfEnforceSingleRow.cpp
Co-authored-by: Shruti Shivakumar <shruti.shivakumar@gmail.com>
Comment thread velox/experimental/cudf/exec/CudfEnforceSingleRow.cpp Outdated

github-actions Bot commented Apr 2, 2026

Build Impact Analysis

Full build recommended. Files outside the dependency graph changed:

  • velox/experimental/cudf/CudfConfig.h
  • velox/experimental/cudf/exec/CMakeLists.txt
  • velox/experimental/cudf/exec/CudfEnforceSingleRow.cpp
  • velox/experimental/cudf/exec/CudfEnforceSingleRow.h
  • velox/experimental/cudf/exec/OperatorAdapters.cpp
  • velox/experimental/cudf/tests/CMakeLists.txt
  • velox/experimental/cudf/tests/EnforceSingleRowTest.cpp

These directories are not fully covered by the dependency graph. A full build is the safest option.

cmake --build _build/release

Slow path • Graph generated from PR branch

@jinchengchenghh
Collaborator

Please fix the code style in CI

@shrshi shrshi added the ready-to-merge label Apr 8, 2026

meta-codesync Bot commented Apr 11, 2026

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this in D100439821.


meta-codesync Bot commented Apr 11, 2026

@xiaoxmeng merged this pull request in cc5af43.

shrshi pushed a commit to patdevinwilson/velox that referenced this pull request Apr 13, 2026
…16920)

Summary:
Implements GPU version of EnforceSingleRow to maintain GPU pipeline continuity for scalar subqueries. Validates row count using GPU metadata without host↔device data transfer.

related to: facebookincubator#15772
closing: facebookincubator#16888

### Performance Benchmarks (SF100, 5 iterations)

  All queries show **no significant performance difference** between GPU and CPU implementations, which is expected for
  this lightweight operator. The benefit is maintaining GPU pipeline continuity (avoiding GPU↔CPU transfers), not faster
  execution of the check itself.

  | Query | GPU mean±std | CPU mean±std | Diff | t-stat | 95% CI | Significant? |
  |-------|--------------|--------------|------|--------|--------|--------------|
  | Q6 (1 occ) | 1.738±0.048s | 1.736±0.034s | +0.1% | 0.076 | [-0.051, +0.055]s | NO |
  | Q14 (3 occ) | 11.490±0.309s | 11.190±0.165s | +2.7% | 1.914 | [-0.013, +0.613]s | NO |
  | Q44 (2 occ) | 7.294±0.358s | 7.102±0.135s | +2.7% | 1.121 | [-0.151, +0.535]s | NO |
  | Q54 (2 occ) | 4.008±0.443s | 3.818±0.036s | +5.0% | 0.956 | [-0.208, +0.588]s | NO |
  | Q58 (3 occ) | 3.806±0.123s | 3.750±0.053s | +1.5% | 0.936 | [-0.064, +0.176]s | NO |

  **Methodology**: Welch t-test with 95% confidence intervals. "Not significant" means |t| < 2.0 (p ≥ 0.05), indicating
  performance differences are within statistical noise.

  **Test environment**: SF100 INT32 data (~43GB), local NVMe storage, NVIDIA RTX PRO 6000 Blackwell, 5 independent runs
  per mode.

Pull Request resolved: facebookincubator#16920

Reviewed By: tanjialiang

Differential Revision: D100439821

Pulled By: xiaoxmeng

fbshipit-source-id: 1a982e43a4aca83bb025e2f181ddbc3544346aff

Labels

CLA Signed, cudf, Merged, ready-to-merge

5 participants