feat(cudf): Add CudfEnforceSingleRow GPU operator #16920
perlitz wants to merge 36 commits into facebookincubator:main from
Conversation
Implements GPU version of EnforceSingleRow to maintain GPU pipeline continuity for scalar subqueries. Validates row count using GPU metadata without host↔device data transfer.
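The check itself only needs batch-level row counts, never the column data. A minimal host-side sketch of the accumulate-and-fail logic (hypothetical class and names, not the actual Velox operator):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical sketch: accumulate the row count reported by each input
// batch's metadata and fail as soon as more than one row has been seen.
// No column data is touched, mirroring the metadata-only GPU check.
class SingleRowValidator {
 public:
  void addBatch(int64_t numRows) {
    totalRows_ += numRows;
    if (totalRows_ > 1) {
      throw std::runtime_error(
          "Expected single row of input. Received " +
          std::to_string(totalRows_) + " rows.");
    }
  }

  int64_t totalRows() const {
    return totalRows_;
  }

 private:
  int64_t totalRows_{0};
};
```

In the real operator the count comes from the GPU-resident batch's metadata, so the validation adds no host↔device traffic.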
Hi @perlitz! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
✅ Deploy Preview for meta-velox canceled.
```cpp
    const core::PlanNodePtr& planNode,
    exec::DriverCtx* /*ctx*/) const override {
  // Check if GPU EnforceSingleRow is enabled in config
  if (!CudfConfig::getInstance().enableEnforceSingleRow) {
```
You can just choose not to register the EnforceSingleRowAdapter based on the config, instead of checking here.
Sure, I've moved the config check to registration time — if the feature is disabled, the adapter is simply never registered, and canRunOnGPU() no longer needs to know about config at all.
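The registration-time gating could be sketched roughly like this (all names here are hypothetical; the real code registers a Velox driver adapter, not a string factory):

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical config flag, standing in for CudfConfig.
struct Config {
  bool enableEnforceSingleRow{true};
};

using AdapterFactory = std::function<std::string()>;

// Hypothetical global adapter registry.
std::unordered_map<std::string, AdapterFactory>& registry() {
  static std::unordered_map<std::string, AdapterFactory> r;
  return r;
}

// Gate at registration time: when the flag is off, the adapter is simply
// never registered, so the operator itself never consults the config.
void registerAdapters(const Config& config) {
  if (config.enableEnforceSingleRow) {
    registry()["EnforceSingleRow"] = [] { return std::string("gpu"); };
  }
}
```

With this shape, canRunOnGPU() stays config-free; a disabled feature is indistinguishable from an unimplemented one.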
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
karthikeyann left a comment:
Nice work! Well implemented, and good test coverage.
```cpp
  assertQueryFails(
      PlanBuilder().values({largeRows}).enforceSingleRow().planNode(),
      "Expected single row of input. Received 1000 rows.");
}
```
Would adding a unit test case for 'zero columns in output type' make sense?

Sure! Added TEST_F(CudfEnforceSingleRowTest, zeroColumns).
cuDF tables with zero columns report num_rows() == 0 regardless of the actual row count, so the GPU pipeline cannot represent a 0-column vector with rows. This edge case does not occur in real queries.
Where is the error thrown when trying zero-column rows? CudfFromVelox?
No error is thrown; the operator falls back to CPU for zero-column inputs. CudfFromVelox converts the input to a cuDF table that reports num_rows() == 0, so the GPU path can't handle it correctly. IIUC, cuDF tables with zero columns have no way to track the row count (there is no column data to hold the rows), so I've also removed the test for the GPU operator, as the expected behavior for this case is fallback to CPU.
The cudf table is wrapped in CudfVector, which has a row count, so it is OK for cudf to represent a table with no columns and a row count. In Velox, however, no-column tables are usually not handled; in Gluten, we handle this early in the Java code.
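The mismatch is easy to model without cudf: when the row count lives only in the columns, a zero-column table has nowhere to store it, which is why a wrapper carrying an explicit size (as CudfVector does) can still represent such a table. A minimal illustration with made-up types:

```cpp
#include <cstddef>
#include <vector>

// Modeled the way cudf::table_view behaves: the row count is derived
// from the columns, so a zero-column table can only report 0 rows.
struct TableModel {
  std::vector<std::vector<int>> columns;
  std::size_t numRows() const {
    return columns.empty() ? 0 : columns.front().size();
  }
};

// A wrapper carrying an explicit size, analogous to CudfVector: it can
// still represent "5 rows, 0 columns" even though the table cannot.
struct SizedTable {
  TableModel table;
  std::size_t size{0};
};
```

This is only a host-side model of the behavior described above, not the cudf API itself.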
```cpp
 * This is a pass-through operator that performs validation on GPU metadata
 * (row count) without transferring data between host and device.
 */
class CudfEnforceSingleRow : public exec::Operator, public CudfOperator {
```
Question for @devavret: should we inherit all cudf operators from CudfOperator or NvtxHelper? Old operators still inherit from NvtxHelper, while recent new operators inherit from CudfOperator.
We've been discussing this in this issue: #16885
```cpp
  // We have not seen any data. Return a single row of all nulls.
  // Create a CPU-side null row and convert to GPU
  auto nullRow = BaseVector::create<RowVector>(outputType_, 1, pool());
  for (auto& child : nullRow->children()) {
```
Can we return the input_ here? For Velox, the children size may be bigger than the RowVector size after some operations such as slice; does the cudf table have the same special case? If not, we don't need the conversion. Besides, even if we need to resize the children, I assume the cudf table also has an API to resize.
CC @karthikeyann @devavret
> Can we return the input_ here?

I think the operator is expected to return a single null row instead of input_, which is nullptr.
I didn't fully understand this before: is it possible to create a cudf table with 1 null row directly?
Yes, we can create a cudf table with 1 null row directly on the GPU. Updated to use cudf::make_default_constructed_scalar + cudf::make_column_from_scalar per output column; this avoids the CPU→GPU conversion entirely. It's the same pattern used in CudfHashJoin.cpp. Should I do that?
I think both of these are too small to worry about. You could also make empty cudf columns using column factories. But considering how little runtime impact this operator has, building from a Velox RowVector is fine too.
I looked into creating the null row directly on the GPU, but the factory approach (make_numeric_column etc.) would need a switch per type category. The scalar approach (make_default_constructed_scalar + make_column_from_scalar) works for all types but is two steps per column. Building from a Velox RowVector is the simplest, so I'll keep that.
@jinchengchenghh, are you on board with that?
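For reference, the two-step scalar pattern being discussed can be modeled on the host, with std::optional standing in for a cudf scalar. This is only a sketch of the shape of the approach under discussion; all names below are illustrative, not the cudf API:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Host-side model of the two-step null-row pattern:
// step 1: make a "default-constructed" (null) scalar per column,
// step 2: broadcast it into a column of the requested length.
using Scalar = std::optional<int>;  // nullopt stands in for a null scalar
using Column = std::vector<Scalar>;

Scalar makeDefaultConstructedScalar() {
  return std::nullopt;  // a null scalar
}

Column makeColumnFromScalar(const Scalar& s, std::size_t numRows) {
  return Column(numRows, s);  // broadcast the scalar to numRows rows
}

// Build a one-row, all-null "table" with the given number of columns.
std::vector<Column> makeNullRow(std::size_t numColumns) {
  std::vector<Column> table;
  for (std::size_t i = 0; i < numColumns; ++i) {
    table.push_back(
        makeColumnFromScalar(makeDefaultConstructedScalar(), 1));
  }
  return table;
}
```

The cudf version follows the same two steps per output column, but allocates directly on the device.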
OK, this is a small issue; let us land this PR first and let others optimize it if they are interested.
This will make the logic cleaner, but actually without performance gain.
```cpp
        planNode->id(),
        "CudfEnforceSingleRow"),
    CudfOperator(operatorId, planNode->id()) {
  isIdentityProjection_ = true;
```
You don't need to set isIdentityProjection_; cudf does not use this member.
Cudf operators still set it (e.g. CudfLimit, CudfFilterProject); it's good to follow the conventions Velox already follows. Though isIdentityProjection_ is not used elsewhere, other similar variables could be useful; for example, this will be useful when doing dynamic filter pushdowns.
Build Impact Analysis: Full build recommended. Files outside the dependency graph changed. These directories are not fully covered by the dependency graph; a full build is the safest option.
Please fix the code style in CI.
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this in D100439821.
@xiaoxmeng merged this pull request in cc5af43. |
…16920)

Summary:
Implements GPU version of EnforceSingleRow to maintain GPU pipeline continuity for scalar subqueries. Validates row count using GPU metadata without host↔device data transfer.

related to: facebookincubator#15772
closing: facebookincubator#16888

### Performance Benchmarks (SF100, 5 iterations)

All queries show **no significant performance difference** between GPU and CPU implementations, which is expected for this lightweight operator. The benefit is maintaining GPU pipeline continuity (avoiding GPU↔CPU transfers), not faster execution of the check itself.

| Query | GPU mean±std | CPU mean±std | Diff | t-stat | 95% CI | Significant? |
|-------|--------------|--------------|------|--------|--------|--------------|
| Q6 (1 occ) | 1.738±0.048s | 1.736±0.034s | +0.1% | 0.076 | [-0.051, +0.055]s | NO |
| Q14 (3 occ) | 11.490±0.309s | 11.190±0.165s | +2.7% | 1.914 | [-0.013, +0.613]s | NO |
| Q44 (2 occ) | 7.294±0.358s | 7.102±0.135s | +2.7% | 1.121 | [-0.151, +0.535]s | NO |
| Q54 (2 occ) | 4.008±0.443s | 3.818±0.036s | +5.0% | 0.956 | [-0.208, +0.588]s | NO |
| Q58 (3 occ) | 3.806±0.123s | 3.750±0.053s | +1.5% | 0.936 | [-0.064, +0.176]s | NO |

**Methodology**: Welch t-test with 95% confidence intervals. "Not significant" means |t| < 2.0 (p ≥ 0.05), indicating performance differences are within statistical noise.

**Test environment**: SF100 INT32 data (~43GB), local NVMe storage, NVIDIA RTX PRO 6000 Blackwell, 5 independent runs per mode.

Pull Request resolved: facebookincubator#16920
Reviewed By: tanjialiang
Differential Revision: D100439821
Pulled By: xiaoxmeng
fbshipit-source-id: 1a982e43a4aca83bb025e2f181ddbc3544346aff
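As a sanity check, the Welch t-statistics reported above can be reproduced from the stated means and standard deviations with n = 5 per mode:

```cpp
#include <cmath>

// Welch's t-statistic for two samples summarized by mean, stddev, and
// sample size: t = (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2).
double welchT(
    double m1, double s1, double n1,
    double m2, double s2, double n2) {
  return (m1 - m2) / std::sqrt(s1 * s1 / n1 + s2 * s2 / n2);
}
```

For Q6, welchT(1.738, 0.048, 5, 1.736, 0.034, 5) comes out to about 0.076, matching the table.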