Remove pinned host memory from barrier solver by rg20 · Pull Request #1321 · NVIDIA/cuopt

rg20 · 2026-05-28T19:52:08Z

Replace all pinned_dense_vector_t members in iteration_data_t with plain dense_vector_t, eliminating CPU<->GPU synchronization overhead from page-locked memory allocation. Removes 169 net lines.

Vectors removed (pinned -> plain or deleted entirely):

- 10 direction vectors (dw_aff, dx_aff, dy_aff, dv_aff, dz_aff and their corrector counterparts)
- 5 RHS vectors (primal_rhs, bound_rhs, dual_rhs, complementarity_xz_rhs, complementarity_wv_rhs)
- 5 residual vectors (primal_residual, bound_residual, dual_residual, complementarity_xz_residual, complementarity_wv_residual)
- diag, inv_diag, inv_sqrt_diag (CPU-only, converted to dense_vector_t)
- c, b (constants, converted; permanent d_b_ added to avoid per-iteration device_copy in compute_primal_dual_objective)
- restrict_u_ (converted; permanent d_restrict_u_ added, copied once)
- w, x, y, v, z, upper_bounds (state vectors, converted)

Also removes the CPU compute_residuals function entirely (replaced by gpu_compute_residuals path) and simplifies gpu_compute_search_direction signature by removing unused pinned vector parameters.

Validated on 179 benchmark problems (portfolio/maros/qplib): identical results vs baseline under --cudss-deterministic true.
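
The idea behind the permanent d_b_ buffer (upload the constant once, then read the resident copy on every iteration) can be sketched on the host as follows. This is only an illustration under stated assumptions: `mock_device_buffer`, `objective_per_iteration_copy`, and `objective_persistent` are hypothetical names that do not appear in cuOpt, a `std::vector` stands in for device memory, and a plain dot product stands in for the objective term.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a device buffer: a std::vector plus a counter
// that records how many "host-to-device" uploads were performed.
struct mock_device_buffer {
  std::vector<double> storage;
  std::size_t upload_count = 0;

  void upload(const std::vector<double>& host)
  {
    storage = host;
    ++upload_count;
  }
};

// Dot product of two equally sized vectors.
inline double dot(const std::vector<double>& a, const std::vector<double>& b)
{
  return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

// Before: the constant vector b is uploaded again on every call.
inline double objective_per_iteration_copy(const std::vector<double>& host_b,
                                           const std::vector<double>& y,
                                           mock_device_buffer& scratch)
{
  scratch.upload(host_b);
  return dot(scratch.storage, y);
}

// After: b was uploaded once at setup; each call only reads the resident copy.
inline double objective_persistent(const mock_device_buffer& d_b,
                                   const std::vector<double>& y)
{
  return dot(d_b.storage, y);
}
```

Calling `objective_per_iteration_copy` in a loop increments `upload_count` once per iteration, whereas the persistent variant keeps it at one after a single `upload` at setup. The real change concerns device memory and CUDA streams, which this host-only sketch does not model.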

Description

Issue

Checklist

I am familiar with the Contributing Guidelines.
Testing
- New or existing tests cover these changes
- Added tests
- Created an issue to follow-up
- NA
Documentation
- The documentation is up to date with these changes
- Added new documentation
- NA

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

copy-pr-bot · 2026-05-28T19:52:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuwenchen95 · 2026-05-29T07:43:07Z

Would this be with release/26.06 or postponed to the next release?

coderabbitai · 2026-06-03T15:10:39Z

Caution

Review failed

Failed to post review comments

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Iteration state, RHS/residual assembly, objective dot-products, and search-direction work moved to device-resident buffers; gpu_compute_search_direction now allocates device work internally, and host copies occur only at explicit synchronized snapshot/solution-export points. The PR also adds kernel-launch wrappers, large test refactors to inline-LP helpers, benchmark reporting, CI/workflow/package bumps, and many skill/eval docs and signatures.

Changes

GPU-Resident Barrier Solver & Repo-wide updates

| Layer / File(s) | Summary |
| --- | --- |
| Barrier data and init (device buffers) `cpp/src/barrier/barrier.cu`, `cpp/src/barrier/barrier.hpp` | Iteration storage and many work vectors converted to device-first storage; `d_b_`, `d_restrict_u_` added; c/b uploaded to device in constructor; pinned-host allocator include moved/removed. |
| Residuals and RHS device-first `cpp/src/barrier/barrier.cu` | Initial primal/dual checks, gpu_compute_residuals, and complementarity assembly done on-device into `d_*` buffers; host staging removed. |
| Search-direction & ADAT changes `cpp/src/barrier/barrier.cu` | `gpu_compute_search_direction` no longer accepts pinned-host direction outputs; it allocates/resizes internal device work buffers and keeps solves on device; ADAT path copies `inv_diag` back when host ADAT is used. |
| Affine/corrector RHS on device `cpp/src/barrier/barrier.cu` | `compute_affine_rhs` / `compute_cc_rhs` assemble/negate/zero RHS on device (`d_h_`, `d_dual_rhs_`, `d_bound_rhs_`, `d_dw_`) with thrust operations and device-to-device copies. |
| Iteration loop & exports `cpp/src/barrier/barrier.cu` | Initial iterate uploaded to GPU; objective dot-products use `d_b_`/`d_restrict_u_`; iteration uses device-only direction flow; snapshots and solution export paths explicitly synchronize device→host (`w/x/y/v/z`) before calling `to_solution`. `to_solution` signature updated (removed dual_residual). |
| Feasibility-jump kernels & launch wrappers `cpp/src/mip_heuristics/feasibility_jump/*` | Added typed `get_launch_dims_` and `launch_` wrappers; replaced raw <<<>>> / cooperative launches with wrapper API; extended explicit template instantiations. |
| Clique/presolve refactor `cpp/src/mip_heuristics/presolve/conflict_graph/*` | CSR-based var→clique maps, small-clique adjacency finalization, simplified extend-phase with work budgets; several API/constness and signature changes. |
| Presolve bounds update gating `cpp/src/mip_heuristics/presolve/*` | Introduced `candidate_bound_scale`, added device helper `accept_candidate_bound_update`, and gated bound writes by scaled absolute-tolerance. |
| Diving hyper-params and B&B wiring `cpp/include/cuopt/linear_programming/mip/`, `cpp/src/branch_and_bound/` | New templated diving hyper-params type, solver settings extended, is_search_strategy_enabled/feasible symbol helpers, B&B wiring to publish benchmark metrics and use diving params. |
| Benchmarks & reporting `benchmarks/linear_programming/cuopt/*` | Added MIPLIB2017 BKS lookup and per-instance gap reporting; run_mip now prints gap-closed-to-BKS line. |
| Tests: inline-LP helpers `cpp/tests/*`, `cpp/tests/utilities/inline_lp_test_utils.hpp` | Many unit/integration tests refactored to construct problems from inline LP strings via new parse_inline_lp test helper. |
| CI, packaging, and manifests `.github/workflows/`, `ci/`, `dependencies.yaml`, `python/*`, `VERSION`, manifests | Switched reusable workflow refs to `rapidsai/shared-workflows@main`, bumped many RAPIDS package pins to 26.8.*, updated container image tags, added RAPIDS_PACKAGE_NAME exports in CI scripts, and bumped plugin/manifest versions. |
| Skill docs, evals, sigs `skills/**`, `.claude-plugin/marketplace.json` | Added/renamed many cuopt-* skill docs, benchmarks, eval datasets, skill cards, and Sigstore DSSE signature bundles; added `SECURITY.md` and CODEOWNERS entry for it. |
| Minor API/robustness fixes `cpp/src/pdlp/cuopt_c.cpp`, `cpp/src/utilities/manual_cuda_graph.cuh`, others | Argument validation in `cuOptGetErrorString`, improved manual CUDA graph capture recovery, OMP include guard tweaks, and other robustness changes. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

NVIDIA/cuopt#1408: Overlapping doc changes for dual-values guidance (LP/QP) and related API documentation.

Suggested reviewers

akifcorduk
Iroy30
KyleFromNVIDIA

coderabbitai

🧹 Nitpick comments (2)

cpp/src/barrier/barrier.cu (2)

1386-1388: 💤 Low value

Verify dense-columns path synchronization is correct.

The D2H copy of inv_diag followed by synchronize() before host-side solve_adat is necessary for correctness when n_dense_columns > 0. The host solve uses the current device-computed inv_diag values.

However, consider using RAFT_CUDA_TRY wrapper for consistency with other sync points in this file.

Suggested change for consistency

       raft::copy(inv_diag.data(), d_inv_diag.data(), d_inv_diag.size(), stream_view_);
-      stream_view_.synchronize();
+      RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/barrier/barrier.cu` around lines 1386 - 1388, The D2H copy of
inv_diag currently uses raft::copy(...) followed by stream_view_.synchronize();
for correctness when n_dense_columns > 0—wrap the CUDA synchronization in the
RAFT_CUDA_TRY macro for consistency with other sync points (i.e., ensure the
raft::copy and the subsequent stream_view_.synchronize() call are protected by
RAFT_CUDA_TRY) so device errors are checked before proceeding to host-side work
such as host_copy(...) and the host solve (solve_adat).

2471-2472: 💤 Low value

Minor performance: prefer D2D copy from d_b_ instead of H2D from lp.rhs.

Since d_b_ is already a permanent device copy of lp.rhs (copied once at construction, line 351), using it as the source avoids an H2D transfer each iteration.

Suggested optimization

   data.d_primal_residual_.resize(lp.num_rows, stream_view_);
-  raft::copy(data.d_primal_residual_.data(), lp.rhs.data(), lp.rhs.size(), stream_view_);
+  raft::copy(data.d_primal_residual_.data(), data.d_b_.data(), data.d_b_.size(), stream_view_);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/barrier/barrier.cu` around lines 2471 - 2472, Replace the
host-to-device copy from lp.rhs with a device-to-device copy from the existing
device buffer d_b_; specifically, change the raft::copy call that writes into
data.d_primal_residual_ (currently using lp.rhs.data()) to use d_b_.data() (or
the appropriate device pointer named d_b_) and keep the size and stream_view_
unchanged so the transfer is D2D instead of H2D.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: de9a0363-49f9-4235-9a1f-39c929d3f62a

📥 Commits

Reviewing files that changed from the base of the PR and between 3fba293 and d0957f6.

📒 Files selected for processing (2)

cpp/src/barrier/barrier.cu
cpp/src/barrier/barrier.hpp

💤 Files with no reviewable changes (1)

cpp/src/barrier/barrier.hpp

yuwenchen95 · 2026-06-04T10:12:55Z

  // Verify A*x = b
-  data.primal_residual = lp.rhs;
-  data.cusparse_view_.spmv(1.0, data.x, -1.0, data.primal_residual);
+  dense_vector_t<i_t, f_t> primal_residual(lp.num_rows);


Nit: rename it to init_primal_residual

Suggested change

dense_vector_t<i_t, f_t> primal_residual(lp.num_rows);

dense_vector_t<i_t, f_t> init_primal_residual(lp.num_rows);

yuwenchen95

Some resize calls still appear in the main while loop of the barrier method. Since the dimensions of these vectors are unchanged once set up, it's better to group all resize operations at the beginning of the barrier method.
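
A minimal host-side sketch of that suggestion, assuming a hypothetical `ipm_workspace` type (not cuOpt's iteration_data_t) with `std::vector` members in place of device vectors: all sizing happens in one `allocate` call before the loop, and the loop body only writes into storage that already exists.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical workspace; the member names are illustrative, not cuOpt's.
struct ipm_workspace {
  std::vector<double> primal_residual;
  std::vector<double> dual_residual;
  std::size_t allocation_calls = 0;

  // Size every work vector once, before the iteration loop starts.
  void allocate(std::size_t num_rows, std::size_t num_cols)
  {
    primal_residual.assign(num_rows, 0.0);
    dual_residual.assign(num_cols, 0.0);
    ++allocation_calls;
  }
};

// Placeholder loop body: it only writes into buffers that already exist,
// so no resize or allocation happens inside the loop.
inline void run_iterations(ipm_workspace& ws, int max_iterations)
{
  for (int iter = 0; iter < max_iterations; ++iter) {
    for (double& r : ws.primal_residual) { r = static_cast<double>(iter); }
    for (double& r : ws.dual_residual) { r = static_cast<double>(iter); }
  }
}
```

Typical use would be `ws.allocate(num_rows, num_cols)` followed by `run_iterations(ws, max_iterations)`. Keeping sizing in one place also makes the expected dimensions of each buffer visible at a glance, which seems to be the readability point of the comment.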

yuwenchen95 · 2026-06-04T10:14:18Z

 #endif

  if (data.n_upper_bounds > 0) {
+    dense_vector_t<i_t, f_t> bound_residual(data.n_upper_bounds);


Add init_ prefix like above.

yuwenchen95 · 2026-06-04T10:16:45Z

-  data.z.pairwise_subtract(data.c, data.dual_residual);
-  if (data.Q.n > 0) { matrix_vector_multiply(data.Q, -1.0, data.x, 1.0, data.dual_residual); }
-  data.cusparse_view_.transpose_spmv(1.0, data.y, 1.0, data.dual_residual);
+  dense_vector_t<i_t, f_t> dual_residual(lp.num_cols);


Suggested change

dense_vector_t<i_t, f_t> dual_residual(lp.num_cols);

dense_vector_t<i_t, f_t> init_dual_residual(lp.num_cols);

yuwenchen95 · 2026-06-04T13:27:18Z


-  data.d_primal_residual_.resize(data.primal_residual.size(), stream_view_);
-  raft::copy(data.d_primal_residual_.data(), lp.rhs.data(), lp.rhs.size(), stream_view_);
+  data.d_primal_residual_.resize(lp.num_rows, stream_view_);


Nit: it would look clearer if we resized d_primal_residual_ and d_dual_residual_ only the first time we call it.

yuwenchen95 · 2026-06-04T13:28:38Z

                                  stream_view_.value());
  RAFT_CHECK_CUDA(stream_view_);
+  if (data.Q.n > 0) {
+    auto descr_dual_residual = data.cusparse_view_.create_vector(data.d_dual_residual_);


descr_dual_residual should be added into the initialization of data, instead of creating it every time.

create_vector does not have any overhead; it's just a wrapper.

yuwenchen95 · 2026-06-04T13:35:15Z

-  data.d_dy_.resize(dy.size(), stream_view_);
-  data.d_dz_.resize(dz.size(), stream_view_);
-  data.d_dv_.resize(dv.size(), stream_view_);
+  {


Logically, it is better to move the code block "Barrier: GPU allocation and copies" before the while loop of the IPM.

yuwenchen95 · 2026-06-04T13:47:18Z

+  // D2D: RHS = residuals (all on device)
  data.cone_combined_step_ = false;
  data.cone_sigma_mu_      = f_t(0);
+  raft::copy(


Better to resize d_bound_rhs_ and d_dw_ only once at the beginning of IPM.

Resize has no overhead if the size is same as before
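
As a host-side analogy for the reply above, `std::vector::resize` leaves a vector untouched when the requested size equals the current size. The helper below is a hypothetical illustration (`resize_keeps_storage` is not a cuOpt or RMM function), and it does not by itself show how the device vector type used in barrier.cu behaves.

```cpp
#include <cstddef>
#include <vector>

// Returns true when resizing `v` to `new_size` left its storage in place,
// i.e. no reallocation happened.
inline bool resize_keeps_storage(std::vector<double>& v, std::size_t new_size)
{
  const double* before = v.data();
  v.resize(new_size);
  return v.data() == before;
}
```

For a vector that already holds eight elements, `resize_keeps_storage(v, 8)` returns true and the contents are unchanged. Whether to hoist the call anyway is then a question of readability rather than cost.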

yuwenchen95 · 2026-06-04T14:22:08Z

-                     vector_norm2<i_t, f_t>(data.primal_residual),
-                     vector_norm2<i_t, f_t>(data.dual_residual),
+                     primal_residual_norm,
+                     dual_residual_norm,


dual_residual_norm is not used since to_solution recomputes the dual z and then the residual. We'd better not pass it into to_solution.

rg20 · 2026-06-11T19:22:16Z

/ok to test a824930

copy-pr-bot · 2026-06-11T19:23:32Z

/ok to test a824930

@rg20, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

rg20 · 2026-06-11T19:23:42Z

/ok to test ef22921

rg20 · 2026-06-12T13:25:26Z

/merge

rg20 requested a review from a team as a code owner May 28, 2026 19:52

rg20 requested review from akifcorduk and hlinsen May 28, 2026 19:52

rg20 marked this pull request as draft May 28, 2026 19:52

rg20 changed the base branch from main to release/26.06 May 29, 2026 15:17

rg20 added the improvement and non-breaking labels May 29, 2026

rg20 added this to the 26.06 milestone May 29, 2026

hlinsen added 2 commits June 2, 2026 07:26

Fix merge conflicts

d58270e

Use pinned mem only for inv diag

d0957f6

rg20 marked this pull request as ready for review June 3, 2026 15:04

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

chris-maes modified the milestones: 26.06, 26.08 Jun 3, 2026

hlinsen added 2 commits June 3, 2026 15:30

Merge branch 'release/26.06' of github.com:NVIDIA/cuopt into remove_pinned_memory

ba245ae

Avoid more copies

aa3ead9

yuwenchen95 reviewed Jun 4, 2026

View reviewed changes

hlinsen approved these changes Jun 9, 2026

View reviewed changes

rg20 added 2 commits June 9, 2026 12:09

Address PR review comments

daeee26

Revert wrong change

a824930

Merge remote-tracking branch 'upstream/main' into remove_pinned_memory

ef22921

rg20 requested review from a team as code owners June 11, 2026 19:22

rg20 requested a review from KyleFromNVIDIA June 11, 2026 19:22

rg20 requested a review from Iroy30 June 11, 2026 19:22

rg20 changed the base branch from release/26.06 to main June 11, 2026 19:23

rg20 requested review from chris-maes and removed request for a team, Iroy30, KyleFromNVIDIA and akifcorduk June 11, 2026 19:24

rapids-bot Bot merged commit 3dba505 into NVIDIA:main Jun 12, 2026
98 checks passed
