
Remove pinned host memory from barrier solver #1321

Merged
rapids-bot[bot] merged 8 commits into NVIDIA:main from rg20:remove_pinned_memory
Jun 12, 2026

Conversation

@rg20

@rg20 rg20 commented May 28, 2026

Contributor

Replace all pinned_dense_vector_t members in iteration_data_t with plain dense_vector_t, eliminating CPU<->GPU synchronization overhead from page-locked memory allocation. Removes 169 net lines.

Vectors removed (pinned -> plain or deleted entirely):

  • 10 direction vectors (dw_aff, dx_aff, dy_aff, dv_aff, dz_aff and their corrector counterparts)
  • 5 RHS vectors (primal_rhs, bound_rhs, dual_rhs, complementarity_xz_rhs, complementarity_wv_rhs)
  • 5 residual vectors (primal_residual, bound_residual, dual_residual, complementarity_xz_residual, complementarity_wv_residual)
  • diag, inv_diag, inv_sqrt_diag (CPU-only, converted to dense_vector_t)
  • c, b (constants, converted; permanent d_b_ added to avoid per-iteration device_copy in compute_primal_dual_objective)
  • restrict_u_ (converted; permanent d_restrict_u_ added, copied once)
  • w, x, y, v, z, upper_bounds (state vectors, converted)

Also removes the CPU compute_residuals function entirely (replaced by gpu_compute_residuals path) and simplifies gpu_compute_search_direction signature by removing unused pinned vector parameters.

Validated on 179 benchmark problems (portfolio/maros/qplib): identical results vs baseline under --cudss-deterministic true.

Description

Issue

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rg20 rg20 requested a review from a team as a code owner May 28, 2026 19:52
@rg20 rg20 requested review from akifcorduk and hlinsen May 28, 2026 19:52
@copy-pr-bot

copy-pr-bot Bot commented May 28, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rg20 rg20 marked this pull request as draft May 28, 2026 19:52
@yuwenchen95

Contributor

Would this be with release/26.06 or postponed to the next release?

@rg20 rg20 changed the base branch from main to release/26.06 May 29, 2026 15:17
@rg20 rg20 added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels May 29, 2026
@rg20 rg20 added this to the 26.06 milestone May 29, 2026
@rg20 rg20 marked this pull request as ready for review June 3, 2026 15:04
@coderabbitai

coderabbitai Bot commented Jun 3, 2026


Caution

Review failed

Failed to post review comments

Walkthrough

Iteration state, RHS/residual assembly, objective dot-products, and search-direction work moved to device-resident buffers; gpu_compute_search_direction now allocates device work internally, and host copies occur only at explicit synchronized snapshot/solution-export points. The PR also adds kernel-launch wrappers, large test refactors to inline-LP helpers, benchmark reporting, CI/workflow/package bumps, and many skill/eval docs and signatures.

Changes

GPU-Resident Barrier Solver & Repo-wide updates

| Layer | File(s) | Summary |
|---|---|---|
| Barrier data and init (device buffers) | `cpp/src/barrier/barrier.cu`, `cpp/src/barrier/barrier.hpp` | Iteration storage and many work vectors converted to device-first storage; d_b_, d_restrict_u_ added; c/b uploaded to device in constructor; pinned-host allocator include moved/removed. |
| Residuals and RHS device-first | `cpp/src/barrier/barrier.cu` | Initial primal/dual checks, gpu_compute_residuals, and complementarity assembly done on-device into d_* buffers; host staging removed. |
| Search-direction & ADAT changes | `cpp/src/barrier/barrier.cu` | gpu_compute_search_direction no longer accepts pinned-host direction outputs; it allocates/resizes internal device work buffers and keeps solves on device; ADAT path copies inv_diag back when host ADAT is used. |
| Affine/corrector RHS on device | `cpp/src/barrier/barrier.cu` | compute_affine_rhs / compute_cc_rhs assemble/negate/zero RHS on device (d_h_, d_dual_rhs_, d_bound_rhs_, d_dw_) with thrust operations and device-to-device copies. |
| Iteration loop & exports | `cpp/src/barrier/barrier.cu` | Initial iterate uploaded to GPU; objective dot-products use d_b_/d_restrict_u_; iteration uses device-only direction flow; snapshots and solution export paths explicitly synchronize device→host (w/x/y/v/z) before calling to_solution. to_solution signature updated (removed dual_residual). |
| Feasibility-jump kernels & launch wrappers | `cpp/src/mip_heuristics/feasibility_jump/*` | Added typed get_launch_dims_* and launch_* wrappers; replaced raw `<<<>>>` / cooperative launches with wrapper API; extended explicit template instantiations. |
| Clique/presolve refactor | `cpp/src/mip_heuristics/presolve/conflict_graph/*` | CSR-based var→clique maps, small-clique adjacency finalization, simplified extend-phase with work budgets; several API/constness and signature changes. |
| Presolve bounds update gating | `cpp/src/mip_heuristics/presolve/*` | Introduced candidate_bound_scale, added device helper accept_candidate_bound_update, and gated bound writes by scaled absolute-tolerance. |
| Diving hyper-params and B&B wiring | `cpp/include/cuopt/linear_programming/mip/*`, `cpp/src/branch_and_bound/*` | New templated diving hyper-params type, solver settings extended, is_search_strategy_enabled/feasible symbol helpers, B&B wiring to publish benchmark metrics and use diving params. |
| Benchmarks & reporting | `benchmarks/linear_programming/cuopt/*` | Added MIPLIB2017 BKS lookup and per-instance gap reporting; run_mip now prints gap-closed-to-BKS line. |
| Tests: inline-LP helpers | `cpp/tests/*`, `cpp/tests/utilities/inline_lp_test_utils.hpp` | Many unit/integration tests refactored to construct problems from inline LP strings via new parse_inline_lp test helper. |
| CI, packaging, and manifests | `.github/workflows/*`, `ci/*`, `dependencies.yaml`, `python/*`, `VERSION`, manifests | Switched reusable workflow refs to rapidsai/shared-workflows@main, bumped many RAPIDS package pins to 26.8.*, updated container image tags, added RAPIDS_PACKAGE_NAME exports in CI scripts, and bumped plugin/manifest versions. |
| Skill docs, evals, sigs | `skills/**`, `.claude-plugin/marketplace.json` | Added/renamed many cuopt-* skill docs, benchmarks, eval datasets, skill cards, and Sigstore DSSE signature bundles; added SECURITY.md and CODEOWNERS entry for it. |
| Minor API/robustness fixes | `cpp/src/pdlp/cuopt_c.cpp`, `cpp/src/utilities/manual_cuda_graph.cuh`, others | Argument validation in cuOptGetErrorString, improved manual CUDA graph capture recovery, OMP include guard tweaks, and other robustness changes. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • NVIDIA/cuopt#1408: Overlapping doc changes for dual-values guidance (LP/QP) and related API documentation.

Suggested reviewers

  • akifcorduk
  • Iroy30
  • KyleFromNVIDIA

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
cpp/src/barrier/barrier.cu (2)

1386-1388: 💤 Low value

Verify dense-columns path synchronization is correct.

The D2H copy of inv_diag followed by synchronize() before host-side solve_adat is necessary for correctness when n_dense_columns > 0. The host solve uses the current device-computed inv_diag values.

However, consider using RAFT_CUDA_TRY wrapper for consistency with other sync points in this file.

Suggested change for consistency
       raft::copy(inv_diag.data(), d_inv_diag.data(), d_inv_diag.size(), stream_view_);
-      stream_view_.synchronize();
+      RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/barrier/barrier.cu` around lines 1386 - 1388, The D2H copy of
inv_diag currently uses raft::copy(...) followed by stream_view_.synchronize();
for correctness when n_dense_columns > 0—wrap the CUDA synchronization in the
RAFT_CUDA_TRY macro for consistency with other sync points (i.e., ensure the
raft::copy and the subsequent stream_view_.synchronize() call are protected by
RAFT_CUDA_TRY) so device errors are checked before proceeding to host-side work
such as host_copy(...) and the host solve (solve_adat).

2471-2472: 💤 Low value

Minor performance: prefer D2D copy from d_b_ instead of H2D from lp.rhs.

Since d_b_ is already a permanent device copy of lp.rhs (copied once at construction, line 351), using it as the source avoids an H2D transfer each iteration.

Suggested optimization
   data.d_primal_residual_.resize(lp.num_rows, stream_view_);
-  raft::copy(data.d_primal_residual_.data(), lp.rhs.data(), lp.rhs.size(), stream_view_);
+  raft::copy(data.d_primal_residual_.data(), data.d_b_.data(), data.d_b_.size(), stream_view_);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/barrier/barrier.cu` around lines 2471 - 2472, Replace the
host-to-device copy from lp.rhs with a device-to-device copy from the existing
device buffer d_b_; specifically, change the raft::copy call that writes into
data.d_primal_residual_ (currently using lp.rhs.data()) to use d_b_.data() (or
the appropriate device pointer named d_b_) and keep the size and stream_view_
unchanged so the transfer is D2D instead of H2D.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: de9a0363-49f9-4235-9a1f-39c929d3f62a

📥 Commits

Reviewing files that changed from the base of the PR and between 3fba293 and d0957f6.

📒 Files selected for processing (2)
  • cpp/src/barrier/barrier.cu
  • cpp/src/barrier/barrier.hpp
💤 Files with no reviewable changes (1)
  • cpp/src/barrier/barrier.hpp

@chris-maes chris-maes modified the milestones: 26.06, 26.08 Jun 3, 2026
Comment thread cpp/src/barrier/barrier.cu Outdated
// Verify A*x = b
data.primal_residual = lp.rhs;
data.cusparse_view_.spmv(1.0, data.x, -1.0, data.primal_residual);
dense_vector_t<i_t, f_t> primal_residual(lp.num_rows);

@yuwenchen95 yuwenchen95 Jun 4, 2026

Contributor

Nit: rename it to init_primal_residual

Suggested change
dense_vector_t<i_t, f_t> primal_residual(lp.num_rows);
dense_vector_t<i_t, f_t> init_primal_residual(lp.num_rows);
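
For readers unfamiliar with the check under review, here is a minimal host-side sketch of the "Verify A*x = b" step. It assumes the `spmv(alpha, x, beta, y)` call follows the usual `y = alpha*A*x + beta*y` convention, so seeding `y` with `b` and calling it with `alpha = 1`, `beta = -1` leaves `A*x - b` in `y`. The `csr_matrix` struct and both functions are hypothetical stand-ins, not cuOpt's `cusparse_view_` API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR matrix; a hypothetical stand-in for the solver's sparse types.
struct csr_matrix {
  std::size_t num_rows;
  std::vector<std::size_t> row_offsets;  // size num_rows + 1
  std::vector<std::size_t> col_indices;  // column index of each stored value
  std::vector<double> values;            // stored nonzero values
};

// y = alpha * A * x + beta * y (assumed convention).
void spmv(const csr_matrix& A,
          double alpha,
          const std::vector<double>& x,
          double beta,
          std::vector<double>& y)
{
  for (std::size_t i = 0; i < A.num_rows; ++i) {
    double dot = 0.0;
    for (std::size_t k = A.row_offsets[i]; k < A.row_offsets[i + 1]; ++k) {
      dot += A.values[k] * x[A.col_indices[k]];
    }
    y[i] = alpha * dot + beta * y[i];
  }
}

// Seeds the residual with b, then applies spmv with alpha = 1 and beta = -1,
// which yields A*x - b.
std::vector<double> primal_residual(const csr_matrix& A,
                                    const std::vector<double>& x,
                                    const std::vector<double>& b)
{
  std::vector<double> r = b;
  spmv(A, 1.0, x, -1.0, r);
  return r;
}
```

For `A = [[1, 2], [0, 3]]`, `x = (1, 1)`, and `b = (3, 3)`, the residual is zero in both components, which is what the initial-point check is looking for.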

@yuwenchen95 yuwenchen95 left a comment

Contributor

Some resize calls still appear in the main while loop of the barrier method. Since the dimensions of these vectors are unchanged once set up, it would be better to group all resize operations at the beginning of the barrier method.

Comment thread cpp/src/barrier/barrier.cu Outdated
#endif

if (data.n_upper_bounds > 0) {
dense_vector_t<i_t, f_t> bound_residual(data.n_upper_bounds);

Contributor

Add init_ prefix like above.

Comment thread cpp/src/barrier/barrier.cu Outdated
data.z.pairwise_subtract(data.c, data.dual_residual);
if (data.Q.n > 0) { matrix_vector_multiply(data.Q, -1.0, data.x, 1.0, data.dual_residual); }
data.cusparse_view_.transpose_spmv(1.0, data.y, 1.0, data.dual_residual);
dense_vector_t<i_t, f_t> dual_residual(lp.num_cols);

Contributor

Suggested change
dense_vector_t<i_t, f_t> dual_residual(lp.num_cols);
dense_vector_t<i_t, f_t> init_dual_residual(lp.num_cols);
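
Read in order, the three host-side lines quoted above appear to accumulate the dual residual as shown below. This is a sketch under two assumptions that the thread does not confirm: that `z.pairwise_subtract(c, out)` writes $z - c$ into `out`, and that both multiply calls follow the $\alpha\,\mathrm{op}(M)\,x + \beta\,y$ convention. The symbol $r_d$ is introduced here for illustration.

$$
\begin{aligned}
r_d &\leftarrow z - c \\
r_d &\leftarrow -Qx + r_d \quad \text{(only when } Q \text{ is nonempty)} \\
r_d &\leftarrow A^{T} y + r_d
\end{aligned}
\qquad\Longrightarrow\qquad
r_d = A^{T} y + z - c - Qx
$$

Under those assumptions, $r_d = 0$ is the dual feasibility condition $A^{T} y + z = c + Qx$.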


data.d_primal_residual_.resize(data.primal_residual.size(), stream_view_);
raft::copy(data.d_primal_residual_.data(), lp.rhs.data(), lp.rhs.size(), stream_view_);
data.d_primal_residual_.resize(lp.num_rows, stream_view_);

Contributor

Nit: it would be clearer if we only resized d_primal_residual_ and d_dual_residual_ the first time we call it.

stream_view_.value());
RAFT_CHECK_CUDA(stream_view_);
if (data.Q.n > 0) {
auto descr_dual_residual = data.cusparse_view_.create_vector(data.d_dual_residual_);

Contributor

descr_dual_residual should be added into the initialization of data, instead of creating it every time.

Contributor Author

create_vector does not have any overhead; it's just a wrapper.

data.d_dy_.resize(dy.size(), stream_view_);
data.d_dz_.resize(dz.size(), stream_view_);
data.d_dv_.resize(dv.size(), stream_view_);
{

Contributor

Logically, it would be better to move the code block "Barrier: GPU allocation and copies" before the while loop of the IPM.

// D2D: RHS = residuals (all on device)
data.cone_combined_step_ = false;
data.cone_sigma_mu_ = f_t(0);
raft::copy(

Contributor

Better to resize d_bound_rhs_ and d_dw_ only once at the beginning of IPM.

Contributor Author

Resize has no overhead if the size is the same as before.
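
A host-side analogy for this point, using `std::vector` rather than the device vector type used in barrier.cu: resizing to the current size (or to anything within the existing capacity) leaves the storage untouched, and only growth past capacity forces a new allocation. The helper `resize_keeps_storage` is hypothetical, and the assumption that the device container behaves the same way is the author's claim in this thread, not something this sketch verifies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if resizing `v` to `n` elements left its storage pointer
// unchanged, meaning no reallocation took place.
bool resize_keeps_storage(std::vector<double>& v, std::size_t n)
{
  const double* before = v.data();
  v.resize(n);
  return v.data() == before;
}
```

Calling it with `n == v.size()` returns true, while calling it with `n == v.capacity() + 1` returns false because the vector has to allocate a larger block. Repeated same-size resize calls inside the iteration loop are therefore cheap in this model, although hoisting them out of the loop may still read more clearly.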

Comment thread cpp/src/barrier/barrier.cu Outdated
vector_norm2<i_t, f_t>(data.primal_residual),
vector_norm2<i_t, f_t>(data.dual_residual),
primal_residual_norm,
dual_residual_norm,

Contributor

dual_residual_norm is not used since to_solution recomputes the dual z and then the residual. We'd better not pass it into to_solution.

@rg20

rg20 commented Jun 11, 2026

Contributor Author

/ok to test a824930

@rg20 rg20 requested review from a team as code owners June 11, 2026 19:22
@rg20 rg20 requested a review from KyleFromNVIDIA June 11, 2026 19:22
@rg20 rg20 requested a review from Iroy30 June 11, 2026 19:22
@rg20 rg20 changed the base branch from release/26.06 to main June 11, 2026 19:23
@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026


/ok to test a824930

@rg20, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@rg20

rg20 commented Jun 11, 2026

Contributor Author

/ok to test ef22921

@rg20 rg20 requested review from chris-maes and removed request for a team, Iroy30, KyleFromNVIDIA and akifcorduk June 11, 2026 19:24
@rg20

rg20 commented Jun 12, 2026

Contributor Author

/merge

@rapids-bot rapids-bot Bot merged commit 3dba505 into NVIDIA:main Jun 12, 2026
98 checks passed

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

4 participants