CuD-PDLP#1391
Conversation
…he cycle seems to be fixed, cuopt compiles
+ style too
compiles and runs
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (10)
💤 Files with no reviewable changes (5)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughAdds end-to-end distributed multi-GPU PDLP: CMake/third-party wiring, partitioner contracts and METIS/KaMinPar backends, partition file I/O and rank data, per-GPU shard types, a multi-GPU engine (halo/exchange/allreduce/distributed SpMV/scaling), PDHG/PDLP solver multi-GPU wiring and constructors, distributed scaling/refactoring, convergence/restart adaptations, and tests. ChangesDistributed Multi-GPU PDLP
🎯 4 (Complex) | ⏱️ ~75 minutes Suggested labels: Suggested reviewers:
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/src/pdlp/solve.cu (1)
769-784:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReject distributed
problem_tcalls before any early return.The new guard sits below the zero-constraint return and the FP32 fallback. With
use_distributed_pdlp=trueplusSinglePrecision, this path returnsrun_pdlp_solver_in_fp32(...)instead of raising the intended validation error, so an unsupported distributed configuration silently runs the single-GPU solver.Suggested fix
static optimization_problem_solution_t<i_t, f_t> run_pdlp_solver( detail::problem_t<i_t, f_t>& problem, pdlp_solver_settings_t<i_t, f_t> const& settings, const timer_t& timer, bool is_batch_mode) { + cuopt_expects(!settings.hyper_params.use_distributed_pdlp, + error_type_t::ValidationError, + "Distributed PDLP must be entered via solve_lp(mps_data_model, ...) " + "so the master GPU never materializes the full problem. Call sites " + "with a problem_t cannot dispatch to distributed mode."); + detail::pdlp_graph_disabled_flag().store(settings.hyper_params.pdlp_disable_graph, std::memory_order_relaxed); if (problem.n_constraints == 0) { ... } `#if` PDLP_INSTANTIATE_FLOAT || CUOPT_INSTANTIATE_FLOAT if constexpr (std::is_same_v<f_t, double>) { if (settings.pdlp_precision == pdlp_precision_t::SinglePrecision) { return run_pdlp_solver_in_fp32(problem, settings, timer, is_batch_mode); } } `#endif` - cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/solve.cu` around lines 769 - 784, The distributed-mode validation (cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...)) must be performed before any early returns so a distributed call cannot accidentally take the FP32 fallback or zero-constraint path; move or duplicate that check to occur before the SinglePrecision/FP32 branch and before the zero-constraint return so that when settings.hyper_params.use_distributed_pdlp is true (for problem_t inputs) the function immediately raises the ValidationError rather than calling run_pdlp_solver_in_fp32 or returning early. Ensure the check references the same validation message and error_type_t::ValidationError used currently.cpp/src/pdlp/pdlp.cu (1)
3063-3079:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winThe distributed average path is still unsafe in release builds.
When
multi_gpu_engineis present andnever_restart_to_averageis false, Line 3071 uses plainassert(false). In release builds that disappears, and the subsequentraft::copywritesprimal_size_h_/dual_size_h_elements intounscaled_*_avg_solution_, which were never resized for the distributed ctor. That turns this TODO into an invalid device-copy / wrong-result path instead of a clean runtime rejection.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/pdlp.cu` around lines 3063 - 3079, The path that handles multi-GPU (multi_gpu_engine) uses assert(false) which vanishes in release builds and leads to invalid device copies into unscaled_primal_avg_solution_ / unscaled_dual_avg_solution_; fix by replacing the assert with a deterministic runtime guard: either resize/allocate unscaled_primal_avg_solution_ and unscaled_dual_avg_solution_ to primal_size_h_ and dual_size_h_ (and synchronize/validate device pointers) before calling raft::copy from pdhg_solver_.get_primal_solution() / get_dual_solution(), or explicitly fail early by logging and throwing a runtime_error when multi_gpu_engine is true so the copy is never attempted; update the branch around internal_solver_iterations_ <= 1 where multi_gpu_engine is checked to implement one of these safe behaviors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/cuopt_cli.cpp`:
- Around line 180-184: When lp_settings.hyper_params.use_distributed_pdlp is
true, guard the distributed PDLP call by checking that handle_ptr is non-null
before invoking cuopt::linear_programming::solve_lp(handle_ptr.get(), ...); if
handle_ptr is null, fail fast with a clear error (e.g., log and exit or throw)
rather than calling the distributed overload; update the branch that currently
chooses between solve_lp(handle_ptr.get(), mps_data_model, lp_settings) and
solve_lp(problem_interface.get(), lp_settings) to validate handle_ptr first and
only call the distributed overload when handle_ptr is valid.
- Around line 439-447: The code currently computes requested_gpus and then uses
std::min(...) to compute provisioned_gpus and
memory_resources.reserve(provisioned_gpus) without validating requested_gpus;
add explicit validation after computing requested_gpus (and after remapping -1
when use_distributed_pdlp is true) to ensure requested_gpus > 0 and that
raft::device_setter::get_device_count() > 0 before calling std::min or reserve.
If either value is non-positive, return/log an error or throw an exception
(consistent with surrounding error handling) referencing the parameters obtained
via settings.get_parameter<int>(CUOPT_NUM_GPUS) and
settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS) so the code never
calls memory_resources.reserve with a non-positive size.
In `@cpp/src/pdlp/cusparse_view.cu`:
- Around line 501-511: The mixed-precision branch still sizes and recreates FP32
matrices using op_problem_scaled.nnz which can differ per shard; update that
block to use the shard-local nnz values (e.g. static_cast<int64_t>(A_.size())
and static_cast<int64_t>(A_T_.size())) when allocating/sizing A_mixed_ and
A_T_mixed_ and when copying/transposing data for A_T.create / A.create so you
don't overrun A_T_ or leave stale nnz metadata; ensure any metadata fields set
during the FP32 recreate follow the shard-local sizes and that all
transforms/read ranges use those local sizes (A_, A_T_, A_mixed_, A_T_mixed_).
In `@cpp/src/pdlp/distributed_pdlp/partition_loader.cu`:
- Around line 77-87: Validate partition and CSR metadata before any
slicing/indexing: check that parts.size() >= nb_cstr + nb_vars before creating
cstr_parts/var_parts, ensure all entries in parts are within [0, nb_parts)
before using them to index rank_data_t<i_t,f_t>, and verify CSR arrays
(offsets/indices) have expected lengths (e.g., offsets.size() >= rows+1 and
indices.size() == nnz) before dereferencing in functions that build/iterate the
CSR (referencing variables parts, nb_cstr, nb_vars, rank_data_t, and the CSR
offset/index containers); use cuopt_expects (or the existing error path) to fail
early with clear messages when any check fails.
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 821-825: The distributed gather of the current iterate is missing
on several return paths so master buffers can be stale; call the multi-GPU
gather before any return that serializes the current solution. Specifically,
ensure pdhg_solver_.get_mgpu_engine() and its method
gather_potential_next_solutions_to_master(pdhg_solver_,
current_termination_strategy_.get_convergence_information().get_reduced_cost())
is invoked centrally before any code that calls
fill_return_problem_solution(...), and add the same centralized gather call on
the other identified return sites (including the ConcurrentLimit and
PrimalFeasible/infeasibility exits referenced around lines ~859-863 and
~1541-1545) so the master full-size solution/reduced-cost buffers are populated
on every distributed return path.
- Around line 387-393: The distributed constructor pdlp_solver_t( problem_t<i_t,
f_t>& placeholder_problem, ... ) currently delegates to the regular ctor before
shard sizes exist, causing functions that use primal_size_h_/dual_size_h_ (e.g.,
set_initial_primal_solution, handling of initial_primal_solution and
initial_dual_solution and warm-start data) to operate on zero-length buffers;
update this constructor to either (a) validate and reject any initial-state
options (initial_primal_solution, initial_dual_solution, warm-start) up front
and return an error, or (b) defer all logic that applies initial iterates (calls
to set_initial_primal_solution / set_initial_dual_solution and warm-start
handling) until after shard construction when primal_size_h_ and dual_size_h_
are set, ensuring no modulo/divide-by-zero or zero-length copies occur.
In `@cpp/src/pdlp/solve.cu`:
- Around line 759-760: The global flag detail::pdlp_graph_disabled_flag() is
being mutated per-solve causing races; instead make the graph-disable decision
local to each solver instance and avoid writing the process-global flag from
solve entrypoints. Change callers that currently
store(settings.hyper_params.pdlp_disable_graph, ...) to pass the
pdlp_disable_graph boolean into the solver instance (or ctor) and have
ping_pong_graph_t::run() and related graph code read that instance-level flag
rather than detail::pdlp_graph_disabled_flag(); remove writes to the global flag
in solve functions so concurrent solves do not flip each other’s mode.
- Around line 2129-2134: The current overload erroneously hard-fails via
cuopt_expects when settings.hyper_params.use_distributed_pdlp is false and
always forwards to solve_lp_distributed_from_mps, removing the original
single-GPU/direct-MPS path; restore the prior behavior by replacing the
hard-fail with a branch: if settings.hyper_params.use_distributed_pdlp is true
call solve_lp_distributed_from_mps(handle_ptr, mps_data_model, settings,
problem_checking, use_pdlp_solver_mode) else call the non-distributed/MPS
entrypoint (the original direct-MPS function used previously—e.g.,
solve_lp_from_mps or the equivalent direct-MPS routine) so both paths are
supported, and keep or adjust cuopt_expects to validate only unsupported
parameter combinations if needed.
- Around line 2157-2205: solve_lp_distributed_from_mps builds
detail::pdlp_solver_t using settings_resolved but never applies settings.method
or calls set_pdlp_solver_mode, so requested PDLP modes/presets are ignored; fix
by checking settings_resolved.use_pdlp_solver_mode (and/or
settings_resolved.method) before constructing the solver and call
set_pdlp_solver_mode(settings_resolved) to map the preset/method into the solver
settings (or apply the mapping to settings_resolved) so the subsequent
detail::pdlp_solver_t(placeholder_problem, mps_data_model, settings_resolved) is
constructed with the intended PDLP mode.
In `@cpp/tests/linear_programming/pdlp_test.cu`:
- Around line 188-191: The test currently sets distributed_pdlp_num_gpus = -1
which lets a single-GPU run bypass the multi-GPU/NCCL path; change the test to
first query the available GPU count and if fewer than 2 GPUs are present skip
the test, otherwise set pdlp_solver_settings_t::distributed_pdlp_num_gpus to at
least 2 (e.g., max(2, available_gpus)) before calling solve_lp(&handle, problem,
dist_settings) so the distributed PDLP path is actually exercised (use
pdlp_solver_settings_t, dist_settings, distributed_pdlp_num_gpus and solve_lp as
the loci to modify).
- Around line 248-252: The test pdlp_class::distributed_parity_square41 is
loading the wrong dataset; change the argument to
expect_distributed_matches_base in that test so it points to
"linear_programming/square41/square41.mps" instead of
"linear_programming/neos3/neos3.mps" so the regression covers the intended
square41 case (update the call site in the distributed_parity_square41 test that
invokes expect_distributed_matches_base).
---
Outside diff comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 3063-3079: The path that handles multi-GPU (multi_gpu_engine) uses
assert(false) which vanishes in release builds and leads to invalid device
copies into unscaled_primal_avg_solution_ / unscaled_dual_avg_solution_; fix by
replacing the assert with a deterministic runtime guard: either resize/allocate
unscaled_primal_avg_solution_ and unscaled_dual_avg_solution_ to primal_size_h_
and dual_size_h_ (and synchronize/validate device pointers) before calling
raft::copy from pdhg_solver_.get_primal_solution() / get_dual_solution(), or
explicitly fail early by logging and throwing a runtime_error when
multi_gpu_engine is true so the copy is never attempted; update the branch
around internal_solver_iterations_ <= 1 where multi_gpu_engine is checked to
implement one of these safe behaviors.
In `@cpp/src/pdlp/solve.cu`:
- Around line 769-784: The distributed-mode validation
(cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...)) must be
performed before any early returns so a distributed call cannot accidentally
take the FP32 fallback or zero-constraint path; move or duplicate that check to
occur before the SinglePrecision/FP32 branch and before the zero-constraint
return so that when settings.hyper_params.use_distributed_pdlp is true (for
problem_t inputs) the function immediately raises the ValidationError rather
than calling run_pdlp_solver_in_fp32 or returning early. Ensure the check
references the same validation message and error_type_t::ValidationError used
currently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7df2a4b9-585b-4517-afcb-1aa089ecb1c1
📒 Files selected for processing (41)
cpp/CMakeLists.txtcpp/cmake/thirdparty/get_kaminpar.cmakecpp/cuopt_cli.cppcpp/include/cuopt/linear_programming/constants.hcpp/include/cuopt/linear_programming/pdlp/pdlp_hyper_params.cuhcpp/include/cuopt/linear_programming/pdlp/solver_settings.hppcpp/src/math_optimization/solver_settings.cucpp/src/pdlp/CMakeLists.txtcpp/src/pdlp/cusparse_view.cucpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.cppcpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.hppcpp/src/pdlp/distributed_pdlp/metis_partitioner.cucpp/src/pdlp/distributed_pdlp/metis_partitioner.hppcpp/src/pdlp/distributed_pdlp/multi_gpu_engine.cucpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hppcpp/src/pdlp/distributed_pdlp/partition_loader.cucpp/src/pdlp/distributed_pdlp/partition_loader.hppcpp/src/pdlp/distributed_pdlp/partitioner.cucpp/src/pdlp/distributed_pdlp/partitioner.hppcpp/src/pdlp/distributed_pdlp/rank_data.hppcpp/src/pdlp/distributed_pdlp/shard.cucpp/src/pdlp/distributed_pdlp/shard.hppcpp/src/pdlp/initial_scaling_strategy/initial_scaling.cucpp/src/pdlp/initial_scaling_strategy/initial_scaling.cuhcpp/src/pdlp/pdhg.cucpp/src/pdlp/pdhg.hppcpp/src/pdlp/pdlp.cucpp/src/pdlp/pdlp.cuhcpp/src/pdlp/restart_strategy/pdlp_restart_strategy.cucpp/src/pdlp/saddle_point.cucpp/src/pdlp/solve.cucpp/src/pdlp/solve.cuhcpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cucpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.hppcpp/src/pdlp/termination_strategy/convergence_information.cucpp/src/pdlp/termination_strategy/convergence_information.hppcpp/src/pdlp/termination_strategy/termination_strategy.cucpp/src/pdlp/termination_strategy/termination_strategy.hppcpp/src/pdlp/utilities/mgpu_trace.cuhcpp/src/pdlp/utilities/ping_pong_graph.cuhcpp/tests/linear_programming/pdlp_test.cu
|
/ok to test 818ffcd |
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
/ok to test f3b6343 |
…et a 6x on construction time !! #devtechouquoilateam
|
/ok to test f3b6343 |
@Bubullzz, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
Not review ready
Not merge ready
Just to let team have a look at it but definitely needs a big clean up
closes #891