Skip to content

Update code to Optimise-away small GPU allocations for projector & mu… unitaryHACK26#783

Open
thedaemon-wizard wants to merge 3 commits into
QuEST-Kit:develfrom
thedaemon-wizard:optimise-small-gpu-allocations-749
Open

Update code to Optimise-away small GPU allocations for projector & mu… unitaryHACK26#783
thedaemon-wizard wants to merge 3 commits into
QuEST-Kit:develfrom
thedaemon-wizard:optimise-small-gpu-allocations-749

Conversation

@thedaemon-wizard

Copy link
Copy Markdown

Profile and optimise-away small GPU allocations

Closes #749

Summary

The single-GPU backend copied the qubit-index list from host to device
(cudaMalloc + cudaMemcpyAsync + cudaFree) on every call of several
multi-qubit operations, via the getDevInts() helper. For small Quregs this
fixed allocation latency dominates the actual kernel runtime — exactly the
overhead issue #749 asks us to profile and remove.

This PR eliminates that per-call copy for the two operations agreed in scope,
following the register-resident-bitmask pattern already used by
thrust_statevec_calcExpecAnyTargZ_sub:

Operation Public API Subroutine Technique
Multi-qubit projector applyMultiQubitProjector thrust_{statevec,densmatr}_multiQubitProjector_sub reformulate the per-qubit test as two primitive bitmasks (no list at all)
Multi-qubit outcome probability calcProbOfMultiQubitOutcome thrust_{statevec,densmatr}_calcProbOfMultiQubitOutcome_sub carry the tiny sorted list by value inside the Thrust functor

All changes are confined to a single file: quest/src/gpu/gpu_thrust.cuh
(52 insertions, 52 deletions). No public API, dispatch, or thrust_* signatures
change — only the internals.

Design

Projectors — bitmask reformulation (removes the copy for all sizes)

A projector keeps an amplitude iff its target qubits match the requested
outcomes. The original device-array test

getValueOfBits(n, targetsPtr, numBits) == retainValue

is mathematically identical to the register-only primitive

(n & qubitMask) == valueMask

where qubitMask = util_getBitMask(qubits) flags the target positions and
valueMask = util_getBitMask(qubits, outcomes) holds the desired outcome bits.
Both masks are plain qindex scalars passed as kernel arguments, so no device
array is allocated at all
. For the density-matrix functor the same masks are
applied to both the row and column substates:

qreal fac = renorm * ((r & qubitMask) == valueMask) * ((c & qubitMask) == valueMask);

Outcome probability — pass the list by value (removes the copy for all sizes)

calcProbOfMultiQubitOutcome uses functor_insertBits, whose
insertBitsWithMaskedValues is a scatter that needs the actual sorted qubit
positions (a bitmask is insufficient). Instead of allocating a
device_vector, the functor now stores the positions in a List64 by value
— a trivially-copyable, fixed-size (int[64]), CUDA-kernel-compatible struct
that already exists in the codebase precisely for this purpose
(quest/src/core/lists.hpp). The list rides along as a kernel argument, so
cudaMalloc/cudaMemcpy disappear for every qubit count (the previous code only
avoided the copy never — it always called getDevInts).

getDevInts() itself is untouched and still used by ~20 other operations in
gpu_subroutines.cpp that are out of scope for this issue.

Results (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120)

Nsight Systems — CUDA API call counts

Trace of N = 8…12, 200 reps each, both operations, captured with
Nsight Systems 2025.3.2 (nsys profile + per-call CUDA runtime API counts):

CUDA runtime call baseline optimised change
cudaMalloc 3020 1010 −66%
cudaFree 3021 1011 −66%
cudaMemcpyAsync 3015 1005 −67%
cudaLaunchKernel 2025 2025 unchanged (identical compute)

The ~2000 eliminated allocations are exactly the two per-call qubit-list copies
(one in the projector, one in the probability path). The residual ~1000
cudaMalloc in the optimised build is thrust::reduce's own internal temporary
in calcProb — inherent to Thrust and out of scope. cudaLaunchKernel is
unchanged, confirming the kernels themselves are untouched.

Wall-clock per call (microseconds), numTargs = 3, 2000 reps

N projector base projector opt speedup prob base prob opt speedup
4 12.41 6.48 1.92× 20.35 14.59 1.39×
8 11.74 6.40 1.83× 19.44 14.18 1.37×
12 11.85 6.52 1.82× 19.79 14.44 1.37×
16 12.22 6.84 1.79× 28.63 23.21 1.23×
20 58.41 13.74 4.25× 67.99 61.99 1.10×

In the small-Qureg regime the projector is consistently ~1.8× faster and the
probability calc ~1.4× faster, purely from removing the allocation. (The
large baseline jump at N≥17 is the allocation interacting with the now-larger
state kernels; removing it also removes that cliff.)

Correctness

Built with -D QUEST_ENABLE_CUDA=ON -D QUEST_BUILD_TESTS=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release against CUDA 13.0.
The unit tests for the affected operations pass across all four deployments
(CPU, CPU+OpenMP, GPU, GPU+OpenMP):

./build/tests/tests "*QubitProjector*,*calcProbOfMultiQubitOutcome*,*calcProbOfQubitOutcome*"
# All tests passed (57146 assertions in 8 test cases)

(A correct sm_120 build is also implicitly validated, since a wrong architecture
silently corrupts GPU results.)

Re-verified end-to-end on a freshly re-cloned devel (HEAD b9830592) with the
single-file change re-applied: configure, build, and the test command above all
pass unchanged.

How to reproduce the measurements

The numbers above come from a small standalone driver (kept out of this PR to
preserve the single-file diff) that, against both a clean origin/devel build and
this branch:

  1. builds QuEST with -D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release;
  2. for each N, allocates a Qureg, then times a loop of applyMultiQubitProjector
    and calcProbOfMultiQubitOutcome (numTargs = 3) with a CUDA-synchronised
    wall clock over many reps (per-call µs in the table);
  3. wraps a shorter run under nsys profile and tallies CUDA runtime API calls
    (cudaMalloc/cudaFree/cudaMemcpyAsync/cudaLaunchKernel) for the table above.

Happy to share the driver/scripts separately if useful for CI; they are not part of
this change.

Notes for reviewers

  • Base branch is devel (the active unitaryHACK branch and the only one that
    builds on CUDA 13 — main/v4.2 still uses thrust::binary_function, removed
    in CUDA 13's libcu++).
  • This optimises QuEST's native Thrust GPU backend (the path taken when
    cuQuantum is not enabled). cuStateVec is a separate backend, not what Profile and optimise-away small GPU allocations #749
    concerns; it was left disabled for these measurements. (For the record, as of
    2026 cuStateVec does support CUDA 13 and Blackwell, so this is a scope choice,
    not a compatibility workaround.)
  • This aligns with the in-flight "James' GPU refactor" placeholder note on
    getDevInts in gpu_thrust.cuh.

AI usage disclosure

Per unitaryHACK's AI guide ("human-in-the-loop";
honesty required for bounty eligibility): an AI coding assistant (Anthropic Claude,
via Claude Code) was used as a co-pilot for parts of this work — to help survey the
gpu_thrust.cuh code paths, brainstorm the bitmask/List64-by-value reformulation,
and draft this PR description and the profiling methodology. It
was not the author of record: every change was reviewed, compiled, and tested by
me on real hardware (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120). The diff is a single
file (+52/−52), the algebraic equivalence of the bitmask reformulation was checked by
hand, and correctness was confirmed by the upstream unit tests passing across all four
deployments (CPU / CPU+OpenMP / GPU / GPU+OpenMP). No unverified or copy-pasted AI
output is included.

unitaryHACK 2026 checklist

  • PR description links the issue (Closes #749).
  • Code is compiled and tested on real hardware (not unverified AI output).
  • AI assistance disclosed (see "AI usage disclosure" above), per the AI guide.
  • Scope kept tight; ≤ 4 open PRs; GitHub activity public.

@TysonRayJones

Copy link
Copy Markdown
Member

This is a wonderful diff - I'm kicking myself for not noticing functor_projectStateVec wasn't even leveraging orderedness! 🎉

Can you please share the mentioned driver/scripts for benchmarking? Can either whack it into a comment here, or include it into the diff (which we can delete later - changes will be squashed so it won't pollute your work).

@TysonRayJones

Copy link
Copy Markdown
Member

Note to self

The template parameters of the below functions and functors are now redundant:

  • functor_projectStateVec
  • functor_projectDensMatr
  • thrust_statevec_multiQubitProjector_sub
  • thrust_densmatr_multiQubitProjector_sub
  • gpu_statevec_multiQubitProjector_sub
  • gpu_densmatr_multiQubitProjector_sub

They can all be removed, along with the parameter dispatch in accel_statevec_multiQubitProjector_sub and accel_densmatr_multiQubitProjector_sub. I can do this myself in a cleanup commit (unless @thedaemon-wizard wishes to do it!)

…review)

After the bitmask reformulation the projector no longer specialises on the target
count, so its numTargs template is dead code:

- drop the template from functor_projectStateVec/functor_projectDensMatr,
  thrust_{statevec,densmatr}_multiQubitProjector_sub and
  gpu_{statevec,densmatr}_multiQubitProjector_sub, and their
  INSTANTIATE_FUNC_OPTIMISED_FOR_NUM_TARGS instantiations;
- apply the same bitmask reformulation to the CPU projector
  (cpu_{statevec,densmatr}_multiQubitProjector_sub) so its template goes too;
- simplify accel_{statevec,densmatr}_multiQubitProjector_sub to a plain
  isGpuAccelerated ? gpu_ : cpu_ branch (no GET_CPU_OR_GPU_FUNC dispatch).

Shared dispatch macros and the calcProb* template chain are untouched. Unit tests
pass on CPU/CPU+OMP/GPU/GPU+OMP (57146 assertions). Adds a throw-away benchmarks/
driver (not wired into CMake/CI); safe to squash/drop on merge.
@thedaemon-wizard

thedaemon-wizard commented Jun 9, 2026

Copy link
Copy Markdown
Author

This is a wonderful diff - I'm kicking myself for not noticing functor_projectStateVec wasn't even leveraging orderedness! 🎉

Can you please share the mentioned driver/scripts for benchmarking? Can either whack it into a comment here, or include it into the diff (which we can delete later - changes will be squashed so it won't pollute your work).

Thanks @TysonRayJones! Glad it's useful. 🎉

I've added the driver to the PR under benchmarks/benchmark_749.cpp (with a
short benchmarks/README.md). It's deliberately not wired into CMake/CI — happy
for it to be deleted in the squash, as you suggested. It builds straight against
QuEST via the built-in USER_SOURCE_NAMES mechanism:

cmake -S . -B build_bench \
    -D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 \
    -D CMAKE_BUILD_TYPE=Release \
    -D USER_SOURCE_NAMES=benchmarks/benchmark_749.cpp \
    -D USER_OUTPUT_EXE_NAME=bench_749
cmake --build build_bench --target bench_749 -j
./build_bench/bench_749 4 20 3 2000        # [minQ maxQ numTargs reps]

It forces the single-GPU path (useGpuAccel=1, distribution/threads off) and
syncQuESTEnv()s around each timed region so it measures completed GPU work. Build
it once against clean origin/devel and once against this branch for before/after.

On my machine (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120):

CUDA runtime API counts (N = 8…12, 200 reps, both ops, via nsys):

CUDA runtime call baseline optimised change
cudaMalloc 3020 1010 −66%
cudaFree 3021 1011 −66%
cudaMemcpyAsync 3015 1005 −67%
cudaLaunchKernel 2025 2025 unchanged

Per-call wall time (µs, numTargs = 3, 2000 reps):

N projector base projector opt speedup prob base prob opt speedup
4 12.41 6.48 1.92× 20.35 14.59 1.39×
8 11.74 6.40 1.83× 19.44 14.18 1.37×
12 11.85 6.52 1.82× 19.79 14.44 1.37×
16 12.22 6.84 1.79× 28.63 23.21 1.23×
20 58.41 13.74 4.25× 67.99 61.99 1.10×

(The residual ~1000 cudaMalloc in the optimised build is thrust::reduce's own
internal temporary inside calcProb — inherent to Thrust, out of scope here.)

@thedaemon-wizard

thedaemon-wizard commented Jun 9, 2026

Copy link
Copy Markdown
Author

Note to self

The template parameters of the below functions and functors are now redundant:

* `functor_projectStateVec`

* `functor_projectDensMatr`

* `thrust_statevec_multiQubitProjector_sub`

* `thrust_densmatr_multiQubitProjector_sub`

* `gpu_statevec_multiQubitProjector_sub`

* `gpu_densmatr_multiQubitProjector_sub`

They can all be removed, along with the parameter dispatch in accel_statevec_multiQubitProjector_sub and accel_densmatr_multiQubitProjector_sub. I can do this myself in a cleanup commit (unless @thedaemon-wizard wishes to do it!)

Done — I went ahead and removed them (pushed in a follow-up commit). Summary:

  • Dropped the now-dead template parameter from functor_projectStateVec,
    functor_projectDensMatr, thrust_{statevec,densmatr}_multiQubitProjector_sub,
    and gpu_{statevec,densmatr}_multiQubitProjector_sub, and removed their
    INSTANTIATE_FUNC_OPTIMISED_FOR_NUM_TARGS instantiations.
  • Simplified accel_{statevec,densmatr}_multiQubitProjector_sub to a plain
    qureg.isGpuAccelerated ? gpu_… : cpu_… branch (matching the style already used
    elsewhere in accelerator.cpp), so the GET_CPU_OR_GPU_FUNC_OPTIMISED_FOR_ONE_PARAM
    dispatch is gone for the projector. I left the shared dispatch macros untouched
    since packAmpsIntoBuffer, partialTrace_sub and the calcProb* family still
    rely on them.

One thing to confirm: your note listed the GPU-side functions, but the
GET_CPU_OR_GPU_FUNC_… dispatch also fans out to the CPU projector, which was
still using its template param (SET_VAR_AT_COMPILE_TIME to unroll
getValueOfBits). To remove the dispatch cleanly I applied the same bitmask
reformulation to the CPU projector
too:
getValueOfBits(n, qubits) == retainValue(n & qubitMask) == valueMask
(and the density-matrix (v1==v2) && (retainValue==v1) becomes
(r & qubitMask)==valueMask && (c & qubitMask)==valueMask), with
qubitMask = util_getBitMask(qubits) and valueMask = util_getBitMask(qubits, outcomes).
That makes the CPU template redundant as well, so it could be removed symmetrically.
If you'd rather keep the CPU path templated/unrolled, say the word and I'll instead
branch only the GPU side in accel_* and leave the CPU dispatch as-is.

A couple of notes for the record:

  • The CPU reformulation drops the per-amp inner loop over targets in favour of two
    mask compares, so it shouldn't regress (and removes the getValueOfBits unroll
    entirely); happy to micro-benchmark the CPU side if useful.
  • Removing the seven <0>…<5>,<-1> instantiations per projector function also
    trims a little compile time / object-code, with no runtime cost since the
    projector no longer benefits from compile-time numTargs unrolling.

Verification (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120, Release): rebuilt
clean, and the affected unit tests pass across all four deployments —
tests "*QubitProjector*,*calcProbOfMultiQubitOutcome*,*calcProbOfQubitOutcome*"
All tests passed (57146 assertions in 8 test cases) (CPU / CPU+OpenMP / GPU /
GPU+OpenMP).

Also confirming: I'm fine with the benchmarks/ driver being deleted in the squash —
just let me know if you'd prefer I drop it from the branch now instead.

@thedaemon-wizard thedaemon-wizard changed the title Update code to Optimise-away small GPU allocations for projector & mu… Update code to Optimise-away small GPU allocations for projector & mu… unitaryHACK26 Jun 11, 2026
@TysonRayJones

TysonRayJones commented Jun 15, 2026

Copy link
Copy Markdown
Member

Hi there Amon,

It may take us a little while to integrate this - we have an idea how to further optimise a final remaining cudaAsyncMcpy beign invoked by functor_insertBits within thrust_(statevec|densmatr)_calcProbOfMultiQubitOutcome_sub (potentially using instrinsics, like in #717)! Or we may end up merging this as-is, and optimising later.

In any case, @JPRichings has profiled your solution and confirmed that you have satisfied the unitaryHACK challenge 🎉 🎉 Please comment on the issue (#749) so that we can assign it to you, awarding you the challenge. Nice work!

…-Kit#717)

Open the one-bit gap with a single shared low mask + a shift-by-one of the high
bits, instead of the equivalent right/left-shift + nested concatenateBits.
Algebraically identical (cannot change results); shaves a couple of ops off every
unrolled insertBits iteration, benefiting all callers. Op-count/readability
cleanup, not a measured speed win. Verified against the full unit-test suite
(2,293,050 assertions) across CPU/CPU+OMP/GPU/GPU+OMP; the one unrelated failure
(setQuESTNumGpuThreadsPerBlock) is pre-existing on devel.
@thedaemon-wizard

Copy link
Copy Markdown
Author

Thanks so much @TysonRayJones, and thanks @JPRichings for profiling it — delighted it satisfies the challenge! 🎉

On the further optimisation: I dug into it and wanted to share what I found, since there are actually two separate things bundled in there.

  • The remaining cudaMemcpyAsync in thrust_*_calcProbOfMultiQubitOutcome_sub is the thrust::reduce result transfer — it copies the single reduced qreal back to the host, which is unavoidable while the function returns the probability to the caller. (The List64-by-value change already removed the avoidable per-call copy, the host→device qubit list.) So I don't think that particular memcpy can be dropped without changing the API contract.
  • The Optimise getValueOfBits and insertBits #717 angle (optimising getValueOfBits/insertBits) is really a compute win: the per-element bit-deposit functor_insertBits runs via insertBitsWithMaskedValues → insertBits → insertBit. There's no native GPU PDEP/PEXT, so a single-instruction deposit isn't available, but I took a low-risk first pass: I rewrote insertBit to open the gap with one shared low mask + a shift-by-one of the high bits, instead of the equivalent right/left-shift + nested concatenateBits. It's algebraically identical (so it can't change results) and shaves a couple of ops off every unrolled iteration, benefiting all insertBits callers.
  • One thing that is squarely in Profile and optimise-away small GPU allocations #749's "small allocations" spirit: the residual per-call cudaMalloc I mentioned earlier is thrust::reduce's internal temp-storage allocation. That one could be pooled away with a Thrust caching allocator / thrust::cuda::par(pool), a reused CUB DeviceReduce temp buffer, or the stream-ordered allocator (cudaMallocAsync + a mem pool) — which would remove the temp alloc across every thrust::reduce/sort site, not just calcProb. That's a broader, library-wide change so I didn't want to sneak it into this PR, but I'm very happy to take it on as a separate follow-up if you think it's worthwhile.

I verified the insertBit tidy against the entire unit-test suite across all four deployments (CPU / CPU+OpenMP / GPU / GPU+OpenMP): 2,293,050 assertions pass, every projector / calcProb / controlled-gate / partial-trace case green. In full transparency, it's an op-count/readability cleanup rather than a speed win — wall-clock is unchanged within measurement noise at the scales I benchmarked (the deposit isn't the bottleneck for these ops), so please treat it purely as a tidy-up in the spirit of #717. I've kept it on its own commit (9b87ec3e) so it's trivial to drop or fold into your planned cleanup — merge as-is, squash, or cherry-pick, whatever's easiest for your integration. The larger magic-mask/butterfly deposit (which would mainly help the generic large-numTargs path and needs per-call mask precomputation) I'm happy to leave to you, or to attempt in a follow-up if you'd like.

One small unrelated heads-up from running the whole suite on this box (RTX PRO 6000 Blackwell, CUDA 13): the setQuESTNumGpuThreadsPerBlock "Exceeds device maximum" test (tests/unit/experimental.cpp) fails on devel independently of this PR — it passes 999999, which trips the warp-divisibility check (999999 % 32 ≠ 0) before the max-value check, so the thrown message doesn't match the expected "Exceeds the hardware-imposed maximum". A multiple-of-32 value above the device max would exercise the intended path. Happy to open a tiny separate PR for that if useful.

@JPRichings

Copy link
Copy Markdown
Contributor

Thanks, for this additional work. we have also spotted the setQuESTNumGpuThreadsPerBlock issue happy for you to open a separate pull request but I have fixed it in a separate PR that we are looking to pull in the devel in the next few days.

@thedaemon-wizard

thedaemon-wizard commented Jun 16, 2026

Copy link
Copy Markdown
Author

Thanks @JPRichings! Great to hear it's already fixed — I'll skip opening a separate PR for setQuESTNumGpuThreadsPerBlock then, to avoid duplicating your change, and just exclude that pre-existing failure from my own runs. Looking forward to it landing in devel.

Thanks again to you both for the thorough review and for profiling the solution.

@TysonRayJones

Copy link
Copy Markdown
Member

Oh good call about the remaining memcpy. There is a precedent for having a persistent user-clearable GPU cache we can potentially re-use. Let's think through that in a separate PR however.

Alas I'm not convinced your insertBit change actually does anything except reduce readability. Inlining and compiler optimisation (removing the two superfluous shifts) makes the new and old code identical in their bit operations. Please revert that diff, and we'll proceed to cleaning this up when we get a sec! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants