Update code to Optimise-away small GPU allocations for projector & mu… unitaryHACK26 by thedaemon-wizard · Pull Request #783 · QuEST-Kit/QuEST

thedaemon-wizard · 2026-06-07T14:35:33Z

Profile and optimise-away small GPU allocations

Closes #749

Summary

The single-GPU backend copied the qubit-index list from host to device
(cudaMalloc + cudaMemcpyAsync + cudaFree) on every call of several
multi-qubit operations, via the getDevInts() helper. For small Quregs this
fixed allocation latency dominates the actual kernel runtime — exactly the
overhead issue #749 asks us to profile and remove.

This PR eliminates that per-call copy for the two operations agreed in scope,
following the register-resident-bitmask pattern already used by
thrust_statevec_calcExpecAnyTargZ_sub:

Operation	Public API	Subroutine	Technique
Multi-qubit projector	`applyMultiQubitProjector`	`thrust_{statevec,densmatr}_multiQubitProjector_sub`	reformulate the per-qubit test as two primitive bitmasks (no list at all)
Multi-qubit outcome probability	`calcProbOfMultiQubitOutcome`	`thrust_{statevec,densmatr}_calcProbOfMultiQubitOutcome_sub`	carry the tiny sorted list by value inside the Thrust functor

All changes are confined to a single file: quest/src/gpu/gpu_thrust.cuh
(52 insertions, 52 deletions). No public API, dispatch, or thrust_* signatures
change — only the internals.

Design

Projectors — bitmask reformulation (removes the copy for all sizes)

A projector keeps an amplitude iff its target qubits match the requested
outcomes. The original device-array test

getValueOfBits(n, targetsPtr, numBits) == retainValue

is mathematically identical to the register-only primitive

(n & qubitMask) == valueMask

where qubitMask = util_getBitMask(qubits) flags the target positions and
valueMask = util_getBitMask(qubits, outcomes) holds the desired outcome bits.
Both masks are plain qindex scalars passed as kernel arguments, so no device
array is allocated at all. For the density-matrix functor the same masks are
applied to both the row and column substates:

qreal fac = renorm * ((r & qubitMask) == valueMask) * ((c & qubitMask) == valueMask);

Outcome probability — pass the list by value (removes the copy for all sizes)

calcProbOfMultiQubitOutcome uses functor_insertBits, whose
insertBitsWithMaskedValues is a scatter that needs the actual sorted qubit
positions (a bitmask is insufficient). Instead of allocating a
device_vector, the functor now stores the positions in a List64 by value
— a trivially-copyable, fixed-size (int[64]), CUDA-kernel-compatible struct
that already exists in the codebase precisely for this purpose
(quest/src/core/lists.hpp). The list rides along as a kernel argument, so
cudaMalloc/cudaMemcpy disappear for every qubit count (the previous code only
avoided the copy never — it always called getDevInts).

getDevInts() itself is untouched and still used by ~20 other operations in
gpu_subroutines.cpp that are out of scope for this issue.

Results (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120)

Nsight Systems — CUDA API call counts

Trace of N = 8…12, 200 reps each, both operations, captured with
Nsight Systems 2025.3.2 (nsys profile + per-call CUDA runtime API counts):

CUDA runtime call	baseline	optimised	change
`cudaMalloc`	3020	1010	−66%
`cudaFree`	3021	1011	−66%
`cudaMemcpyAsync`	3015	1005	−67%
`cudaLaunchKernel`	2025	2025	unchanged (identical compute)

The ~2000 eliminated allocations are exactly the two per-call qubit-list copies
(one in the projector, one in the probability path). The residual ~1000
cudaMalloc in the optimised build is thrust::reduce's own internal temporary
in calcProb — inherent to Thrust and out of scope. cudaLaunchKernel is
unchanged, confirming the kernels themselves are untouched.

Wall-clock per call (microseconds), `numTargs = 3`, 2000 reps

N	projector base	projector opt	speedup	prob base	prob opt	speedup
4	12.41	6.48	1.92×	20.35	14.59	1.39×
8	11.74	6.40	1.83×	19.44	14.18	1.37×
12	11.85	6.52	1.82×	19.79	14.44	1.37×
16	12.22	6.84	1.79×	28.63	23.21	1.23×
20	58.41	13.74	4.25×	67.99	61.99	1.10×

In the small-Qureg regime the projector is consistently ~1.8× faster and the
probability calc ~1.4× faster, purely from removing the allocation. (The
large baseline jump at N≥17 is the allocation interacting with the now-larger
state kernels; removing it also removes that cliff.)

Correctness

Built with -D QUEST_ENABLE_CUDA=ON -D QUEST_BUILD_TESTS=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release against CUDA 13.0.
The unit tests for the affected operations pass across all four deployments
(CPU, CPU+OpenMP, GPU, GPU+OpenMP):

./build/tests/tests "*QubitProjector*,*calcProbOfMultiQubitOutcome*,*calcProbOfQubitOutcome*"
# All tests passed (57146 assertions in 8 test cases)

(A correct sm_120 build is also implicitly validated, since a wrong architecture
silently corrupts GPU results.)

Re-verified end-to-end on a freshly re-cloned devel (HEAD b9830592) with the
single-file change re-applied: configure, build, and the test command above all
pass unchanged.

How to reproduce the measurements

The numbers above come from a small standalone driver (kept out of this PR to
preserve the single-file diff) that, against both a clean origin/devel build and
this branch:

builds QuEST with -D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release;
for each N, allocates a Qureg, then times a loop of applyMultiQubitProjector
and calcProbOfMultiQubitOutcome (numTargs = 3) with a CUDA-synchronised
wall clock over many reps (per-call µs in the table);
wraps a shorter run under nsys profile and tallies CUDA runtime API calls
(cudaMalloc/cudaFree/cudaMemcpyAsync/cudaLaunchKernel) for the table above.

Happy to share the driver/scripts separately if useful for CI; they are not part of
this change.

Notes for reviewers

Base branch is devel (the active unitaryHACK branch and the only one that
builds on CUDA 13 — main/v4.2 still uses thrust::binary_function, removed
in CUDA 13's libcu++).
This optimises QuEST's native Thrust GPU backend (the path taken when
cuQuantum is not enabled). cuStateVec is a separate backend, not what Profile and optimise-away small GPU allocations #749
concerns; it was left disabled for these measurements. (For the record, as of
2026 cuStateVec does support CUDA 13 and Blackwell, so this is a scope choice,
not a compatibility workaround.)
This aligns with the in-flight "James' GPU refactor" placeholder note on
getDevInts in gpu_thrust.cuh.

AI usage disclosure

Per unitaryHACK's AI guide ("human-in-the-loop";
honesty required for bounty eligibility): an AI coding assistant (Anthropic Claude,
via Claude Code) was used as a co-pilot for parts of this work — to help survey the
gpu_thrust.cuh code paths, brainstorm the bitmask/List64-by-value reformulation,
and draft this PR description and the profiling methodology. It
was not the author of record: every change was reviewed, compiled, and tested by
me on real hardware (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120). The diff is a single
file (+52/−52), the algebraic equivalence of the bitmask reformulation was checked by
hand, and correctness was confirmed by the upstream unit tests passing across all four
deployments (CPU / CPU+OpenMP / GPU / GPU+OpenMP). No unverified or copy-pasted AI
output is included.

unitaryHACK 2026 checklist

PR description links the issue (Closes #749).
Code is compiled and tested on real hardware (not unverified AI output).
AI assistance disclosed (see "AI usage disclosure" above), per the AI guide.
Scope kept tight; ≤ 4 open PRs; GitHub activity public.

…lti-qubit prob

TysonRayJones · 2026-06-09T04:03:41Z

This is a wonderful diff - I'm kicking myself for not noticing functor_projectStateVec wasn't even leveraging orderedness! 🎉

Can you please share the mentioned driver/scripts for benchmarking? Can either whack it into a comment here, or include it into the diff (which we can delete later - changes will be squashed so it won't pollute your work).

TysonRayJones · 2026-06-09T04:10:19Z

Note to self

The template parameters of the below functions and functors are now redundant:

functor_projectStateVec
functor_projectDensMatr
thrust_statevec_multiQubitProjector_sub
thrust_densmatr_multiQubitProjector_sub
gpu_statevec_multiQubitProjector_sub
gpu_densmatr_multiQubitProjector_sub

They can all be removed, along with the parameter dispatch in accel_statevec_multiQubitProjector_sub and accel_densmatr_multiQubitProjector_sub. I can do this myself in a cleanup commit (unless @thedaemon-wizard wishes to do it!)

…review) After the bitmask reformulation the projector no longer specialises on the target count, so its numTargs template is dead code: - drop the template from functor_projectStateVec/functor_projectDensMatr, thrust_{statevec,densmatr}_multiQubitProjector_sub and gpu_{statevec,densmatr}_multiQubitProjector_sub, and their INSTANTIATE_FUNC_OPTIMISED_FOR_NUM_TARGS instantiations; - apply the same bitmask reformulation to the CPU projector (cpu_{statevec,densmatr}_multiQubitProjector_sub) so its template goes too; - simplify accel_{statevec,densmatr}_multiQubitProjector_sub to a plain isGpuAccelerated ? gpu_ : cpu_ branch (no GET_CPU_OR_GPU_FUNC dispatch). Shared dispatch macros and the calcProb* template chain are untouched. Unit tests pass on CPU/CPU+OMP/GPU/GPU+OMP (57146 assertions). Adds a throw-away benchmarks/ driver (not wired into CMake/CI); safe to squash/drop on merge.

thedaemon-wizard · 2026-06-09T07:28:03Z

This is a wonderful diff - I'm kicking myself for not noticing functor_projectStateVec wasn't even leveraging orderedness! 🎉

Can you please share the mentioned driver/scripts for benchmarking? Can either whack it into a comment here, or include it into the diff (which we can delete later - changes will be squashed so it won't pollute your work).

Thanks @TysonRayJones! Glad it's useful. 🎉

I've added the driver to the PR under benchmarks/benchmark_749.cpp (with a
short benchmarks/README.md). It's deliberately not wired into CMake/CI — happy
for it to be deleted in the squash, as you suggested. It builds straight against
QuEST via the built-in USER_SOURCE_NAMES mechanism:

cmake -S . -B build_bench \
    -D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 \
    -D CMAKE_BUILD_TYPE=Release \
    -D USER_SOURCE_NAMES=benchmarks/benchmark_749.cpp \
    -D USER_OUTPUT_EXE_NAME=bench_749
cmake --build build_bench --target bench_749 -j
./build_bench/bench_749 4 20 3 2000        # [minQ maxQ numTargs reps]

It forces the single-GPU path (useGpuAccel=1, distribution/threads off) and
syncQuESTEnv()s around each timed region so it measures completed GPU work. Build
it once against clean origin/devel and once against this branch for before/after.

On my machine (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120):

CUDA runtime API counts (N = 8…12, 200 reps, both ops, via nsys):

CUDA runtime call	baseline	optimised	change
`cudaMalloc`	3020	1010	−66%
`cudaFree`	3021	1011	−66%
`cudaMemcpyAsync`	3015	1005	−67%
`cudaLaunchKernel`	2025	2025	unchanged

Per-call wall time (µs, numTargs = 3, 2000 reps):

N	projector base	projector opt	speedup	prob base	prob opt	speedup
4	12.41	6.48	1.92×	20.35	14.59	1.39×
8	11.74	6.40	1.83×	19.44	14.18	1.37×
12	11.85	6.52	1.82×	19.79	14.44	1.37×
16	12.22	6.84	1.79×	28.63	23.21	1.23×
20	58.41	13.74	4.25×	67.99	61.99	1.10×

(The residual ~1000 cudaMalloc in the optimised build is thrust::reduce's own
internal temporary inside calcProb — inherent to Thrust, out of scope here.)

thedaemon-wizard · 2026-06-09T07:29:17Z

Note to self

The template parameters of the below functions and functors are now redundant:
* `functor_projectStateVec`

* `functor_projectDensMatr`

* `thrust_statevec_multiQubitProjector_sub`

* `thrust_densmatr_multiQubitProjector_sub`

* `gpu_statevec_multiQubitProjector_sub`

* `gpu_densmatr_multiQubitProjector_sub`
They can all be removed, along with the parameter dispatch in accel_statevec_multiQubitProjector_sub and accel_densmatr_multiQubitProjector_sub. I can do this myself in a cleanup commit (unless @thedaemon-wizard wishes to do it!)

Done — I went ahead and removed them (pushed in a follow-up commit). Summary:

Dropped the now-dead template parameter from functor_projectStateVec,
functor_projectDensMatr, thrust_{statevec,densmatr}_multiQubitProjector_sub,
and gpu_{statevec,densmatr}_multiQubitProjector_sub, and removed their
INSTANTIATE_FUNC_OPTIMISED_FOR_NUM_TARGS instantiations.
Simplified accel_{statevec,densmatr}_multiQubitProjector_sub to a plain
qureg.isGpuAccelerated ? gpu_… : cpu_… branch (matching the style already used
elsewhere in accelerator.cpp), so the GET_CPU_OR_GPU_FUNC_OPTIMISED_FOR_ONE_PARAM
dispatch is gone for the projector. I left the shared dispatch macros untouched
since packAmpsIntoBuffer, partialTrace_sub and the calcProb* family still
rely on them.

One thing to confirm: your note listed the GPU-side functions, but the
GET_CPU_OR_GPU_FUNC_… dispatch also fans out to the CPU projector, which was
still using its template param (SET_VAR_AT_COMPILE_TIME to unroll
getValueOfBits). To remove the dispatch cleanly I applied the same bitmask
reformulation to the CPU projector too:
getValueOfBits(n, qubits) == retainValue ≡ (n & qubitMask) == valueMask
(and the density-matrix (v1==v2) && (retainValue==v1) becomes
(r & qubitMask)==valueMask && (c & qubitMask)==valueMask), with
qubitMask = util_getBitMask(qubits) and valueMask = util_getBitMask(qubits, outcomes).
That makes the CPU template redundant as well, so it could be removed symmetrically.
If you'd rather keep the CPU path templated/unrolled, say the word and I'll instead
branch only the GPU side in accel_* and leave the CPU dispatch as-is.

A couple of notes for the record:

The CPU reformulation drops the per-amp inner loop over targets in favour of two
mask compares, so it shouldn't regress (and removes the getValueOfBits unroll
entirely); happy to micro-benchmark the CPU side if useful.
Removing the seven <0>…<5>,<-1> instantiations per projector function also
trims a little compile time / object-code, with no runtime cost since the
projector no longer benefits from compile-time numTargs unrolling.

Verification (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120, Release): rebuilt
clean, and the affected unit tests pass across all four deployments —
tests "*QubitProjector*,*calcProbOfMultiQubitOutcome*,*calcProbOfQubitOutcome*"
→ All tests passed (57146 assertions in 8 test cases) (CPU / CPU+OpenMP / GPU /
GPU+OpenMP).

Also confirming: I'm fine with the benchmarks/ driver being deleted in the squash —
just let me know if you'd prefer I drop it from the branch now instead.

TysonRayJones · 2026-06-15T16:09:24Z

Hi there Amon,

It may take us a little while to integrate this - we have an idea how to further optimise a final remaining cudaAsyncMcpy beign invoked by functor_insertBits within thrust_(statevec|densmatr)_calcProbOfMultiQubitOutcome_sub (potentially using instrinsics, like in #717)! Or we may end up merging this as-is, and optimising later.

In any case, @JPRichings has profiled your solution and confirmed that you have satisfied the unitaryHACK challenge 🎉 🎉 Please comment on the issue (#749) so that we can assign it to you, awarding you the challenge. Nice work!

…-Kit#717) Open the one-bit gap with a single shared low mask + a shift-by-one of the high bits, instead of the equivalent right/left-shift + nested concatenateBits. Algebraically identical (cannot change results); shaves a couple of ops off every unrolled insertBits iteration, benefiting all callers. Op-count/readability cleanup, not a measured speed win. Verified against the full unit-test suite (2,293,050 assertions) across CPU/CPU+OMP/GPU/GPU+OMP; the one unrelated failure (setQuESTNumGpuThreadsPerBlock) is pre-existing on devel.

thedaemon-wizard · 2026-06-16T14:08:10Z

Thanks so much @TysonRayJones, and thanks @JPRichings for profiling it — delighted it satisfies the challenge! 🎉

On the further optimisation: I dug into it and wanted to share what I found, since there are actually two separate things bundled in there.

The remaining cudaMemcpyAsync in thrust_*_calcProbOfMultiQubitOutcome_sub is the thrust::reduce result transfer — it copies the single reduced qreal back to the host, which is unavoidable while the function returns the probability to the caller. (The List64-by-value change already removed the avoidable per-call copy, the host→device qubit list.) So I don't think that particular memcpy can be dropped without changing the API contract.
The Optimise getValueOfBits and insertBits #717 angle (optimising getValueOfBits/insertBits) is really a compute win: the per-element bit-deposit functor_insertBits runs via insertBitsWithMaskedValues → insertBits → insertBit. There's no native GPU PDEP/PEXT, so a single-instruction deposit isn't available, but I took a low-risk first pass: I rewrote insertBit to open the gap with one shared low mask + a shift-by-one of the high bits, instead of the equivalent right/left-shift + nested concatenateBits. It's algebraically identical (so it can't change results) and shaves a couple of ops off every unrolled iteration, benefiting all insertBits callers.
One thing that is squarely in Profile and optimise-away small GPU allocations #749's "small allocations" spirit: the residual per-call cudaMalloc I mentioned earlier is thrust::reduce's internal temp-storage allocation. That one could be pooled away with a Thrust caching allocator / thrust::cuda::par(pool), a reused CUB DeviceReduce temp buffer, or the stream-ordered allocator (cudaMallocAsync + a mem pool) — which would remove the temp alloc across every thrust::reduce/sort site, not just calcProb. That's a broader, library-wide change so I didn't want to sneak it into this PR, but I'm very happy to take it on as a separate follow-up if you think it's worthwhile.

I verified the insertBit tidy against the entire unit-test suite across all four deployments (CPU / CPU+OpenMP / GPU / GPU+OpenMP): 2,293,050 assertions pass, every projector / calcProb / controlled-gate / partial-trace case green. In full transparency, it's an op-count/readability cleanup rather than a speed win — wall-clock is unchanged within measurement noise at the scales I benchmarked (the deposit isn't the bottleneck for these ops), so please treat it purely as a tidy-up in the spirit of #717. I've kept it on its own commit (9b87ec3e) so it's trivial to drop or fold into your planned cleanup — merge as-is, squash, or cherry-pick, whatever's easiest for your integration. The larger magic-mask/butterfly deposit (which would mainly help the generic large-numTargs path and needs per-call mask precomputation) I'm happy to leave to you, or to attempt in a follow-up if you'd like.

One small unrelated heads-up from running the whole suite on this box (RTX PRO 6000 Blackwell, CUDA 13): the setQuESTNumGpuThreadsPerBlock "Exceeds device maximum" test (tests/unit/experimental.cpp) fails on devel independently of this PR — it passes 999999, which trips the warp-divisibility check (999999 % 32 ≠ 0) before the max-value check, so the thrown message doesn't match the expected "Exceeds the hardware-imposed maximum". A multiple-of-32 value above the device max would exercise the intended path. Happy to open a tiny separate PR for that if useful.

JPRichings · 2026-06-16T14:22:13Z

Thanks, for this additional work. we have also spotted the setQuESTNumGpuThreadsPerBlock issue happy for you to open a separate pull request but I have fixed it in a separate PR that we are looking to pull in the devel in the next few days.

thedaemon-wizard · 2026-06-16T14:34:52Z

Thanks @JPRichings! Great to hear it's already fixed — I'll skip opening a separate PR for setQuESTNumGpuThreadsPerBlock then, to avoid duplicating your change, and just exclude that pre-existing failure from my own runs. Looking forward to it landing in devel.

Thanks again to you both for the thorough review and for profiling the solution.

TysonRayJones · 2026-06-16T15:50:03Z

Oh good call about the remaining memcpy. There is a precedent for having a persistent user-clearable GPU cache we can potentially re-use. Let's think through that in a separate PR however.

Alas I'm not convinced your insertBit change actually does anything except reduce readability. Inlining and compiler optimisation (removing the two superfluous shifts) makes the new and old code identical in their bit operations. Please revert that diff, and we'll proceed to cleaning this up when we get a sec! 🙏

Update code to Optimise-away small GPU allocations for projector & mu…

1bdb7a8

…lti-qubit prob

thedaemon-wizard changed the title ~~Update code to Optimise-away small GPU allocations for projector & mu…~~ Update code to Optimise-away small GPU allocations for projector & mu… unitaryHACK26 Jun 11, 2026

thedaemon-wizard mentioned this pull request Jun 16, 2026

Profile and optimise-away small GPU allocations #749

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update code to Optimise-away small GPU allocations for projector & mu… unitaryHACK26#783

Update code to Optimise-away small GPU allocations for projector & mu… unitaryHACK26#783
thedaemon-wizard wants to merge 3 commits into
QuEST-Kit:develfrom
thedaemon-wizard:optimise-small-gpu-allocations-749

thedaemon-wizard commented Jun 7, 2026

Uh oh!

TysonRayJones commented Jun 9, 2026

Uh oh!

TysonRayJones commented Jun 9, 2026

Uh oh!

thedaemon-wizard commented Jun 9, 2026 •

edited

Loading

Uh oh!

thedaemon-wizard commented Jun 9, 2026 •

edited

Loading

Uh oh!

TysonRayJones commented Jun 15, 2026 •

edited

Loading

Uh oh!

thedaemon-wizard commented Jun 16, 2026

Uh oh!

JPRichings commented Jun 16, 2026

Uh oh!

thedaemon-wizard commented Jun 16, 2026 •

edited

Loading

Uh oh!

TysonRayJones commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

thedaemon-wizard commented Jun 7, 2026

Profile and optimise-away small GPU allocations

Summary

Design

Projectors — bitmask reformulation (removes the copy for all sizes)

Outcome probability — pass the list by value (removes the copy for all sizes)

Results (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120)

Nsight Systems — CUDA API call counts

Wall-clock per call (microseconds), numTargs = 3, 2000 reps

Correctness

How to reproduce the measurements

Notes for reviewers

AI usage disclosure

unitaryHACK 2026 checklist

Uh oh!

TysonRayJones commented Jun 9, 2026

Uh oh!

TysonRayJones commented Jun 9, 2026

Uh oh!

thedaemon-wizard commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thedaemon-wizard commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TysonRayJones commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thedaemon-wizard commented Jun 16, 2026

Uh oh!

JPRichings commented Jun 16, 2026

Uh oh!

thedaemon-wizard commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TysonRayJones commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Wall-clock per call (microseconds), `numTargs = 3`, 2000 reps

thedaemon-wizard commented Jun 9, 2026 •

edited

Loading

thedaemon-wizard commented Jun 9, 2026 •

edited

Loading

TysonRayJones commented Jun 15, 2026 •

edited

Loading

thedaemon-wizard commented Jun 16, 2026 •

edited

Loading