Optimise getValueOfBits and insertBits with BMI2 PEXT/PDEP (#717)#796
Optimise getValueOfBits and insertBits with BMI2 PEXT/PDEP (#717)#796nez0b wants to merge 2 commits into
Conversation
…#717) Replace the per-amplitude looped bit gather/scatter in the CPU statevector/ density-matrix kernels with x86 BMI2 PEXT/PDEP, hoisting the loop-invariant masks out of the 2^N loops so each per-amplitude call becomes one instruction: - insertBitsWithMaskedValues sites: compute the position mask once per gate and use _pdep_u64 (order-invariant scatter; unconditionally correct). - getValueOfBits sites: _pext_u64 when the qubits are strictly increasing, falling back to the original scalar loop otherwise (order is preserved). Portability: BMI2 is opt-in via a new QUEST_ENABLE_BMI2 CMake option, OFF by default, so a default build stays portable scalar (no BMI2 in the binary, no SIGILL on pre-BMI2 CPUs). Enabling it wires -mbmi2 through the library; the intrinsics are additionally guarded to x86 host TUs (never CUDA/HIP device code). The scalar fallback is byte-identical, and QUEST_BITWISE_FORCE_SCALAR forces it on a BMI2-capable host. Tests/benchmark: tests/unit/bitwise.cpp asserts the new helpers are bit-identical to the originals over exhaustive-small, randomised, and boundary inputs (bits 31/32/61/62/63 incl. the int64 sign bit); examples/automated adds a cross-platform benchmark that prints timings and which path was compiled in. Bit-identical to the scalar path (verified by unit tests, the QuEST suite for the touched kernels, and amplitude hashes of QFT/random/Grover/VQE circuits). Closes QuEST-Kit#717
|
Wew nice work! 🎉 Great to see thought put into the call sites! I have tailored the CI to run with the new instrinsic pathway, and will manually test on another Intel machine. Interestingly, this diff reveals two kinds of callers of
This is a very important difference. In the un-conditional case, everything can be fully inlined, and
I believe the current design is actually simultaneously the cleanest and best performing (when the compiler isn't stoopid, and when the function doesn't have a natural always-sorted formulation). I'm not above mostly for my own reference/history! I will make some cleanup changes to this PR, and possibly make an extension (e.g. an O(1) fallback alternative to the instrinsic). No more changes are necessary for unitaryHACK however! 🎉 Please comment on issue #717 so I can assign it to you! |
| // use template param to compile-time unroll loop in insertBits() | ||
| SET_VAR_AT_COMPILE_TIME(int, numBits, NumQubits, qubitInds.size()); | ||
|
|
||
| qindex qubitsPosMask = getBitMask(sortedQubitInds.data(), numBits); // loop-invariant: hoisted out of the per-amplitude loop |
There was a problem hiding this comment.
TODO: remove // loop-invariant: hoisted out of the per-amplitude loop comments
| * reuses the original unrolled scalar routines and stays byte-identical. | ||
| */ | ||
|
|
||
| INLINE qindex insertBitsWithMaskedValuesAndPosMask(qindex number, qindex valueMask, [[maybe_unused]] qindex posMask, [[maybe_unused]] const int* bitInds, [[maybe_unused]] int numBits) { |
There was a problem hiding this comment.
TODO: consider renaming (perhaps the existing insertBitsWithMaskedValues should be renamed too for better disambiguation)
| #ifdef QUEST_BITWISE_USE_BMI2 | ||
| return valueMask | (qindex) _pdep_u64((unsigned long long) number, ~ (unsigned long long) posMask); | ||
| #else | ||
| return valueMask | insertBits(number, bitInds, numBits, 0); |
There was a problem hiding this comment.
TODO: consider replacing this with O(1) bespoke masked logic to make bitInds,numBits always redundant
|
|
||
| // Checked once per gate (loop-invariant), never per amplitude: getValueOfBits is order-sensitive, | ||
| // so the PEXT path above is valid only when bitInds are strictly increasing. | ||
| INLINE bool isStrictlyIncreasing(const int* bitInds, int numBits) { |
There was a problem hiding this comment.
TODO: move this to utilities (no need to inline; it's called once per backend function)
|
|
||
| qindex ctrlsPosMask = getBitMask(sortedCtrls.data(), numCtrlBits); // loop-invariant: hoisted out of the per-amplitude loop | ||
| qindex targsPosMask = getBitMask(targs.data(), numTargBits); // loop-invariant: hoisted out of the per-amplitude loop | ||
| bool targsSorted = isStrictlyIncreasing(targs.data(), numTargBits); // likewise loop-invariant (order checked once per gate) |
There was a problem hiding this comment.
TODO: Mark targsSorted as const to improve compiler-chance of loop unrolling
|
|
||
| // t = value of targeted bits, which may be in the prefix substate | ||
| qindex t = getValueOfBits(i, targs.data(), numTargBits); | ||
| qindex t = (targsSorted ? getValueOfBitsFromSortedPosMask(i, targsPosMask, targs.data(), numTargBits) : getValueOfBits(i, targs.data(), numTargBits)); |
There was a problem hiding this comment.
TODO: add comment (and to other invocations) that we anticipate the compiler to unroll the loop(branch) into branched(loops), or otherwise that branch prediction will be so good as to remove all hot-loop-branching penalty
There was a problem hiding this comment.
TODO: bitwise is not an API module, so we probably will not expose a dedicated test file like this. No need when the existing operator tests critically depend upon the bitwise function's correctness
| # When the user opts in via QUEST_ENABLE_BMI2, compile only the issue-#717 bitwise test with -mbmi2 so | ||
| # it exercises the actual PEXT/PDEP path; otherwise (the default) it exercises the scalar fallback. The | ||
| # assertions hold identically either way. CMP0118: source properties are visible to the target's | ||
| # directory, so set it with explicit TARGET_DIRECTORY for the parent-scope 'tests' target. | ||
| if (QUEST_ENABLE_BMI2) | ||
| include(CheckCXXCompilerFlag) | ||
| check_cxx_compiler_flag("-mbmi2" QUEST_TEST_SUPPORTS_MBMI2) | ||
| if (QUEST_TEST_SUPPORTS_MBMI2) | ||
| set_source_files_properties( | ||
| bitwise.cpp | ||
| TARGET_DIRECTORY tests | ||
| PROPERTIES COMPILE_OPTIONS "-mbmi2" | ||
| ) | ||
| endif() | ||
| endif() No newline at end of file |
There was a problem hiding this comment.
TODO: redundant if we remove bitwise.cpp test
| # instead supplies their own -march=native still gets the fast path on their own CPU. | ||
| if (QUEST_ENABLE_BMI2) | ||
| include(CheckCXXCompilerFlag) | ||
| check_cxx_compiler_flag("-mbmi2" QUEST_COMPILER_SUPPORTS_MBMI2) |
There was a problem hiding this comment.
QUEST_COMPILER_SUPPORTS_MBMI2 is a local var, so should become quest_compiler_supports_bmi2
| if (QUEST_COMPILER_SUPPORTS_MBMI2) | ||
| target_compile_options(QuEST PRIVATE $<$<COMPILE_LANGUAGE:CXX>:-mbmi2>) | ||
| else() | ||
| message(WARNING "QUEST_ENABLE_BMI2=ON but the compiler does not accept -mbmi2; building the scalar fallback.") |
There was a problem hiding this comment.
TODO: error instead of warn here, unless we change BMI2 to be ON by default
| option( | ||
| QUEST_ENABLE_BMI2 | ||
| "Whether QuEST will accelerate CPU bit gather/scatter with x86 BMI2 (PEXT/PDEP) intrinsics (issue #717). Turned OFF by default; when ON, the resulting binary requires a BMI2-capable CPU at runtime." | ||
| OFF |
There was a problem hiding this comment.
TODO: consider enabling by default when detectedly supported by compiler
Closes #717.
Summary
This adds guarded x86 BMI2 fast paths for QuEST's hot bit gather/scatter helpers, it hoists the loop-invariant work out of the 2^N statevector loops so each call collapses to a single instruction:
insertBitsWithMaskedValues(...)(the gate-application path): I build the position mask once pergate and use
_pdep_u64, instead of rebuilding it on every amplitude.getValueOfBits(...)(measurement / diagonal-matrix / projector):_pext_u64when the qubits arestrictly increasing, falling back to the original loop otherwise.
CMakeLists.txtadds aQUEST_ENABLE_BMI2option (off by default): a default build staysportable scalar — no BMI2 in the binary, so it can't
SIGILLon a pre-BMI2 CPU — and-DQUEST_ENABLE_BMI2=ONwires-mbmi2(PEXT/PDEP) through the library (which then requires aBMI2-capable CPU at runtime).
The earlier PRs rebuilt the mask / re-checked sortedness inside
the per-amplitude function; here those are loop-invariant and computed once.
A note on impact. In practice this is a modest optimisation, and I want to be upfront about
that. The per-call speedup is large (6–12× in cache), but non-trivial state-vector simulation is
memory-bandwidth-bound, so most of that win is hidden: end-to-end I measure only ~1.0–1.3×
single-threaded, collapsing toward 1× (occasionally below) once the state is DRAM-bound or the run
saturates memory bandwidth across threads. It is probably not a large win for
big production runs.
Benchmarks
"Per call" means one invocation of the bit gather/scatter helper (
getValueOfBits/insertBitsWithMaskedValues). The plot below isolates that instruction. it's the bit-op run in a tightloop with the result kept in a register, no statevector traffic.
Pure bit-op speedup: PDEP/PEXT are about 6–12× faster than the scalar loops in cache (k=6, single thread),
compressing to ~4–7× at the DRAM wall (one thread can't saturate memory):
That isolated number is the ceiling, not what a gate sees — inside a real kernel the bit-op only
computes an index, and the strided complex-amplitude load/store it gates dominates the cost (see Thread
scaling below). The three views form one memory-roofline ladder — isolated instruction ~6–12× →
inside a gate ~2.5× → whole circuit ~1.0–1.3× — each diluted by progressively more memory traffic.
End-to-end on whole circuits (AI assisted benchmark generation) (Xeon 6448H, single thread, all bit-identical to the scalar build):
So the end-to-end win is modest — roughly 1.0–1.3× single-threaded, biggest where index math
dominates (controlled-phase-heavy circuits) on cache-resident states, and it shrinks toward 1× (and
occasionally below — e.g. 0.84× at nq=20 / 32 threads) once the state is DRAM-bound or threads
saturate bandwidth.
Thread scaling — the bit-op inside a real gate kernel (Xeon 6448H, L3-resident
q=18). Unlike theisolated plot above, each call here also reads/writes the strided complex amplitudes it indexes (~64 B of
scattered memory per call vs the plot's single sequential 8-byte read), so memory, not the instruction,
sets the pace:
Even at 1 thread this is only 2.5–3.9× — not the ~7–8× the isolated plot shows at q=18 — because the
strided amplitude traffic already dominates; it then erodes toward ~1× by 128 threads as bandwidth
saturates. (The 1–32-thread runs are pinned to one idle-gated socket; 64/128 spill across this shared
node's other sockets, so read those two columns as trend rather than precise figures.)
I also tried VPSHUFBITQMB (AVX-512)
Because
getValueOfBitscan be handed unsorted qubits, I benchmarked the AVX-512 arbitrary-ordergather (
VPSHUFBITQMB, BITALG) as an alternative to PEXT. Per-call, in cache (gather):I didn't use it, for three reasons: (1) it needs AVX-512 BITALG (Ice Lake-SP+ / Zen 4+), so it would
need runtime CPU dispatch rather than a compile-time guard; (2) for the common sorted case PEXT is
already faster per call; (3) the only thing it uniquely buys — the unsorted gather — is a tiny slice
of the work. Instrumenting the four circuit families, scatter (PDEP) outweighs gather by ~58:1,
and the unsorted part is ~0.5% of bitwise work. BMI2-only keeps the diff small and portable for the
whole win.
Tests
tests/unit/bitwise.cpp— checks the new mask-accepting helpers are bit-identical to the originalgetValueOfBits/insertBitsWithMaskedValuesover exhaustive-small and randomised inputs, plusdeterministic boundary cases at bits 31/32/61/62/63 (incl. the int64 sign bit), and
isStrictlyIncreasing. Built with-DQUEST_ENABLE_BMI2=ONit's compiled with-mbmi2, so itexercises the real PEXT/PDEP path; 41,849 assertions pass (and pass equally on the scalar fallback).
examples/automated/benchmark_bitwise_bmi2.cpp— a quick benchmark that prints the timings andwhich path was compiled in, so a non-x86 / non-
-mbmi2CI runner just prints the scalar timingsrather than hitting
SIGILL. Runs in well under a second.AI assistance
I used an AI assistant to help survey the candidate instructions (PEXT/PDEP, VPSHUFBITQMB, GFNI), draft
the hoisting and the compile guards, and assemble the benchmarks. I reviewed the diff myself, kept the
arbitrary-order semantics of
getValueOfBits, and ran the validation above before opening this.Closes #717