Implement CPU Fused Multi-SWAP with Lightweight Pipeline Buffer Reuse (contributes to #595) #794
zhangchuan92910 wants to merge 1 commit into
You've made an extraordinarily large diff - it looks like you ran the source code through a linter. Please disable this in your AI or IDE, and commit only the diff relevant to the feature.
My sincere apologies for that! Thank you for pointing it out; I will be more cautious when running the source code through AI tools. I have just disabled the linter, reverted the formatting noise, and force-pushed a clean commit. The diff now contains only the 148 lines relevant to the lightweight Fused Multi-SWAP logic. Please let me know if there's anything else to adjust!
Close #595
Overview
This PR implements Fused Multi-SWAP (SWAP fusion) for the distributed CPU backend, directly addressing the communication bottleneck outlined in #595. Instead of executing disjoint SWAPs sequentially (which wastefully forces amplitudes to cross the network multiple times), this implementation determines the final destination of each amplitude and sends it exactly once.
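To illustrate the "send exactly once" idea, here is a minimal sketch of how the final destination of an amplitude could be computed for a set of disjoint SWAPs. It is not taken from this PR's diff: the names `swapBits`, `fusedDestination`, and `ownerRank` are hypothetical, and the sketch assumes the usual distributed layout in which the high bits of a global amplitude index select the owning rank.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using Index = std::uint64_t;

// Swap bits a and b of the global amplitude index i.
// Hypothetical helper, not a QuEST function.
inline Index swapBits(Index i, int a, int b) {
    Index diff = ((i >> a) ^ (i >> b)) & 1ULL;  // 1 only if the two bits differ
    return i ^ ((diff << a) | (diff << b));     // flip both bits when they differ
}

// Final global index of an amplitude once every disjoint SWAP has been applied.
// Because the SWAPs act on disjoint qubits, the order of application is irrelevant.
inline Index fusedDestination(Index i,
                              const std::vector<std::pair<int, int>>& swaps) {
    for (const auto& s : swaps) {
        i = swapBits(i, s.first, s.second);
    }
    return i;
}

// Rank that owns global index i, assuming each rank stores
// 2^numLocalQubits contiguous amplitudes (an assumption of this sketch).
inline int ownerRank(Index i, int numLocalQubits) {
    return static_cast<int>(i >> numLocalQubits);
}
```

As a worked case under these assumptions: with 5 qubits, 3 local qubits per rank (4 ranks), and SWAPs on qubit pairs (0,3) and (1,4), the amplitude at global index 3 on rank 0 ends at index 24 on rank 3. Applying the two SWAPs one after the other would first move it to index 10 on rank 1 and then on to rank 3, so the fused version saves one network hop for that amplitude.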
Algorithmic Approach & Acknowledgements
First, I would like to acknowledge the excellent work in PR #785 and PR #790. The core communication topology utilized in this PR shares the exact same mathematical foundation: the $2^k-1$ subcube XOR partner exchanges, which is a brilliant algorithmic choice for hypercube topologies.
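For readers unfamiliar with the subcube XOR pattern, the following sketch enumerates the $2^k-1$ partners of a rank. It is an illustration only, not code from this PR or from PR #785/#790: the function name `xorPartners` and the parameter `rankBits` (the bit positions within the rank index that the $k$ SWAPs touch) are assumptions made for the example.

```cpp
#include <vector>

// Enumerate the 2^k - 1 partners of `rank` inside the subcube spanned by the
// given rank-bit positions. Each nonzero subset of those bits yields one XOR
// mask, and the partner for that step is rank ^ mask.
// Hypothetical helper, not a QuEST function.
inline std::vector<int> xorPartners(int rank, const std::vector<int>& rankBits) {
    const int k = static_cast<int>(rankBits.size());
    std::vector<int> partners;
    for (int m = 1; m < (1 << k); ++m) {        // every nonzero subset of the k bits
        int mask = 0;
        for (int j = 0; j < k; ++j) {
            if ((m >> j) & 1) {
                mask |= 1 << rankBits[j];       // scatter subset bit j to its rank bit
            }
        }
        partners.push_back(rank ^ mask);
    }
    return partners;
}
```

For example, rank 5 with `rankBits = {0, 2}` gets the partners 4, 1, and 0. Because XOR with a fixed mask is its own inverse, every rank's partner in a given step names that rank back as its own partner, so each step is a clean pairwise exchange.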
However, this PR introduces a distinct engineering and memory-management philosophy that provides an alternative, highly optimized solution:
1. Lightweight Pipeline Buffer Reuse (Zero Dynamic Allocation)
Instead of batching all $2^k-1$ `MPI_Isend`/`MPI_Irecv` requests at once (which inherently requires allocating and managing a large, dynamic staging cache like `fusedSwapSendCache` to hold all payloads concurrently), this PR uses a sequential communication pipeline.
Because we process one XOR partner at a time, we only need a strict $O(N/2^k)$ buffer size per step. We can simply split the pre-existing `qureg.cpuCommBuffer` into `send` and `recv` halves. This means no dynamic allocation (no `malloc` or `std::vector`), no global state cache to maintain, and a drastically reduced memory footprint, completely eliminating OOM risks on memory-constrained HPC environments.
2. Trade-offs in MPI Wait Latency
We acknowledge the trade-off in our pipeline approach regarding MPI latency. PR #790's approach of posting all $2^k-1$ target exchanges to MPI concurrently and doing a single `MPI_Waitall` helps to hide network latency in massive clusters. Our sequential pipeline invokes `MPI_Waitall` once per target iteration. However, because the number of rounds $2^k-1$ is generally very small (mostly $k \le 4$), we believe trading a few extra rounds of network latency for $O(1)$ additional memory and zero memory allocation overhead is a highly practical and robust choice for the CPU backend.
3. Native Compile-Time Unrolling
The buffer packing and unpacking kernels are tightly integrated with QuEST's native `SET_VAR_AT_COMPILE_TIME` macro. The inner packing loops are fully unrolled at compile time for $k \le 5$, ensuring the CPU benefits from branchless SIMD auto-vectorization.
4. Safety Fallback for GPUs
While this PR focuses strictly on the CPU distributed logic (GPU acceleration is currently out of scope for this specific PR), it safely handles GPU-accelerated Quregs by injecting `syncQuregFromGpu()` and `syncQuregToGpu()` to ensure data consistency without breaking execution.
Verification & Benchmarks
AI disclosure
This code was developed with the assistance of Google Gemini for algorithmic refactoring, performance analysis, and optimization prototyping. I provided pseudocode, have thoroughly reviewed the generated code, and have verified that the underlying communication topology and implementation details align strictly with the approaches discussed in the related scientific literature on SWAP fusion.