Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset #255
Open
gilbertlee-amd wants to merge 3 commits into ROCm:candidate
Pull request overview
This PR introduces a new GPU batched DMA executor (`B`, backed by `hipMemcpyBatchAsync` in HIP/ROCm 7.0+) and a new `bmasweep` preset to compare standard DMA against batched DMA, while also extending the DMA path to support multiple destination buffers.
Changes:
- Add `EXE_GPU_BDMA` ("B") executor support gated by HIP/ROCm 7.0+ (`hipMemcpyBatchAsync`).
- Allow DMA (and BMA) transfers to specify multiple destinations and execute the copies accordingly.
- Add a `bmasweep` preset and update docs/changelog to expose the new executor.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/header/TransferBench.hpp | Adds BDMA executor, multi-destination DMA support, BDMA execution path, and related validation/topology updates. |
| src/client/Utilities.hpp | Adds string mapping for the new executor type. |
| src/client/Presets/Presets.hpp | Registers the new bmasweep preset. |
| src/client/Presets/BmaSweep.hpp | New preset to benchmark DMA vs batched DMA across multiple destinations. |
| examples/example.cfg | Documents the new executor in the example config. |
| CHANGELOG.md | Notes the new executor/preset and related DMA behavior changes. |
Comments suppressed due to low confidence (1)
src/header/TransferBench.hpp:5834
- Wildcard expansion for executor subindices treats EXE_GPU_BDMA like GFX/DMA and iterates over GetNumExecutorSubIndices(). Since BDMA reports 0 subindices, this branch currently generates no transfers when exeSubIndices is -2, instead of recursing once with subindex -1. BDMA should likely be handled like CPU here (set -1 and recurse once).
```cpp
case EXE_GPU_GFX: case EXE_GPU_DMA: case EXE_GPU_BDMA:
  {
    // Iterate over all available subindices
    ExeDevice exeDevice = {wc.exe.exeType, wc.exe.exeIndices[0], wc.exe.exeRanks[0], 0};
    int numSubIndices = GetNumExecutorSubIndices(exeDevice);
    for (int x = 0; x < numSubIndices; x++) {
      wc.exe.exeSubIndices = {x};
      result |= RecursiveWildcardTransferExpansion(wc, baseRankIndex, numBytes, numSubExecs, transfers);
    }
    wc.exe.exeSubIndices = {-1};
    return result;
```
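The handling suggested in the comment above can be illustrated with a small standalone sketch. It does not use the real TransferBench types: `ToyExe`, `ToyNumSubIndices`, and `ExpandSubIndexWildcard` are hypothetical names, and the subindex counts are invented. It only shows the intended behavior, where executors that report zero subindices (CPU and, per the comment, BDMA) expand a wildcard to the single value -1 instead of to an empty set.

```cpp
#include <vector>

// Hypothetical stand-in for the executor types discussed in the review comment.
enum class ToyExe { Cpu, Gfx, Dma, Bdma };

// Hypothetical subindex counts. The only property taken from the review
// comment is that BDMA reports 0 subindices; the other numbers are made up.
int ToyNumSubIndices(ToyExe exe) {
  switch (exe) {
    case ToyExe::Gfx: return 8;
    case ToyExe::Dma: return 4;
    default:          return 0;
  }
}

// Returns the subindices that a wildcard subindex would expand to.
std::vector<int> ExpandSubIndexWildcard(ToyExe exe) {
  switch (exe) {
    case ToyExe::Gfx:
    case ToyExe::Dma: {
      // Enumerate every available subindex.
      std::vector<int> subIndices;
      const int numSubIndices = ToyNumSubIndices(exe);
      for (int x = 0; x < numSubIndices; x++) subIndices.push_back(x);
      return subIndices;
    }
    case ToyExe::Cpu:
    case ToyExe::Bdma:
    default:
      // No subindices to enumerate: expand once with -1 ("unspecified").
      return {-1};
  }
}
```

With this grouping, a BDMA wildcard yields exactly one expansion rather than none, which is the outcome the comment asks for.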
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
8bd9fe4 to 8bab3a2
nileshnegi approved these changes Apr 12, 2026
Motivation
This adds a new Executor (`B`) based on the `hipMemcpyBatchAsync` call that was introduced in HIP 7.0.
This new Executor supports SubExecutors, which here set how many batches a single Transfer is broken into.
This allows comparing performance against standard `hipMemcpyAsync`.
A new `bmasweep` preset is also introduced to compare the two versions.
Technical Details
This change also allows more than one destination when using the DMA and Batched DMA (BMA) Executors.
When multiple destinations are provided, DMA performs the copies one after another, while Batched DMA issues them all as separate batches (respecting the number of SubExecutors).
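As a rough illustration of the batching described above, the following standalone sketch builds a list of copy chunks for a multi-destination transfer. It is not the PR's implementation and makes no HIP calls: `CopyChunk` and `PlanBatchedCopies` are hypothetical names, and the even-split policy (each destination's copy divided into `numSubExecs` contiguous chunks) is an assumption about how "respecting number of SubExecutors" might be realized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry of a hypothetical batch: copy numBytes starting at offset
// from the source into destination dstIndex.
struct CopyChunk {
  int         dstIndex;
  std::size_t offset;
  std::size_t numBytes;
};

// Assumed policy: every destination receives the full numBytes, split into
// numSubExecs contiguous chunks whose sizes differ by at most one byte.
std::vector<CopyChunk> PlanBatchedCopies(std::size_t numBytes, int numDsts, int numSubExecs) {
  assert(numDsts > 0 && numSubExecs > 0);
  std::vector<CopyChunk> plan;
  const std::size_t n    = static_cast<std::size_t>(numSubExecs);
  const std::size_t base = numBytes / n;  // minimum chunk size
  const std::size_t rem  = numBytes % n;  // first rem chunks get one extra byte
  for (int d = 0; d < numDsts; d++) {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < n; i++) {
      const std::size_t len = base + (i < rem ? 1 : 0);
      if (len == 0) continue;  // skip empty chunks when numBytes < numSubExecs
      plan.push_back({d, offset, len});
      offset += len;
    }
  }
  return plan;
}
```

Under these assumptions, a plain DMA path would walk the destinations and issue one copy each in sequence, whereas a batched path would hand all entries of the plan to a single batched call.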
Test Result
Here are results from `bmasweep` on an MI355X: