Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset #255

Open
gilbertlee-amd wants to merge 3 commits into ROCm:candidate from gilbertlee-amd:BmaExecutor

@gilbertlee-amd
Collaborator

Motivation

This adds a new Executor (B) based on the hipMemcpyBatchAsync call introduced in HIP 7.0.
The new Executor supports SubExecutors, which here set how many batches a single Transfer is broken into.
This allows its performance to be compared against standard hipMemcpyAsync.
A new bmasweep preset is also introduced to compare the two.

Technical Details

This new code also enables more than one destination when using the DMA and Batched DMA (BMA) Executors.
When multiple destinations are provided, DMA performs the copies one after another, while Batched DMA issues them all as separate batches (respecting the number of SubExecutors).
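
As a rough illustration of the batching described above, the sketch below splits one Transfer into per-destination chunks, one chunk per SubExecutor. It is a host-only model and not the code in this PR: `CopyChunk`, `PlanBatches`, and the even split with the remainder spread over the first chunks are all assumptions made for illustration. How TransferBench actually divides bytes between batches is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one copy inside a batch (not a HIP or TransferBench type)
struct CopyChunk {
  int    dstIndex;  // which destination buffer this chunk targets
  size_t offset;    // byte offset into both source and destination
  size_t numBytes;  // size of this chunk
};

// Assumed policy: each destination is split into numSubExecs contiguous chunks,
// with any remainder bytes spread over the first chunks.
std::vector<CopyChunk> PlanBatches(size_t numBytes, int numDsts, int numSubExecs)
{
  assert(numDsts > 0 && numSubExecs > 0);
  std::vector<CopyChunk> plan;
  size_t const base  = numBytes / numSubExecs;
  size_t const extra = numBytes % numSubExecs;
  for (int d = 0; d < numDsts; d++) {
    size_t offset = 0;
    for (int s = 0; s < numSubExecs; s++) {
      size_t const len = base + (static_cast<size_t>(s) < extra ? 1 : 0);
      if (len == 0) continue;  // fewer bytes than SubExecutors: skip empty chunks
      plan.push_back({d, offset, len});
      offset += len;
    }
  }
  return plan;
}
```

Under these assumptions, `PlanBatches(4096, 2, 4)` yields eight chunks of 1024 bytes, four per destination. In a real executor each chunk would become one entry in the batched copy call.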

Test Result

Here are results from bmasweep on an MI355X:

[BMA Sweep Related]
EXE_INDEX            =            0 : Executing on GPU 0
LOCAL_COPY           =            0 : Excluding local copy to GPU 0
GPU_MEM_TYPE         =            0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            8 : Using 8 GPUs
NUM_SUB_EXECS        =            4 : 1,2,4,8

Performing 7 simultaneous DMA Transfers from GPU 0 other GPUs
Executing: ...................
┌------------┬--------┬---------------------------------------┐
│      Bytes │    DMA │ BMA (1)   BMA (2)   BMA (4)   BMA (8) │
├------------┼--------┼---------------------------------------┤
│       4096 │   0.68 │    0.67      0.35      0.18      0.09 │
│       8192 │   1.33 │    1.34      0.70      0.36      0.18 │
│      16384 │   2.61 │    2.68      1.41      0.72      0.35 │
│      32768 │   5.42 │    5.36      2.98      1.49      0.76 │
│      65536 │  10.52 │   10.54      5.53      3.01      1.52 │
│     131072 │  20.18 │   19.84     11.18      5.63      3.04 │
│     262144 │  33.90 │   32.19     21.52     11.21      5.67 │
│     524288 │  44.86 │   43.72     34.54     21.98     11.40 │
│    1048576 │  50.01 │   49.64     45.43     35.59     21.95 │
│    2097152 │  47.99 │   48.07     50.96     46.41     35.78 │
│    4194304 │  53.64 │   53.66     48.59     51.27     46.71 │
│    8388608 │  57.09 │   57.07     53.65     48.10     51.54 │
│   16777216 │  59.11 │   59.11     57.22     53.66     48.16 │
│   33554432 │  60.10 │   59.97     59.11     57.10     53.74 │
│   67108864 │  57.39 │   60.65     60.11     59.10     57.18 │
│  134217728 │  60.95 │   60.82     60.67     60.13     59.13 │
│  268435456 │  61.10 │   61.10     60.83     60.64     60.07 │
│  536870912 │  61.19 │   59.33     61.10     60.95     60.54 │
│ 1073741824 │  61.22 │   61.22     61.18     60.41     60.73 │
└------------┴--------┴---------------------------------------┘
Reported numbers are all GB/s, normalized for per Transfer for 7 Transfers

gilbertlee-amd requested review from a team as code owners April 10, 2026 22:17
nileshnegi requested a review from Copilot April 10, 2026 22:19
gilbertlee-amd review requested due to automatic review settings April 10, 2026 22:20
nileshnegi requested a review from Copilot April 10, 2026 23:41

Copilot AI left a comment

Pull request overview

This PR introduces a new GPU batched DMA executor (B, backed by hipMemcpyBatchAsync in HIP/ROCm 7.0+) and a new bmasweep preset to compare standard DMA vs batched DMA, while also extending the DMA path to support multiple destination buffers.

Changes:

  • Add EXE_GPU_BDMA (“B”) executor support gated by HIP/ROCm 7.0+ (hipMemcpyBatchAsync).
  • Allow DMA (and BMA) transfers to specify multiple destinations and execute the copies accordingly.
  • Add a bmasweep preset and update docs/changelog to expose the new executor.
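
A minimal sketch of what version gating of this kind can look like, assuming the guard keys off `HIP_VERSION_MAJOR` from `hip/hip_version.h`. The guard the PR actually uses is not shown here, and `kHipMajorVersion` and `BdmaExecutorAvailable` are hypothetical names:

```cpp
#include <cassert>

// Pull in HIP's version macros only when the header is present (C++17 __has_include)
#if defined(__has_include)
#  if __has_include(<hip/hip_version.h>)
#    include <hip/hip_version.h>
#  endif
#endif

// HIP_VERSION_MAJOR is defined by hip_version.h; fall back to 0 when HIP is absent
#if defined(HIP_VERSION_MAJOR)
constexpr int kHipMajorVersion = HIP_VERSION_MAJOR;
#else
constexpr int kHipMajorVersion = 0;
#endif

// Hypothetical helper: true only when the batched-copy path could be compiled in
constexpr bool BdmaExecutorAvailable() { return kHipMajorVersion >= 7; }
```

On a build without HIP headers this reports the executor as unavailable, so a client could reject the "B" executor with a clear message instead of failing to compile.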

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| src/header/TransferBench.hpp | Adds BDMA executor, multi-destination DMA support, BDMA execution path, and related validation/topology updates. |
| src/client/Utilities.hpp | Adds string mapping for the new executor type. |
| src/client/Presets/Presets.hpp | Registers the new bmasweep preset. |
| src/client/Presets/BmaSweep.hpp | New preset to benchmark DMA vs batched DMA across multiple destinations. |
| examples/example.cfg | Documents the new executor in the example config. |
| CHANGELOG.md | Notes the new executor/preset and related DMA behavior changes. |

Comments suppressed due to low confidence (1)

src/header/TransferBench.hpp:5834

  • Wildcard expansion for executor subindices treats EXE_GPU_BDMA like GFX/DMA and iterates over GetNumExecutorSubIndices(). Since BDMA reports 0 subindices, this branch currently generates no transfers when exeSubIndices is -2, instead of recursing once with subindex -1. BDMA should likely be handled like CPU here (set -1 and recurse once).
      case EXE_GPU_GFX: case EXE_GPU_DMA: case EXE_GPU_BDMA:
      {
        // Iterate over all available subindices
        ExeDevice exeDevice = {wc.exe.exeType, wc.exe.exeIndices[0], wc.exe.exeRanks[0], 0};
        int numSubIndices = GetNumExecutorSubIndices(exeDevice);
        for (int x = 0; x < numSubIndices; x++) {
          wc.exe.exeSubIndices = {x};
          result |= RecursiveWildcardTransferExpansion(wc, baseRankIndex, numBytes, numSubExecs, transfers);
        }
        wc.exe.exeSubIndices = {-1};
        return result;
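
One way to picture the suggested handling is the standalone sketch below. It models only the choice of which subindex values to recurse with. `SubIndicesToExpand` is a hypothetical helper, not a TransferBench function, and the real fix would live inside the wildcard expansion itself.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the expansion step: returns the subindex values
// that the wildcard branch would recurse with.
std::vector<int> SubIndicesToExpand(int numSubIndices)
{
  // An executor that reports no subindices (as the comment says BDMA does)
  // still needs one expansion, using -1 as the subindex.
  if (numSubIndices <= 0) return {-1};

  // Otherwise expand once per available subindex, as the GFX/DMA branch does
  std::vector<int> subIndices;
  for (int x = 0; x < numSubIndices; x++) subIndices.push_back(x);
  return subIndices;
}
```

With this shape, an executor reporting zero subindices still produces exactly one recursion instead of none.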



Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

