Skip to content

Latency Testing#252

Draft
AtlantaPepsi wants to merge 3 commits intoROCm:candidatefrom
AtlantaPepsi:pingpong
Draft

Latency Testing#252
AtlantaPepsi wants to merge 3 commits intoROCm:candidatefrom
AtlantaPepsi:pingpong

Conversation

@AtlantaPepsi
Copy link
Copy Markdown
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a ping-pong (“latency testing”) mode to TransferBench, including parsing a ping/pong pair syntax and reporting per-lap latency in the client output.

Changes:

  • Extend Transfer / config options to support ping-pong laps and a configurable flag stride.
  • Implement ping-pong execution behavior across CPU/GFX/DMA/NIC paths, including flag allocation/offsetting and timing adjustments.
  • Update client result tables and environment variables to display latency-oriented ping-pong results and configure stride.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File Description
src/header/TransferBench.hpp Adds ping-pong transfer fields, parsing, resource linking, flag wait/slot logic, execution changes across executors, and iteration timing adjustments.
src/client/Utilities.hpp Updates result rendering to hide pong rows and show ping-pong latency rows/iteration details.
src/client/EnvVars.hpp Adds PINGPONG_STRIDE env var and wires it into cfg.general.pingpongStride.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

int32_t exeSubSlot = 0; ///< Executor subslot
int numSubExecs = 0; ///< Number of subExecutors to use for this Transfer
int laps = 0; ///< 0 = one-way transfer, >0 = ping (forward), <0 = pong (backward). abs(laps) = lap count
MemDevice flag = {}; ///< Memory device for pingpong signal flag (only used when laps != 0)
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transfer::flag is default-initialized with {}, which sets memType to MEM_CPU (0) rather than MEM_NULL. For pingpong transfers created via other code paths (not the parser), this makes it impossible to reliably detect “flag not set” and can cause the pingpong validation to incorrectly pass. Initialize flag.memType to MEM_NULL by default (or add an explicit boolean/optional to indicate presence).

Suggested change
MemDevice flag = {}; ///< Memory device for pingpong signal flag (only used when laps != 0)
MemDevice flag = {MEM_NULL, 0, 0}; ///< Memory device for pingpong signal flag (only used when laps != 0)

Copilot uses AI. Check for mistakes.

// Pingpong flag checks
if (t.laps != 0) {
if (t.flag.memType == MEM_NULL) {
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pingpong validation only checks t.flag.memType == MEM_NULL and memRank, but it never validates cfg.general.pingpongStride (positive, multiple-of-8) even though later code assumes 8-byte alignment for atomic loads and uses the stride in modulo/address arithmetic. Add explicit validation for pingpongStride (and ideally validate it against lap count / allocation size) so invalid values don’t lead to misalignment, slot reuse, or out-of-bounds writes.

Suggested change
if (t.flag.memType == MEM_NULL) {
if (cfg.general.pingpongStride <= 0) {
errors.push_back({ERR_FATAL,
"Transfer %d: Pingpong transfer requires pingpongStride to be positive (instead of %d)",
i, cfg.general.pingpongStride});
hasFatalError = true;
} else if ((cfg.general.pingpongStride % 8) != 0) {
errors.push_back({ERR_FATAL,
"Transfer %d: Pingpong transfer requires pingpongStride to be a multiple of 8 bytes (instead of %d)",
i, cfg.general.pingpongStride});
hasFatalError = true;
} else if (t.flag.memType == MEM_NULL) {

Copilot uses AI. Check for mistakes.
Comment on lines +3943 to +3944
if (t.laps != 0)
dstAllocBytes = std::max(dstAllocBytes, std::min((size_t)1024, (size_t)8 * abs(t.laps)));
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For pingpong transfers, dstAllocBytes is derived from 8 * abs(t.laps) but does not incorporate cfg.general.pingpongStride. If pingpongStride is larger than 8, multiple laps will alias the same flag slot (and stale non-zero values will cause waits to pass immediately), which breaks ordering/latency measurement. Size the allocation (and the wrap-around size used by FlagSlot) based on pingpongStride * abs(laps) (or store/compare a per-lap sequence value instead of waiting on “non-zero”).

Suggested change
if (t.laps != 0)
dstAllocBytes = std::max(dstAllocBytes, std::min((size_t)1024, (size_t)8 * abs(t.laps)));
if (t.laps != 0) {
size_t const pingpongLaps = static_cast<size_t>(t.laps < 0 ? -static_cast<int64_t>(t.laps)
: static_cast<int64_t>(t.laps));
size_t const pingpongBytes = static_cast<size_t>(cfg.general.pingpongStride) * pingpongLaps;
dstAllocBytes = std::max(dstAllocBytes, pingpongBytes);
}

Copilot uses AI. Check for mistakes.
Comment on lines +5373 to +5380
// Cross-link flagMem for pingpong pairs (ping and pong are consecutive in transfers)
for (size_t i = 0; i + 1 < transfers.size(); i++) {
if (transfers[i].laps > 0 && transfers[i+1].laps < 0) {
auto pingRss = transferResources[i];
auto pongRss = transferResources[i + 1];
pingRss->flagMem = pongRss->dstMem[0];
pongRss->flagMem = pingRss->dstMem[0];
int stride = cfg.general.pingpongStride;
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block assumes transferResources[i] corresponds to transfers[i], but transferResources is built by iterating executorMap and pushing each executor’s resources, so its order is by executor (map key), not by transferIdx. As a result, ping/pong pairing may link the wrong resources (wrong dstMem[0]), causing incorrect synchronization or crashes. Build an indexable structure keyed by transferIdx (e.g., std::vector<TransferResources*> byTransferIdx(transfers.size())) and use that for pingpong cross-linking.

Copilot uses AI. Check for mistakes.
Comment on lines +5491 to +5497
// Zero flag memory for pingpong transfers so each lap's slot starts at 0
for (size_t i = 0; i + 1 < transfers.size(); i++) {
if (transfers[i].laps > 0 && transfers[i+1].laps < 0) {
size_t allocBytes = std::min((size_t)1024, (size_t)8 * abs(transfers[i].laps));
for (int k = 0; k < 2; k++) {
void* dst = transferResources[i + k]->dstMem[0];
if (!dst) continue;
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The per-iteration flag-zeroing loop also indexes transferResources[i + k], inheriting the same ordering bug as the pingpong cross-linking logic (resources are not guaranteed to be in transfer-index order). Use a mapping by transferIdx here as well to ensure the correct ping/pong buffers are cleared.

Copilot uses AI. Check for mistakes.
Comment on lines +5498 to +5503
MemType mt = transfers[i + k].dsts[0].memType;
if (mt == MEM_CPU || mt == MEM_NULL)
memset(dst, 0, allocBytes);
else
hipMemset(dst, 0, allocBytes);
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For GPU flag buffers, this uses hipMemset(dst, ...) without selecting the correct device for dst and without checking the return code (unlike other HIP calls in this file). This can fail or write to the wrong device depending on the current device, and failures will be silently ignored. Set the device based on the flag buffer’s memIndex and propagate/append the HIP error (and ensure the memset completes before launching transfers if required).

Copilot uses AI. Check for mistakes.
{
// Replace any round brackets or '->' with spaces,
// Replace round brackets, '->', and ':' with spaces, but preserve '+'
for (int i = 1; line[i]; i++)
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (int i = 1; line[i]; i++) reads line[1] unconditionally, which is out-of-bounds for empty strings or 1-character lines (e.g., blank lines in a config file). Guard on line.size() > 1 and iterate using line.size() rather than relying on a null terminator.

Suggested change
for (int i = 1; line[i]; i++)
for (std::string::size_type i = 1; i < line.size(); i++)

Copilot uses AI. Check for mistakes.
Comment on lines +586 to +602
Transfer const& pong = transfers[idx + 1];
if (isMultiRank) {
table.Set(rowIdx, 4, " %s->R%d%c%d->%s <+> %s->R%d%c%d->%s",
MemDevicesToStr(t.srcs).c_str(),
exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
MemDevicesToStr(t.dsts).c_str(),
MemDevicesToStr(pong.srcs).c_str(),
pong.exeDevice.exeRank, ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,
MemDevicesToStr(pong.dsts).c_str());
} else {
table.Set(rowIdx, 4, " %s->%c%d->%s <+> %s->%c%d->%s",
MemDevicesToStr(t.srcs).c_str(),
ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
MemDevicesToStr(t.dsts).c_str(),
MemDevicesToStr(pong.srcs).c_str(),
ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,
MemDevicesToStr(pong.dsts).c_str());
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transfer const& pong = transfers[idx + 1]; assumes the pong half always exists immediately after the ping index. If a ping transfer is last in the list, or transfer ordering differs (e.g., future parser changes / manual API usage), this will read out of bounds. Add a bounds/consistency check (e.g., idx + 1 < transfers.size() and transfers[idx+1].laps < 0) before dereferencing.

Suggested change
Transfer const& pong = transfers[idx + 1];
if (isMultiRank) {
table.Set(rowIdx, 4, " %s->R%d%c%d->%s <+> %s->R%d%c%d->%s",
MemDevicesToStr(t.srcs).c_str(),
exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
MemDevicesToStr(t.dsts).c_str(),
MemDevicesToStr(pong.srcs).c_str(),
pong.exeDevice.exeRank, ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,
MemDevicesToStr(pong.dsts).c_str());
} else {
table.Set(rowIdx, 4, " %s->%c%d->%s <+> %s->%c%d->%s",
MemDevicesToStr(t.srcs).c_str(),
ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
MemDevicesToStr(t.dsts).c_str(),
MemDevicesToStr(pong.srcs).c_str(),
ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,
MemDevicesToStr(pong.dsts).c_str());
bool const hasPong = (static_cast<size_t>(idx + 1) < transfers.size()) && (transfers[idx + 1].laps < 0);
if (hasPong) {
Transfer const& pong = transfers[idx + 1];
if (isMultiRank) {
table.Set(rowIdx, 4, " %s->R%d%c%d->%s <+> %s->R%d%c%d->%s",
MemDevicesToStr(t.srcs).c_str(),
exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
MemDevicesToStr(t.dsts).c_str(),
MemDevicesToStr(pong.srcs).c_str(),
pong.exeDevice.exeRank, ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,
MemDevicesToStr(pong.dsts).c_str());
} else {
table.Set(rowIdx, 4, " %s->%c%d->%s <+> %s->%c%d->%s",
MemDevicesToStr(t.srcs).c_str(),
ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
MemDevicesToStr(t.dsts).c_str(),
MemDevicesToStr(pong.srcs).c_str(),
ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,
MemDevicesToStr(pong.dsts).c_str());
}
} else {
if (isMultiRank) {
table.Set(rowIdx, 4, " %s->R%d%c%d->%s",
MemDevicesToStr(t.srcs).c_str(),
exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
MemDevicesToStr(t.dsts).c_str());
} else {
table.Set(rowIdx, 4, " %s->%c%d->%s",
MemDevicesToStr(t.srcs).c_str(),
ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
MemDevicesToStr(t.dsts).c_str());
}

Copilot uses AI. Check for mistakes.
if (transfers[idx].laps >= 0) displayCount++;
numRows += 1 + displayCount;
if (ev.showIterations) {
numRows += (numTimedIterations + 1);
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When showIterations is enabled, numRows is only incremented by (numTimedIterations + 1) once per executor, but the rendering logic appends per-iteration rows for each displayed transfer. Because TableHelper::Set silently ignores out-of-bounds writes, this can truncate iteration output whenever an executor has multiple transfers. Update numRows to account for (numTimedIterations + 1) per displayed transfer (regular + pingpong).

Suggested change
numRows += (numTimedIterations + 1);
numRows += displayCount * (numTimedIterations + 1);

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants