Skip to content

Support TBG inside Pipeline#783

Open
caiomcbr wants to merge 5 commits intomainfrom
caiorocha/support_tbg_pipeline
Open

Support TBG inside Pipeline#783
caiomcbr wants to merge 5 commits intomainfrom
caiorocha/support_tbg_pipeline

Conversation

@caiomcbr
Copy link
Copy Markdown
Contributor

No description provided.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for thread-block-group (TBG) slicing within PIPELINE execution by carrying TBG metadata (tbId, tbgSize) in each Operation and applying per-slice offsets/sizes inside the device execution handlers.

Changes:

  • Extend Operation with tbId / tbgSize and populate them from tbg_info in the JSON execution plan.
  • Add device-side calcOffset/calcSize and apply TBG slicing across multiple kernel handlers (GET/PUT/COPY/REDUCE/packet ops).
  • Remove host-side per-TBG pre-slicing in ExecutionPlan::Impl::setupOperation (offsets/sizes remain “full”, slicing moves to device).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
src/core/include/execution_kernel.hpp Implements device-side TBG slicing and applies it in multiple operation handlers, including within PIPELINE iterations.
src/core/include/execution_common.hpp Adds tbId / tbgSize fields to Operation so the device kernel can apply TBG slicing.
src/core/executor/execution_plan.cc Parses tbg_info from JSON and stores it into Operation instead of pre-slicing offsets/sizes on the host.

Comment on lines +69 to +80
MSCCLPP_DEVICE_INLINE uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
constexpr uint32_t alignment = 16;
uint32_t nelems = size / alignment;
uint32_t minNelems = nelems / slices;
uint32_t remainder = nelems % slices;
uint32_t off = index * minNelems + (index < remainder ? index : remainder);
return off * alignment;
}

MSCCLPP_DEVICE_INLINE uint32_t calcSize(uint32_t size, uint32_t index, uint32_t slices) {
return calcOffset(size, index + 1, slices) - calcOffset(size, index, slices);
}
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calcOffset/calcSize compute nelems = size / 16, which drops any remaining bytes (<16). In pipeline mode this can silently skip data whenever unitSize (or the remaining bytes in the last iteration) isn’t a multiple of 16, and can even make calcSize() return 0 for size < 16.

Consider either (a) enforcing/validating that unitSize and all buffer sizes processed here are multiples of 16, or (b) updating the slicing logic to distribute the tail bytes so the full size is covered across slices.

Copilot uses AI. Check for mistakes.
Comment on lines +69 to +75
MSCCLPP_DEVICE_INLINE uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) {
constexpr uint32_t alignment = 16;
uint32_t nelems = size / alignment;
uint32_t minNelems = nelems / slices;
uint32_t remainder = nelems % slices;
uint32_t off = index * minNelems + (index < remainder ? index : remainder);
return off * alignment;
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calcOffset hardcodes alignment = 16, but ExecutionPlan supports a configurable buffer_alignment (default 16). If a plan sets buffer_alignment != 16, host-side offsets/sizes are computed on that alignment while device-side TBG slicing is computed on 16B units, which can produce inconsistent slicing and misaligned accesses.

Consider plumbing the plan’s bufferAlignment to device code (e.g., store it in DeviceExecutionPlan/Operation) or explicitly enforcing buffer_alignment == 16 when tbg_info is used.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants