Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds support for thread-block-group (TBG) slicing within PIPELINE execution by carrying TBG metadata (tbId, tbgSize) in each Operation and applying per-slice offsets/sizes inside the device execution handlers.
Changes:
- Extend
OperationwithtbId/tbgSizeand populate them fromtbg_infoin the JSON execution plan. - Add device-side
calcOffset/calcSizeand apply TBG slicing across multiple kernel handlers (GET/PUT/COPY/REDUCE/packet ops). - Remove host-side per-TBG pre-slicing in
ExecutionPlan::Impl::setupOperation(offsets/sizes remain “full”, slicing moves to device).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
src/core/include/execution_kernel.hpp |
Implements device-side TBG slicing and applies it in multiple operation handlers, including within PIPELINE iterations. |
src/core/include/execution_common.hpp |
Adds tbId / tbgSize fields to Operation so the device kernel can apply TBG slicing. |
src/core/executor/execution_plan.cc |
Parses tbg_info from JSON and stores it into Operation instead of pre-slicing offsets/sizes on the host. |
| MSCCLPP_DEVICE_INLINE uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) { | ||
| constexpr uint32_t alignment = 16; | ||
| uint32_t nelems = size / alignment; | ||
| uint32_t minNelems = nelems / slices; | ||
| uint32_t remainder = nelems % slices; | ||
| uint32_t off = index * minNelems + (index < remainder ? index : remainder); | ||
| return off * alignment; | ||
| } | ||
|
|
||
| MSCCLPP_DEVICE_INLINE uint32_t calcSize(uint32_t size, uint32_t index, uint32_t slices) { | ||
| return calcOffset(size, index + 1, slices) - calcOffset(size, index, slices); | ||
| } |
There was a problem hiding this comment.
calcOffset/calcSize compute nelems = size / 16, which drops any remaining bytes (<16). In pipeline mode this can silently skip data whenever unitSize (or the remaining bytes in the last iteration) isn’t a multiple of 16, and can even make calcSize() return 0 for size < 16.
Consider either (a) enforcing/validating that unitSize and all buffer sizes processed here are multiples of 16, or (b) updating the slicing logic to distribute the tail bytes so the full size is covered across slices.
| MSCCLPP_DEVICE_INLINE uint32_t calcOffset(uint32_t size, uint32_t index, uint32_t slices) { | ||
| constexpr uint32_t alignment = 16; | ||
| uint32_t nelems = size / alignment; | ||
| uint32_t minNelems = nelems / slices; | ||
| uint32_t remainder = nelems % slices; | ||
| uint32_t off = index * minNelems + (index < remainder ? index : remainder); | ||
| return off * alignment; |
There was a problem hiding this comment.
calcOffset hardcodes alignment = 16, but ExecutionPlan supports a configurable buffer_alignment (default 16). If a plan sets buffer_alignment != 16, host-side offsets/sizes are computed on that alignment while device-side TBG slicing is computed on 16B units, which can produce inconsistent slicing and misaligned accesses.
Consider plumbing the plan’s bufferAlignment to device code (e.g., store it in DeviceExecutionPlan/Operation) or explicitly enforcing buffer_alignment == 16 when tbg_info is used.
No description provided.