Port distributed comm abilities to PR444#522
Port distributed comm abilities to PR444#522PKUZHOU wants to merge 2 commits intohw-native-sys:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a distributed communication framework for the PTO runtime, enabling multi-rank execution on A2/A3 hardware via HCCL and on simulation via POSIX shared memory. Key features include RDMA window management, asynchronous task completion using completion queues, and per-chip bootstrap channels, along with new distributed examples and updated Python tooling. Feedback highlights a race condition in the simulation barrier implementation, missing error handling for shared memory allocation timeouts, and an issue where profiling is hardcoded to false for deferred tasks. Additionally, the reviewer noted that querying dlerror when loading optional symbols is unnecessary.
| int arrived = __atomic_add_fetch(&hdr->barrier_count, 1, __ATOMIC_ACQ_REL); | ||
|
|
||
| if (arrived == h->nranks) { | ||
| __atomic_store_n(&hdr->barrier_count, 0, __ATOMIC_RELEASE); | ||
| __atomic_add_fetch(&hdr->barrier_phase, 1, __ATOMIC_ACQ_REL); | ||
| } else { | ||
| while (__atomic_load_n(&hdr->barrier_phase, __ATOMIC_ACQUIRE) == phase) { | ||
| usleep(50); | ||
| } | ||
| } |
There was a problem hiding this comment.
The barrier implementation in comm_barrier contains a race condition. The last rank to arrive resets barrier_count to 0 (line 177) before incrementing barrier_phase (line 178). If another rank (e.g., Rank B) exits the loop, returns from the function, and immediately enters the next barrier, it will increment barrier_count. If the last rank (Rank A) from the previous barrier then executes line 177, it will overwrite Rank B's increment, causing the next barrier to hang or behave incorrectly. A robust sense-reversing barrier should avoid resetting the shared counter in this manner or use multiple counters.
| T load_optional_symbol(void *handle, const char *name) { | ||
| dlerror(); | ||
| void *sym = dlsym(handle, name); | ||
| const char *err = dlerror(); | ||
| if (err) return nullptr; | ||
| return reinterpret_cast<T>(sym); | ||
| } |
There was a problem hiding this comment.
The implementation of load_optional_symbol unnecessarily queries dlerror(). According to the general rules for this repository, when looking up an optional symbol with dlsym, checking the return value for NULL is the correct approach. The absence of the symbol is intentional and should not be treated as an error by querying dlerror(), which can also be less efficient.
template <typename T>
T load_optional_symbol(void *handle, const char *name) {
return reinterpret_cast<T>(dlsym(handle, name));
}References
- When using dlsym to look up an optional symbol, checking the return value for NULL is the correct approach. Do not treat a NULL return as an error by checking dlerror(), as the absence of the symbol is intentional for reasons like backward compatibility.
| for (int i = 0; i < 5000; ++i) { | ||
| struct stat st; | ||
| if (fstat(fd, &st) == 0 && static_cast<size_t>(st.st_size) >= total) break; | ||
| usleep(1000); | ||
| } |
There was a problem hiding this comment.
In comm_alloc_windows, if the polling loop for the shared memory file size times out, the function continues to mmap (line 111) without verifying if the file was successfully initialized. This could lead to mapping an incomplete region or a subsequent crash (e.g., SIGBUS). Following repository guidelines, return values of resource allocation and setup functions like fstat should be checked, and failures handled gracefully. The function should return an error if the timeout is reached.
bool ok = false;
for (int i = 0; i < 5000; ++i) {
struct stat st;
if (fstat(fd, &st) == 0 && static_cast<size_t>(st.st_size) >= total) {
ok = true;
break;
}
usleep(1000);
}
if (!ok) {
std::fprintf(stderr, "comm_sim: timeout waiting for shm initialization\n");
close(fd);
return -1;
}References
- Always check the return values of resource allocation functions (like pthread_key_create) and handle potential failures gracefully, for example, by rolling back partial allocations.
| PTO2AsyncPollResult poll_result = async_wait_list.poll_and_complete<false>( | ||
| &rt->scheduler, local_bufs, | ||
| deferred_release_slot_states, deferred_release_count, MAX_DEFERRED_RELEASES | ||
| #if PTO2_SCHED_PROFILING | ||
| , thread_idx | ||
| #endif | ||
| ); |
There was a problem hiding this comment.
The call to poll_and_complete hardcodes the Profiling template parameter to false. This means that deferred-completion tasks will not be correctly profiled even when profiling_enabled is true. Additionally, the Profiling template parameter in poll_and_complete (and pto2_complete_task) appears to be unused in the current implementation, as the logic relies on the PTO2_SCHED_PROFILING macro instead. Consider removing the unused template parameter and ensuring profiling is correctly handled for async tasks.
No description provided.