Skip to content

Port distributed comm abilities to PR444#522

Open
PKUZHOU wants to merge 2 commits intohw-native-sys:mainfrom
PKUZHOU:pr_444_async
Open

Port distributed comm abilities to PR444#522
PKUZHOU wants to merge 2 commits intohw-native-sys:mainfrom
PKUZHOU:pr_444_async

Conversation

@PKUZHOU
Copy link
Copy Markdown

@PKUZHOU PKUZHOU commented Apr 12, 2026

No description provided.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a distributed communication framework for the PTO runtime, enabling multi-rank execution on A2/A3 hardware via HCCL and on simulation via POSIX shared memory. Key features include RDMA window management, asynchronous task completion using completion queues, and per-chip bootstrap channels, along with new distributed examples and updated Python tooling. Feedback highlights a race condition in the simulation barrier implementation, missing error handling for shared memory allocation timeouts, and an issue where profiling is hardcoded to false for deferred tasks. Additionally, the reviewer noted that querying dlerror when loading optional symbols is unnecessary.

Comment on lines +174 to +183
int arrived = __atomic_add_fetch(&hdr->barrier_count, 1, __ATOMIC_ACQ_REL);

if (arrived == h->nranks) {
__atomic_store_n(&hdr->barrier_count, 0, __ATOMIC_RELEASE);
__atomic_add_fetch(&hdr->barrier_phase, 1, __ATOMIC_ACQ_REL);
} else {
while (__atomic_load_n(&hdr->barrier_phase, __ATOMIC_ACQUIRE) == phase) {
usleep(50);
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The barrier implementation in comm_barrier contains a race condition. The last rank to arrive resets barrier_count to 0 (line 177) before incrementing barrier_phase (line 178). If another rank (e.g., Rank B) exits the loop, returns from the function, and immediately enters the next barrier, it will increment barrier_count. If the last rank (Rank A) from the previous barrier then executes line 177, it will overwrite Rank B's increment, causing the next barrier to hang or behave incorrectly. A robust sense-reversing barrier should avoid resetting the shared counter in this manner or use multiple counters.

Comment on lines +40 to +46
T load_optional_symbol(void *handle, const char *name) {
dlerror();
void *sym = dlsym(handle, name);
const char *err = dlerror();
if (err) return nullptr;
return reinterpret_cast<T>(sym);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The implementation of load_optional_symbol unnecessarily queries dlerror(). According to the general rules for this repository, when looking up an optional symbol with dlsym, checking the return value for NULL is the correct approach. The absence of the symbol is intentional and should not be treated as an error by querying dlerror(), which can also be less efficient.

template <typename T>
T load_optional_symbol(void *handle, const char *name) {
    return reinterpret_cast<T>(dlsym(handle, name));
}
References
  1. When using dlsym to look up an optional symbol, checking the return value for NULL is the correct approach. Do not treat a NULL return as an error by checking dlerror(), as the absence of the symbol is intentional for reasons like backward compatibility.

Comment on lines +101 to +105
for (int i = 0; i < 5000; ++i) {
struct stat st;
if (fstat(fd, &st) == 0 && static_cast<size_t>(st.st_size) >= total) break;
usleep(1000);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In comm_alloc_windows, if the polling loop for the shared memory file size times out, the function continues to mmap (line 111) without verifying if the file was successfully initialized. This could lead to mapping an incomplete region or a subsequent crash (e.g., SIGBUS). Following repository guidelines, return values of resource allocation and setup functions like fstat should be checked, and failures handled gracefully. The function should return an error if the timeout is reached.

        bool ok = false;
        for (int i = 0; i < 5000; ++i) {
            struct stat st;
            if (fstat(fd, &st) == 0 && static_cast<size_t>(st.st_size) >= total) {
                ok = true;
                break;
            }
            usleep(1000);
        }
        if (!ok) {
            std::fprintf(stderr, "comm_sim: timeout waiting for shm initialization\n");
            close(fd);
            return -1;
        }
References
  1. Always check the return values of resource allocation functions (like pthread_key_create) and handle potential failures gracefully, for example, by rolling back partial allocations.

Comment on lines +1411 to +1417
PTO2AsyncPollResult poll_result = async_wait_list.poll_and_complete<false>(
&rt->scheduler, local_bufs,
deferred_release_slot_states, deferred_release_count, MAX_DEFERRED_RELEASES
#if PTO2_SCHED_PROFILING
, thread_idx
#endif
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The call to poll_and_complete hardcodes the Profiling template parameter to false. This means that deferred-completion tasks will not be correctly profiled even when profiling_enabled is true. Additionally, the Profiling template parameter in poll_and_complete (and pto2_complete_task) appears to be unused in the current implementation, as the logic relies on the PTO2_SCHED_PROFILING macro instead. Consider removing the unused template parameter and ensuring profiling is correctly handled for async tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant