feat(slurm): add support for experimental async RPC reply mode#5680
feat(slurm): add support for experimental async RPC reply mode#5680AdarshK15 wants to merge 6 commits into
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces support for the experimental Asynchronous Reply Mode in the Slurm controller. By enabling this feature, the controller can offload socket connections to the kernel network stack, effectively mitigating thread starvation and deadlocks caused by slow clients. The changes include updates to Terraform variables, dynamic configuration generation logic, and comprehensive test coverage to ensure compatibility across different Slurm versions. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces support for Slurm version 25.11 and adds an experimental configuration variable to the Slurm controller module, which currently supports enabling enable_async_reply. The changes involve creating a new version-specific configuration file conf_v2505.py, updating version-checking logic in setup and synchronization scripts, and expanding the test suite. Review feedback suggests refactoring conf_v2505.py to reduce code duplication with conf.py and identifies several opportunities to simplify the test code by removing unnecessary mocks and redundant variable assignments.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces an experimental configuration variable to the Slurm controller, enabling the enable_async_reply feature for Slurm version 25.11. The underlying Python configuration logic was refactored into a class-based generator system with a factory pattern to support multiple Slurm versions (24.11, 25.05, and 25.11). Feedback was provided regarding the renaming of generate_configs_slurm_v2505 to generate_configs_slurm_v2511, which may cause breaking changes for any external integrations relying on the specific function name.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the Slurm configuration generation logic into a versioned, class-based structure to support Slurm versions 24.11, 25.05, and 25.11. It introduces a factory pattern to select the correct generator and adds an experimental variable to toggle new features, such as enable_async_reply for version 25.11. Feedback was provided to improve the robustness of the experimental parameter handling in conf_v2511.py by using a more defensive approach when modifying the configuration dictionary.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the Slurm configuration generation logic into a version-aware factory pattern, enabling support for Slurm versions 25.05 and 25.11. It introduces a new experimental variable to toggle features like enable_async_reply for newer Slurm releases. The review feedback suggests improving consistency by using attribute access for configuration parameters and identifies a potentially redundant compatibility shim in conf_v2505.py that could be removed to simplify the codebase.
Summary
This PR adds support for the experimental Asynchronous Reply Mode (enable_async_reply) in the Slurm controller module (schedmd-slurm-gcp-v6-controller). This feature allows Slurm to offload socket connections directly to the Linux kernel network stack, protecting the controller from thread starvation and deadlocks caused by slow clients.
Problem Statement
The Slurm controller processes connection sockets sequentially in user-space. If clients are extremely slow to read, or if there is network lag (like a TCP Window Stall), worker threads can get stuck retrying connections. This consumes CPU cycles and can quickly exhaust the thread pool, making the controller slow or completely unresponsive.
Changes Made
Testing and Results
I tested this by simulating a TCP Window Stall with 250 simultaneous slow-reading connections hitting the controller's port (6820).
System Call Tracing (strace)
This shows the thread (pid 2966) repeatedly context-switching back to file descriptor 3 to manually check for data using recvfrom, stalling on an EAGAIN block while other threads handle local security transactions.
Under async mode in Slurm 25.11, the threads offload the socket teardowns instantly and move directly to the next task.
Unit Tests & Coverage
Documentation
Module READMEs automatically updated via terraform-docs hooks.
Usage Example