feat(slurm): add support for experimental async RPC reply mode #5680

Draft
AdarshK15 wants to merge 6 commits into GoogleCloudPlatform:develop from AdarshK15:async_rpc

@AdarshK15 AdarshK15 commented May 18, 2026

Summary

This PR adds support for the experimental Asynchronous Reply Mode (enable_async_reply) in the Slurm controller module (schedmd-slurm-gcp-v6-controller). This feature allows Slurm to offload socket connections directly to the Linux kernel network stack, protecting the controller from thread starvation and deadlocks caused by slow clients.

Problem Statement

The Slurm controller processes connection sockets sequentially in user-space. If clients are extremely slow to read, or if there is network lag (like a TCP Window Stall), worker threads can get stuck retrying connections. This consumes CPU cycles and can quickly exhaust the thread pool, making the controller slow or completely unresponsive.

Changes Made

  • variables.tf & modules/slurm_files/variables.tf: Added an experimental variable block to accept the enable_async_reply flag (default: false).
  • scripts/conf.py: Appends enable_async_reply to SlurmctldParameters in slurm.conf when enabled on Slurm versions 25.11+.
  • scripts/conf_v2505.py (New File): Created a separate config file for Slurm 25.05 to isolate Slurm 25.11 features and preserve compatibility.
  • scripts/setup.py & scripts/slurmsync.py: Updated dynamic configuration setup to call respective config files for versions 25.11, 25.05, and 24.11.
  • scripts/tests/: Added test_conf_v2505.py and updated test_conf.py / common.py to test configuration rendering for Slurmctld across all versions.
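
The version gating described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/conf.py: the function names, the `ASYNC_REPLY_MIN_VERSION` constant, and the shape of the `experimental` argument are assumptions made for the example.

```python
# Hypothetical sketch of version-gated rendering of SlurmctldParameters.
# None of these names are taken from the PR; they are illustrative only.

ASYNC_REPLY_MIN_VERSION = (25, 11)  # assumed minimum, per the PR description


def parse_slurm_version(version: str) -> tuple[int, int]:
    """Turn '25.11' or '25.11.2' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def render_slurmctld_parameters(base_params, slurm_version, experimental=None):
    """Return the slurm.conf line, appending enable_async_reply only when
    the flag is set and the Slurm version is new enough."""
    params = list(base_params)  # copy so the caller's list is not mutated
    experimental = experimental or {}
    wants_async = bool(experimental.get("enable_async_reply", False))
    if wants_async and parse_slurm_version(slurm_version) >= ASYNC_REPLY_MIN_VERSION:
        if "enable_async_reply" not in params:
            params.append("enable_async_reply")
    return "SlurmctldParameters=" + ",".join(params)
```

With the flag set, a 25.11 controller would get the extra parameter, while a 25.05 controller would render the same line as before, which is the backward-compatible behavior the PR aims for.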

Testing and Results

I tested this by simulating a TCP Window Stall with 250 simultaneous slow-reading connections hitting the controller's port (6820).
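
A slow-reader load of this kind could be approximated with a small script like the one below. It is a sketch under stated assumptions, not the harness used for this PR: the helper names are hypothetical, and the `request` payload is a placeholder, since a real test would need to send a valid Slurm RPC whose reply is large enough to fill the client's receive window.

```python
import socket


def open_stalled_connections(host, port, count, request=b"", rcvbuf=1024):
    """Open `count` TCP connections that send `request` and then never read.

    A small SO_RCVBUF (set before connect so it influences the advertised
    window) makes the peer's replies back up sooner.
    """
    conns = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
        s.settimeout(5)  # bound the connect() wait
        s.connect((host, port))
        if request:
            s.sendall(request)  # placeholder payload, not a real Slurm RPC
        conns.append(s)  # keep the socket open and deliberately never recv()
    return conns


def close_all(conns):
    """Release every stalled connection."""
    for s in conns:
        s.close()
```

Holding the returned sockets open for the duration of the measurement (for example with `count=250` against port 6820, as in the test above) keeps the server-side send buffers full until `close_all` is called.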

System Call Tracing (strace)

  • Without Async Reply: Threads were trapped in CPU-wasting polling loops waiting for stalled clients, processing network requests and checking for credentials sequentially:
[pid  2965] socket(AF_UNIX, SOCK_STREAM, 0) = 27
[pid  2965] connect(27, {sa_family=AF_UNIX, sun_path="/var/run/munge/munge.socket.2"}, 110) = 0
[pid  2966] recvfrom(3, 0x7f012bd7b84e, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)

This shows the thread (pid 2966) repeatedly returning to file descriptor 3 to poll for data with recvfrom. Each call returns EAGAIN because no data is available yet, while other threads handle local security (MUNGE) transactions.

  • With Async Reply: Worker threads disengaged immediately by calling shutdown(SHUT_RD) on both the external client connections and the internal security paths, leaving the remaining socket management to the kernel:
  connect(3, {sa_family=AF_UNIX, sun_path="/var/run/munge/munge.socket.2"}, 110) = 0
  [pid  2935] shutdown(27, SHUT_RD)        = 0
  [pid  2927] shutdown(3, SHUT_RD)        = 0

Under async mode in Slurm 25.11, the threads offload the socket teardowns instantly and move directly to the next task.
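
The two syscall patterns in the traces above can be reproduced in isolation. The snippet below only illustrates the socket semantics on Linux (a non-blocking read with no data fails with EAGAIN, while a read after `shutdown(SHUT_RD)` returns end-of-file immediately); it makes no claim about how slurmctld is implemented, and the function name is hypothetical.

```python
import errno
import socket


def demo_eagain_then_shutdown():
    """Return (saw_eagain, data_after_shutdown) for a local socket pair."""
    a, b = socket.socketpair()
    try:
        a.setblocking(False)
        try:
            a.recv(1)  # nothing has been sent, so this cannot succeed
            saw_eagain = False
        except BlockingIOError as exc:
            # EAGAIN and EWOULDBLOCK share a value on Linux
            saw_eagain = exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK)

        # After the read side is shut down, recv reports EOF instead of
        # asking the caller to retry.
        a.shutdown(socket.SHUT_RD)
        data_after_shutdown = a.recv(1)
        return saw_eagain, data_after_shutdown
    finally:
        a.close()
        b.close()
```

A thread that keeps seeing EAGAIN has to come back and retry, whereas a thread that has shut down the read side has nothing left to wait for on that descriptor.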

Unit Tests & Coverage

  • Ran pytest on all configuration rendering modules.
  • Validated rendering of SlurmctldParameters=enable_async_reply for Slurm v25.11+ and verified backward compatibility for older Slurm versions (v25.05 and v24.11).

Documentation

Module READMEs automatically updated via terraform-docs hooks.

Usage Example

- group: primary
  modules:
    - source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
      settings:
        experimental:
          enable_async_reply: true # Enables experimental async reply mode on Slurm 25.11+

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for the experimental Asynchronous Reply Mode in the Slurm controller. By enabling this feature, the controller can offload socket connections to the kernel network stack, effectively mitigating thread starvation and deadlocks caused by slow clients. The changes include updates to Terraform variables, dynamic configuration generation logic, and comprehensive test coverage to ensure compatibility across different Slurm versions.

Highlights

  • Experimental Feature Support: Added support for the experimental Asynchronous Reply Mode (enable_async_reply) to improve controller resilience against slow clients.
  • Version-Specific Configuration Routing: Implemented logic to route configuration generation based on Slurm version, ensuring compatibility for 25.11, 25.05, and 24.11.
  • Backward Compatibility: Created a new configuration module (conf_v2505.py) to preserve stable configuration behavior for Slurm 25.05.
  • Test Coverage: Expanded unit tests to validate configuration rendering for the new experimental flag and verified compatibility across multiple Slurm versions.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for Slurm version 25.11 and adds an experimental configuration variable to the Slurm controller module, which currently supports enabling enable_async_reply. The changes involve creating a new version-specific configuration file conf_v2505.py, updating version-checking logic in setup and synchronization scripts, and expanding the test suite. Review feedback suggests refactoring conf_v2505.py to reduce code duplication with conf.py and identifies several opportunities to simplify the test code by removing unnecessary mocks and redundant variable assignments.

@AdarshK15

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an experimental configuration variable to the Slurm controller, enabling the enable_async_reply feature for Slurm version 25.11. The underlying Python configuration logic was refactored into a class-based generator system with a factory pattern to support multiple Slurm versions (24.11, 25.05, and 25.11). Feedback was provided regarding the renaming of generate_configs_slurm_v2505 to generate_configs_slurm_v2511, which may cause breaking changes for any external integrations relying on the specific function name.

@AdarshK15

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the Slurm configuration generation logic into a versioned, class-based structure to support Slurm versions 24.11, 25.05, and 25.11. It introduces a factory pattern to select the correct generator and adds an experimental variable to toggle new features, such as enable_async_reply for version 25.11. Feedback was provided to improve the robustness of the experimental parameter handling in conf_v2511.py by using a more defensive approach when modifying the configuration dictionary.
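
For readers unfamiliar with the pattern, a versioned generator hierarchy with a factory might look roughly like the sketch below. The class names, the `get_generator` function, and the config dictionary keys are all hypothetical and are not taken from conf_v2511.py; the "defensive" handling shown (tolerating a missing or null `experimental` block and never mutating the input dictionary) is one possible reading of the feedback, not a description of the actual fix.

```python
class ConfGeneratorV2411:
    """Baseline generator (hypothetical) for Slurm 24.11."""

    version = (24, 11)

    def slurmctld_parameters(self, cfg):
        # Copy so that callers never see their config mutated.
        return list(cfg.get("slurmctld_parameters") or [])


class ConfGeneratorV2505(ConfGeneratorV2411):
    """Slurm 25.05: same rendering as the baseline in this sketch."""

    version = (25, 5)


class ConfGeneratorV2511(ConfGeneratorV2505):
    """Slurm 25.11: adds the experimental async reply flag."""

    version = (25, 11)

    def slurmctld_parameters(self, cfg):
        params = super().slurmctld_parameters(cfg)
        experimental = cfg.get("experimental") or {}  # tolerate None/missing
        if experimental.get("enable_async_reply") and "enable_async_reply" not in params:
            params.append("enable_async_reply")
        return params


_GENERATORS = (ConfGeneratorV2511, ConfGeneratorV2505, ConfGeneratorV2411)  # newest first


def get_generator(version):
    """Pick the newest generator whose version does not exceed `version`."""
    for cls in _GENERATORS:
        if version >= cls.version:
            return cls()
    raise ValueError(f"unsupported Slurm version: {version}")
```

Keeping the 25.11-only behavior in its own subclass means the older generators stay untouched when new experimental flags are added.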

@AdarshK15

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the Slurm configuration generation logic into a version-aware factory pattern, enabling support for Slurm versions 25.05 and 25.11. It introduces a new experimental variable to toggle features like enable_async_reply for newer Slurm releases. The review feedback suggests improving consistency by using attribute access for configuration parameters and identifies a potentially redundant compatibility shim in conf_v2505.py that could be removed to simplify the codebase.
