
Transitioning to Slurm Native Auth with resilient workbench keys distribution#5695

Draft
arpit974 wants to merge 6 commits into GoogleCloudPlatform:develop from arpit974:munge_to_sauth

@arpit974
Contributor

@arpit974 arpit974 commented May 21, 2026

Objective

This PR transitions the SchedMD Slurm-GCP v6 scheduler controller and associated tooling from legacy MUNGE-based authentication (auth/munge) to the secure, modern Slurm Native Authentication (auth/slurm) standard.

Support for MUNGE is deprecated and scheduled for complete removal on July 31, 2026. This PR implements a transition path that keeps full backward compatibility for legacy clusters and the daily testing pipelines until then.


Technical Highlights & Key Gaps Resolved

1. Resilient Key Distribution for Workbenches (OFE Frontend)

  • The Problem: A persistent NFS mount to the controller's key directory hangs and freezes the workbench instance if the controller reboots or GCE experiences a transient network drop.
  • Our Resolution (Resilient Temp Mount Copy): Updated the workbench startup script (workbenchinfo.py) to build a stable, RPC-compatible Slurm 24.05. The script mounts SchedMD's export target /slurm/key_distribution on a temporary directory (/mnt/clusterkey), copies slurm.key to local disk, sets strict 0400 permissions owned by slurm:slurm, then immediately unmounts and deletes the temporary mount point.
  • Upgrade Safety Shield: The workbench database fallbacks now default strictly to False (getattr(..., False)). Upgrading the dashboard therefore does not force Native Auth onto old clusters, which protects legacy users from NFS mount crashes at startup.
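
As a rough illustration of the two bullets above, the sketch below generates the one-shot copy commands and shows the False-by-default lookup. It is not the code in workbenchinfo.py. The attribute name enable_slurm_auth, the controller hostname argument, the read-only mount option, and the /etc/slurm/slurm.key destination (inferred from the CI check in this description) are all assumptions.

```python
import shlex

# Export and temp-mount paths come from the PR description.
EXPORT_PATH = "/slurm/key_distribution"
TEMP_MOUNT = "/mnt/clusterkey"
# Assumed destination, inferred from the pipeline check on /etc/slurm/slurm.key.
LOCAL_KEY = "/etc/slurm/slurm.key"


def native_auth_enabled(cluster):
    """Return the Native Auth flag, treating a missing attribute as False.

    "enable_slurm_auth" is a hypothetical attribute name used for illustration.
    """
    return bool(getattr(cluster, "enable_slurm_auth", False))


def key_copy_commands(controller_host):
    """Return, in order, the shell commands for a one-shot key copy.

    The NFS export is mounted only for the duration of the copy, so a later
    controller reboot cannot hang the workbench on a stale mount.
    """
    q = shlex.quote
    source = f"{controller_host}:{EXPORT_PATH}"
    return [
        f"mkdir -p {q(TEMP_MOUNT)}",
        f"mount -t nfs -o ro {q(source)} {q(TEMP_MOUNT)}",
        # install copies the file and sets owner, group, and mode in one step.
        f"install -o slurm -g slurm -m 0400 {q(TEMP_MOUNT + '/slurm.key')} {q(LOCAL_KEY)}",
        f"umount {q(TEMP_MOUNT)}",
        f"rmdir {q(TEMP_MOUNT)}",
    ]
```

A real startup script would likely also want to unmount when the copy step fails, for example with a shell trap, so that a failed copy does not leave the temporary mount behind.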

2. Upgraded Go Validator Settings Resolution (Pruned Blueprints)

  • Added a global-variable scope fallback to the settings resolution in Gcluster's compiler settings parser (metadata_validator_helpers.go).
  • If a setting is omitted from a module's settings block in the YAML, Gcluster automatically resolves it from the global vars: block.
  • Outcome: Explicit template blocks no longer need to be modified. All 53 reference blueprints remain identical to their master template baselines.
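
The fallback order described above can be sketched as follows. The actual change lives in Go (metadata_validator_helpers.go); this Python version only illustrates the lookup order, and the function name, the return shape, and the example setting name are hypothetical.

```python
def resolve_setting(name, module_settings, global_vars):
    """Look up a setting, preferring the module block over global vars.

    Returns a (value, source) pair where source is "module", "vars", or
    None when the setting is defined in neither place.
    """
    # An explicit module value always wins, even when it is False or empty.
    if name in module_settings:
        return module_settings[name], "module"
    # Otherwise fall back to the blueprint-wide vars: block.
    if name in global_vars:
        return global_vars[name], "vars"
    return None, None
```

For example, `resolve_setting("enable_slurm_auth", {}, {"enable_slurm_auth": True})` yields `(True, "vars")`, while a module that sets the value explicitly keeps its own setting.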

3. Unified Dynamic CI/CD Pipeline Playbooks

  • Stripped the dead, static key_type: "munge" variables out of integration playbooks (slurm-integration-test.yml).
  • Implemented a unified, non-blocking check:
    test -f /etc/slurm/slurm.key || test -S /var/run/munge/munge.socket.2
  • Automated pipelines dynamically check for either the key file (Native Auth) or the MUNGE socket, ensuring 100% backward compatibility for legacy runs without 10-minute timeout failures.
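
For readers more at home in Python than shell, the check above is roughly equivalent to the sketch below. It is an illustration only, not part of the playbooks; the function name and return values are assumptions, and the default paths are the two from the shell check.

```python
import os
import stat


def detect_auth(key_path="/etc/slurm/slurm.key",
                socket_path="/var/run/munge/munge.socket.2"):
    """Return "slurm", "munge", or None depending on what exists on disk."""
    # Equivalent of `test -f`: the key exists and is a regular file.
    if os.path.isfile(key_path):
        return "slurm"
    # Equivalent of `test -S`: the path exists and is a socket.
    try:
        if stat.S_ISSOCK(os.stat(socket_path).st_mode):
            return "munge"
    except OSError:
        # Missing or unreadable path: treat as "no MUNGE socket".
        pass
    return None
```

Like `test -S`, this only confirms that the socket file exists; it does not prove that a MUNGE daemon is answering on it.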

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request initiates the transition from legacy MUNGE-based authentication to Slurm Native Authentication. It updates the infrastructure-as-code blueprints, frontend forms, and deployment logic to support this new standard. To ensure a smooth transition, the PR includes a detailed migration guide and deprecation warnings for existing clusters, while also updating internal testing pipelines to validate the new authentication model.

Highlights

  • Slurm Native Authentication Support: Introduced the 'enable_slurm_auth' variable across numerous blueprint examples and frontend configurations to support native authentication.
  • Startup Script Updates: Updated workbench startup scripts to compile and configure Slurm Native Authentication when enabled, replacing the legacy MUNGE installation.
  • Deprecation Notice: Added deprecation warnings and metadata validation to notify users that MUNGE-based authentication is scheduled for removal on July 31, 2026.
  • Migration Documentation: Created a comprehensive migration guide to assist administrators in transitioning existing clusters to the new authentication standard.
  • Validator Enhancements: Improved Go-based metadata validators to support implicit variable binding and provide clearer error messaging.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces Slurm Native Authentication as a secure alternative to MUNGE, including a migration guide and deprecation notices for MUNGE scheduled for 2026. The changes affect numerous blueprints, the frontend workbench, Terraform modules, and the Go-based validator logic, which now supports implicit variable binding. Feedback highlights critical issues in the workbench implementation, specifically the risks of compiling Slurm from source within a startup script and the use of hardcoded hostname patterns. Additionally, improvements to shell script error handling were suggested, and the new implicit variable binding in the Go validator was flagged for better documentation and adherence to repository rules regarding programmatic injection.

Comment thread community/front-end/ofe/website/ghpcfe/cluster_manager/workbenchinfo.py Outdated
Comment thread pkg/validators/metadata_validator_helpers.go
@arpit974 arpit974 changed the title from "munge to sauth migratio prereqs." to "Transitioning to Slurm Native Auth with resilient workbench keys distribution" May 21, 2026
@arpit974 arpit974 added the release-breaking-changes label (Prevents "smooth" re-deploy across versions) May 21, 2026