Skip to content

Fix the startup script for A4X MAX BM#5624

Draft
LAVEEN wants to merge 1 commit into
GoogleCloudPlatform:developfrom
LAVEEN:fixa4xstartup
Draft

Fix the startup script for A4X MAX BM#5624
LAVEEN wants to merge 1 commit into
GoogleCloudPlatform:developfrom
LAVEEN:fixa4xstartup

Conversation

@LAVEEN
Copy link
Copy Markdown
Collaborator

@LAVEEN LAVEEN commented May 7, 2026

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch #
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines #

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the startup configuration for the A4X MAX BM blueprint to ensure better compatibility with newer kernel versions and cleaner environment management. The changes focus on hardening the package installation process, updating core dependencies, and optimizing the image build process by disabling automatic background updates.

Highlights

  • Dependency Updates: Updated the startup script to ensure DKMS and kernel headers are installed, and upgraded the DOCA version to 3.3.0 to improve compatibility with kernel 6.17.
  • Security and Permissions: Improved script security by using sudo with tee for file operations and added cloud-config to disable automatic APT updates during image creation.
  • Cleanup: Removed the redundant stop_packer_early.sh shell script from the deployment configuration.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the Slurm blueprint for A4X Max GPU metal instances by installing kernel headers, upgrading DOCA to version 3.3.0, and disabling automatic APT updates. Feedback highlights the need to extract complex inline scripts into separate files, ensure apt-get update is called before installations, and clean up temporary download files. Additionally, the removal of the Packer completion signal should be reviewed for potential build timeouts, and the referenced kernel version (6.17) and GPG keyring file extensions should be corrected for accuracy.

# mft depends on dkms
apt install -y dkms
# Ensure DKMS and headers for the current kernel are installed
sudo apt-get install -y dkms linux-headers-\$(uname -r)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

It is highly recommended to run sudo apt-get update before installing packages to ensure the package index is current. Additionally, as this inline script is performing complex logic (installing headers, downloading external tools, and configuring repositories), it should be extracted into a separate script file to improve readability and maintainability.

          sudo apt-get update && sudo apt-get install -y dkms linux-headers-\\$(uname -r)
References
  1. Extract complex inline shell scripts from YAML into separate script files to improve readability and maintainability.

Comment on lines 348 to -354
sudo systemctl enable openibd
- type: shell
destination: stop_packer_early.sh
content: |
#!/bin/bash
BASEMETADATAURL=http://metadata.google.internal/computeMetadata/v1/instance/
rm \$(curl -f -H "Metadata-Flavor: Google" ${BASEMETADATAURL}/attributes/startup-script-log-dest 2> /dev/null)
gcloud compute instances add-metadata \$(hostname -s) --metadata "startup-script-status"="done" --zone $(vars.zone)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Removing the stop_packer_early.sh runner eliminates the completion signal (startup-script-status=done) sent to the instance metadata. If the image build process relies on polling this metadata to determine when the startup script has finished, this change will cause those processes to time out. Please verify if this signal is truly no longer needed.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed anymore as packer is getting stopped.

./install.sh
cd ..
# remove
rm -rf mft-4.34.0-145-aarch64-deb
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The script removes the extracted directory but leaves the downloaded tarball mft-4.34.0-145-aarch64-deb.tgz in /opt/src. It is better to clean up all temporary files to keep the resulting image size minimal. Furthermore, consider moving this complex installation logic into a separate script file.

          rm -rf mft-4.34.0-145-aarch64-deb*
References
  1. Extract complex inline shell scripts from YAML into separate script files to improve readability and maintainability.

# Update to a newer DOCA version if available to support kernel 6.17
BASE_URL=https://linux.mellanox.com/public/repo/doca
DOCA_SUFFIX=3.2.0/ubuntu24.04/arm64-sbsa/
DOCA_VERSION="3.3.0" # Verify compatibility with kernel 6.17
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment # Verify compatibility with kernel 6.17 suggests that the chosen DOCA version might not be fully validated. It is recommended to confirm compatibility and remove the speculative comment. Additionally, verify if the kernel version is indeed 6.17, as Ubuntu 24.04 typically defaults to a 6.8 kernel.

Comment on lines +324 to +325
curl $BASE_URL/GPG-KEY-Mellanox.pub | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub > /dev/null
echo "deb [signed-by=/etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub] $DOCA_URL ./" | sudo tee /etc/apt/sources.list.d/doca.list
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When using gpg --dearmor, the resulting file is binary. The conventional extension for binary keyrings in /etc/apt/trusted.gpg.d/ is .gpg. Using .pub for a binary file is misleading. Additionally, this complex sequence of commands should be moved to a separate script file to align with repository standards.

          curl $BASE_URL/GPG-KEY-Mellanox.pub | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.gpg > /dev/null
          echo "deb [signed-by=/etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.gpg] $DOCA_URL ./" | sudo tee /etc/apt/sources.list.d/doca.list
References
  1. Extract complex inline shell scripts from YAML into separate script files to improve readability and maintainability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant