Fix the startup script for A4X MAX BM#5624
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request updates the startup configuration for the A4X MAX BM blueprint to ensure better compatibility with newer kernel versions and cleaner environment management. The changes focus on hardening the package installation process, updating core dependencies, and optimizing the image build process by disabling automatic background updates. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request updates the Slurm blueprint for A4X Max GPU metal instances by installing kernel headers, upgrading DOCA to version 3.3.0, and disabling automatic APT updates. Feedback highlights the need to extract complex inline scripts into separate files, ensure apt-get update is called before installations, and clean up temporary download files. Additionally, the removal of the Packer completion signal should be reviewed for potential build timeouts, and the referenced kernel version (6.17) and GPG keyring file extensions should be corrected for accuracy.
| # mft depends on dkms | ||
| apt install -y dkms | ||
| # Ensure DKMS and headers for the current kernel are installed | ||
| sudo apt-get install -y dkms linux-headers-\$(uname -r) |
There was a problem hiding this comment.
It is highly recommended to run sudo apt-get update before installing packages to ensure the package index is current. Additionally, as this inline script is performing complex logic (installing headers, downloading external tools, and configuring repositories), it should be extracted into a separate script file to improve readability and maintainability.
sudo apt-get update && sudo apt-get install -y dkms linux-headers-\\$(uname -r)References
- Extract complex inline shell scripts from YAML into separate script files to improve readability and maintainability.
| sudo systemctl enable openibd | ||
| - type: shell | ||
| destination: stop_packer_early.sh | ||
| content: | | ||
| #!/bin/bash | ||
| BASEMETADATAURL=http://metadata.google.internal/computeMetadata/v1/instance/ | ||
| rm \$(curl -f -H "Metadata-Flavor: Google" ${BASEMETADATAURL}/attributes/startup-script-log-dest 2> /dev/null) | ||
| gcloud compute instances add-metadata \$(hostname -s) --metadata "startup-script-status"="done" --zone $(vars.zone) |
There was a problem hiding this comment.
Removing the stop_packer_early.sh runner eliminates the completion signal (startup-script-status=done) sent to the instance metadata. If the image build process relies on polling this metadata to determine when the startup script has finished, this change will cause those processes to time out. Please verify if this signal is truly no longer needed.
There was a problem hiding this comment.
this is not needed anymore as packer is getting stopped.
| ./install.sh | ||
| cd .. | ||
| # remove | ||
| rm -rf mft-4.34.0-145-aarch64-deb |
There was a problem hiding this comment.
The script removes the extracted directory but leaves the downloaded tarball mft-4.34.0-145-aarch64-deb.tgz in /opt/src. It is better to clean up all temporary files to keep the resulting image size minimal. Furthermore, consider moving this complex installation logic into a separate script file.
rm -rf mft-4.34.0-145-aarch64-deb*References
- Extract complex inline shell scripts from YAML into separate script files to improve readability and maintainability.
| # Update to a newer DOCA version if available to support kernel 6.17 | ||
| BASE_URL=https://linux.mellanox.com/public/repo/doca | ||
| DOCA_SUFFIX=3.2.0/ubuntu24.04/arm64-sbsa/ | ||
| DOCA_VERSION="3.3.0" # Verify compatibility with kernel 6.17 |
There was a problem hiding this comment.
The comment # Verify compatibility with kernel 6.17 suggests that the chosen DOCA version might not be fully validated. It is recommended to confirm compatibility and remove the speculative comment. Additionally, verify if the kernel version is indeed 6.17, as Ubuntu 24.04 typically defaults to a 6.8 kernel.
| curl $BASE_URL/GPG-KEY-Mellanox.pub | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub > /dev/null | ||
| echo "deb [signed-by=/etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub] $DOCA_URL ./" | sudo tee /etc/apt/sources.list.d/doca.list |
There was a problem hiding this comment.
When using gpg --dearmor, the resulting file is binary. The conventional extension for binary keyrings in /etc/apt/trusted.gpg.d/ is .gpg. Using .pub for a binary file is misleading. Additionally, this complex sequence of commands should be moved to a separate script file to align with repository standards.
curl $BASE_URL/GPG-KEY-Mellanox.pub | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.gpg > /dev/null
echo "deb [signed-by=/etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.gpg] $DOCA_URL ./" | sudo tee /etc/apt/sources.list.d/doca.listReferences
- Extract complex inline shell scripts from YAML into separate script files to improve readability and maintainability.
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.