Update base image driver to 580 in a3m and a4x blueprint #5484
saara-tyagi27 wants to merge 5 commits into
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request migrates the A3 Mega GPU infrastructure to Ubuntu 24.04. It includes necessary updates to NVIDIA software stacks, introduces specific workarounds for Enroot and DMABUF compatibility on newer kernels, and optimizes the deployment process by adding an early completion signal.
- type: ansible-local
  destination: install_dmabuf.yml
- type: shell
Why are we switching this to a shell script?
-i localhost, --limit localhost --connection=local \
-e @/var/tmp/slurm_vars.json \
ansible/playbook.yml
- type: ansible-local
Please confirm that this can be removed and that a gve-dkms package exists already in the image.
I confirmed the gve driver is built-in (intree: Y) on both images. I wonder if the install step was originally there because older kernels needed it for missing features.
Kernel 6.17 presumably has these features built in (this is an assumption on my part), and the old 1.4.3 package fails to compile on this new kernel, so I propose removing the step. Does that sound correct, or was there a specific reason for version 1.4.3 that I'm missing?
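For anyone who wants to repeat the in-tree check on another image, here is a minimal sketch. It assumes the module is named gve and that modinfo prints an "intree: Y" line for in-tree modules, as reported above. The helper name is_intree is hypothetical, and treating a missing intree field as out-of-tree is an assumption of this sketch, not something confirmed in this PR.

```shell
#!/usr/bin/env bash
# Classify a kernel module as in-tree or out-of-tree from modinfo-style output.
# Reads "field: value" lines on stdin and prints "in-tree" or "out-of-tree".
# Assumption: a missing "intree" field is treated as out-of-tree.
is_intree() {
  awk -F': *' '
    $1 == "intree" { found = ($2 ~ /^Y/) }
    END { print (found ? "in-tree" : "out-of-tree") }
  '
}

# Usage on a node built from the image (module name "gve" assumed):
#   modinfo gve | is_intree
```

Parsing stdin rather than calling modinfo inside the function keeps the check easy to try against saved modinfo output from both images.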
Summary
This pull request migrates the A3 Mega and A4x high GPU infrastructures to Ubuntu 24.04 and NVIDIA driver 580.
Key changes