Setup:
GPU Operator with custom changes:
NVIDIA/gpu-operator@master...Dragoncell:gpu-operator:master-gke
The following command installs the GPU Operator with CDI enabled, using the GPU driver installed by COS:
helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator \
  --set driver.enabled=false \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set operator.runtimeClass=nvidia-cdi \
  --set hostRoot=/ \
  --set driverRoot=/home/kubernetes/bin/nvidia \
  --set devRoot=/ \
  --set operator.repository=gcr.io/jiamingxu-gke-dev \
  --set operator.version=v0422_04 \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia \
  --set toolkit.repository=gcr.io/jiamingxu-gke-dev \
  --set toolkit.version=v4 \
  --set validator.repository=gcr.io/jiamingxu-gke-dev \
  --set validator.version=v0417_1 \
  --set devicePlugin.version=v0422_4 \
  --set devicePlugin.repository=gcr.io/jiamingxu-gke-dev
While the CDI specs are generated, both in the toolkit container (management CDI spec) and in the k8s device plugin (workload CDI spec), several warning-level messages are logged.
Both:
- Could not find ld.so.cache
time="2024-04-22T19:37:03Z" level=warning msg="Could not find ld.so.cache at /host/home/kubernetes/bin/nvidia/etc/ld.so.cache; creating empty cache"
time="2024-04-22T19:37:03Z" level=info msg="Using driver version 535.129.03"
time="2024-04-22T19:37:03Z" level=warning msg="Could not find ld.so.cache at /host/home/kubernetes/bin/nvidia/etc/ld.so.cache; creating empty cache"
- Feature-related sockets and files
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /nvidia-persistenced/socket: pattern /nvidia-persistenced/socket not found"
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found"
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found"
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate nvidia/535.129.03/gsp*.bin: pattern nvidia/535.129.03/gsp*.bin not found"
k8s device plugin only:
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate glvnd/egl_vendor.d/10_nvidia.json: pattern glvnd/egl_vendor.d/10_nvidia.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_icd.json: pattern vulkan/icd.d/nvidia_icd.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/implicit_layer.d/nvidia_layers.json: pattern vulkan/implicit_layer.d/nvidia_layers.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate egl/egl_external_platform.d/15_nvidia_gbm.json: pattern egl/egl_external_platform.d/15_nvidia_gbm.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate egl/egl_external_platform.d/10_nvidia_wayland.json: pattern egl/egl_external_platform.d/10_nvidia_wayland.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/nvoptix.bin: pattern nvidia/nvoptix.bin not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found"
....
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
Are any of these warnings worth further investigation? For example, vulkan/icd.d/nvidia_icd.json is reported as not found, but it does exist under the driver root:
/home/kubernetes/bin/nvidia/vulkan/icd.d $ ls
nvidia_icd.json
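To triage the rest of the list the same way, a small script can pull the patterns out of the "Could not locate" warnings and check whether anything matching them exists below the driver root. This is only a rough sketch: the helper names `missing_patterns` and `locate_under_root` are made up for illustration, the regular expression assumes the exact log format pasted above, and the recursive glob is a simple stand-in that does not reproduce how the toolkit or device plugin actually searches for files.

```python
import glob
import os
import re

# Matches the pattern named in a 'Could not locate <pattern>: ...' warning line.
# Assumes the logrus-style format shown in the logs above.
_WARNING_RE = re.compile(r'level=warning msg="Could not locate (?P<pattern>[^:"]+):')


def missing_patterns(log_text):
    """Return the unique patterns reported as not found, in first-seen order."""
    seen = []
    for match in _WARNING_RE.finditer(log_text):
        pattern = match.group("pattern")
        if pattern not in seen:
            seen.append(pattern)
    return seen


def locate_under_root(driver_root, patterns):
    """Map each pattern to the sorted paths below driver_root that match it."""
    found = {}
    for pattern in patterns:
        relative = pattern.lstrip("/")
        # "**" lets the pattern match at any depth below driver_root,
        # including directly under it.
        hits = glob.glob(os.path.join(driver_root, "**", relative), recursive=True)
        found[pattern] = sorted(hits)
    return found
```

Running `locate_under_root("/home/kubernetes/bin/nvidia", missing_patterns(log_text))` on the node would then separate the patterns that have no match anywhere under the driver root from those, like vulkan/icd.d/nvidia_icd.json, where a file exists but was not picked up.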