To monitor per-pod GPU usage with cAdvisor, we need to mount the NVML library path (for example /usr/lib/nvidia-418) into the cAdvisor container.
Currently, I deploy cAdvisor on each node with a DaemonSet on the Kubernetes cluster.
However, I need to support multiple NVML library paths: for example, some servers use /usr/lib/nvidia-418 while others use /usr/lib/nvidia-410, so hard-coding a single path in the DaemonSet is not possible.
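For context, the relevant part of the DaemonSet pod spec currently looks roughly like this (image tag and volume name are just placeholders); the hard-coded hostPath is what breaks on nodes with a different driver version:

```yaml
# Simplified pod spec from the current cAdvisor DaemonSet.
# The hostPath is hard-coded to one driver version, which is the problem.
containers:
- name: cadvisor
  image: gcr.io/cadvisor/cadvisor:v0.47.2
  volumeMounts:
  - name: nvml
    mountPath: /usr/lib/nvidia-418
    readOnly: true
volumes:
- name: nvml
  hostPath:
    path: /usr/lib/nvidia-418
```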
So what is the best practice in this case?
I have some ideas, but I am not sure which is best:
1. Create a soft link on every server by hand, linking /usr/lib/nvidia-418/* to /usr/lib/nvmlpath, and mount /usr/lib/nvmlpath in the DaemonSet.
2. Add an init job that runs before cAdvisor starts and creates the soft link there, but I am not sure it will work (see the sketch after this list).
3. Add a sidecar to the cAdvisor pod that creates the soft link, but there is no guarantee the sidecar finishes before cAdvisor reads the NVML path.
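For idea 2, what I have in mind is roughly the following init container in the cAdvisor DaemonSet pod; as far as I understand, Kubernetes guarantees init containers complete before the regular containers start. This is only a sketch: the driver versions, image tags, and names are illustrative, not my actual config.

```yaml
# Sketch of the DaemonSet pod spec (spec.template.spec): an init container
# creates /usr/lib/nvmlpath on the host, then cAdvisor mounts that fixed path.
# Driver versions, image tags, and names are illustrative.
initContainers:
- name: link-nvml
  image: busybox:1.36
  command:
  - sh
  - -c
  - |
    # If something pre-created an empty directory at the link location, remove it.
    [ -d /host-usr-lib/nvmlpath ] && [ ! -L /host-usr-lib/nvmlpath ] && rmdir /host-usr-lib/nvmlpath
    # Link whichever driver directory exists on this node to a fixed host path.
    # The link target must be the path as seen on the host, not inside this container.
    for v in nvidia-418 nvidia-410; do
      if [ -d "/host-usr-lib/$v" ]; then
        ln -sfn "/usr/lib/$v" /host-usr-lib/nvmlpath
        break
      fi
    done
  volumeMounts:
  - name: usr-lib
    mountPath: /host-usr-lib
containers:
- name: cadvisor
  image: gcr.io/cadvisor/cadvisor:v0.47.2
  volumeMounts:
  - name: nvmlpath
    mountPath: /usr/lib/nvmlpath
    readOnly: true
volumes:
- name: usr-lib
  hostPath:
    path: /usr/lib
- name: nvmlpath
  hostPath:
    path: /usr/lib/nvmlpath
```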
I am not sure whether this is the best way, but in order to save headaches, I would build cAdvisor with nvidia-docker and also set the Docker daemon to use nvidia-container-runtime as the default runtime.
The only thing your different servers then need is the NVIDIA driver, which should be okay: the runtime mounts the driver libraries (including NVML) into the container, so the container no longer needs to know the host-specific library path.
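To make it the default runtime, the usual change is to /etc/docker/daemon.json on each node, something like the sketch below, assuming the runtime binary is at /usr/bin/nvidia-container-runtime. Also, if I remember correctly, the runtime only injects the driver libraries into containers that have NVIDIA_VISIBLE_DEVICES set (e.g. to `all`), so the cAdvisor pod spec would need that env variable.

```sh
# Make nvidia-container-runtime the default Docker runtime on a node.
# Assumes the runtime binary is at /usr/bin/nvidia-container-runtime; this
# overwrites /etc/docker/daemon.json, so merge with existing settings if needed.
cat > /etc/docker/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
systemctl restart docker
```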