What is the best practice for monitoring servers with different GPU drivers using cAdvisor?


To monitor pod GPU usage with cAdvisor, we need to mount the NVML library path (/usr/lib/nvidia-418, for example) into the cAdvisor container.

Currently, I deploy cAdvisor on each node of a k8s cluster with a DaemonSet.

However, I need to support multiple NVML library paths. For example, some servers use /usr/lib/nvidia-418 while others use /usr/lib/nvidia-410, so a single hard-coded NVML path in the DaemonSet does not work for every node.

So what is the best practice in this case?

I have some ideas, but I am not sure which is best.

  1. Divide the servers by NVML path, so that all servers in one cluster use the same NVML library version.

  2. Create a soft link on every server that links /usr/lib/nvidia-418/* to /usr/lib/nvmlpath.

  3. Add an init job that runs before cAdvisor starts and creates the soft link, but I am not sure this will work.

  4. Add a sidecar to cAdvisor that creates the soft link, but there is no guarantee the sidecar finishes before cAdvisor reads the NVML path.

  5. Build a Docker image based on cAdvisor and add the soft link step to its CMD.
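
The soft link ideas above all come down to the same step: find whichever /usr/lib/nvidia-<version> directory exists on the node and point a fixed name at it. A minimal sketch of that step follows, assuming the driver directories are named nvidia-<number> and that the highest number is the one to use. The function names and the nvmlpath link name are only illustrative, and whether cAdvisor picks up the library through such a link is not verified here.

```python
import glob
import os
import re


def find_nvml_dir(base="/usr/lib"):
    """Return the highest-versioned nvidia-<N> directory under base, or None."""
    candidates = []
    for path in glob.glob(os.path.join(base, "nvidia-*")):
        # Assumption: driver directories look like nvidia-410, nvidia-418, ...
        match = re.fullmatch(r"nvidia-(\d+)", os.path.basename(path))
        if match and os.path.isdir(path):
            candidates.append((int(match.group(1)), path))
    return max(candidates)[1] if candidates else None


def link_nvml_dir(base="/usr/lib", link_name="nvmlpath"):
    """Point base/link_name at the detected driver directory; return the link path."""
    target = find_nvml_dir(base)
    if target is None:
        return None  # no driver directory on this node
    link = os.path.join(base, link_name)
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    # Relative target, so the link still resolves if base is mounted elsewhere.
    os.symlink(os.path.basename(target), tmp)
    # Rename over any previous link so reruns are safe.
    os.replace(tmp, link)
    return link
```

Calling link_nvml_dir() once per node (from an init step or an image entrypoint) would leave /usr/lib/nvmlpath pointing at the local driver directory, so the DaemonSet could mount one fixed path. The rename fails if nvmlpath already exists as a real directory rather than a link.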
-- curtank

1 Answer


I am not sure whether this is the best way, but it should save you some headaches.

I would build cAdvisor with nvidia-docker, and also set the Docker daemon to use nvidia-container-runtime as the default runtime.

The only thing your different servers then require is the NVIDIA driver, which should be okay.
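
As a rough illustration of the default runtime part, the sketch below builds the settings that usually go into the Docker daemon configuration (commonly /etc/docker/daemon.json) for this. The helper name is made up for the example, and it assumes nvidia-container-runtime is already installed and on the PATH of each node.

```python
import json


def with_nvidia_default_runtime(config):
    """Return a copy of a Docker daemon config dict with nvidia as default runtime."""
    updated = dict(config)
    runtimes = dict(updated.get("runtimes", {}))
    # Keep an existing "nvidia" entry if the node already defines one.
    runtimes.setdefault(
        "nvidia", {"path": "nvidia-container-runtime", "runtimeArgs": []}
    )
    updated["runtimes"] = runtimes
    updated["default-runtime"] = "nvidia"
    return updated


# Print the settings for a node with no existing daemon configuration.
print(json.dumps(with_nvidia_default_runtime({}), indent=2))
```

The Docker daemon has to be restarted after the file changes for the new default runtime to take effect.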

-- Matthew Yeung
Source: StackOverflow