What is the best practice for monitoring servers with different GPU drivers using cAdvisor?

10/25/2019

To monitor pod GPU usage with cAdvisor, we need to mount the NVML library path (/usr/lib/nvidia-418, for example) into the cAdvisor container.

Currently, I use a DaemonSet on the k8s cluster to deploy cAdvisor on each node.

However, I need to support multiple NVML library paths. For example, some servers use /usr/lib/nvidia-418 while others use /usr/lib/nvidia-410, so hard-coding a single NVML path in the DaemonSet spec is impossible.
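For reference, the GPU-related part of my current DaemonSet looks roughly like this (a trimmed sketch; the image, tag, and namespace are illustrative, and the NVML path is hard-coded):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: cadvisor
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app: cadvisor
      template:
        metadata:
          labels:
            app: cadvisor
        spec:
          containers:
          - name: cadvisor
            image: gcr.io/cadvisor/cadvisor:v0.36.0   # illustrative image and tag
            volumeMounts:
            - name: nvml
              mountPath: /usr/lib/nvidia-418          # NVML path baked into the spec
              readOnly: true
            # usual rootfs, /var/run, /sys, /var/lib/docker mounts omitted
          volumes:
          - name: nvml
            hostPath:
              path: /usr/lib/nvidia-418               # wrong on nodes that ship nvidia-410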

So what is the best practice in this case?

I have some ideas, but I am not sure which is best.

1. Divide servers by NVML path, so that all servers in one cluster use the same NVML library version.

2. Create a soft link on every server, linking /usr/lib/nvidia-418/* to /usr/lib/nvmlpath.

3. Add an init job that creates the soft link before cAdvisor starts, but I am not sure it will work (see the sketch after this list).

4. Add a sidecar to cAdvisor that creates the soft link, but this cannot guarantee the sidecar finishes before cAdvisor reads the NVML path.

5. Build a Docker image based on cAdvisor and add the soft-link step to its CMD.
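For option 3, my understanding is that init containers are guaranteed to complete before the regular containers start, so the ordering problem from option 4 should not apply. A rough, untested sketch of what I mean (the /usr/lib/nvmlpath location and the busybox image are placeholders of mine):

    # fragment of the cadvisor DaemonSet pod spec (spec.template.spec)
    initContainers:
    - name: link-nvml
      image: busybox:1.31                        # placeholder image
      command:
      - sh
      - -c
      # link whatever /usr/lib/nvidia-* exists on this node to a fixed path
      - ln -sfn "$(ls -d /host-usr-lib/nvidia-* | head -n 1)" /host-usr-lib/nvmlpath
      volumeMounts:
      - name: host-usr-lib
        mountPath: /host-usr-lib
    containers:
    - name: cadvisor
      image: gcr.io/cadvisor/cadvisor:v0.36.0    # illustrative image and tag
      volumeMounts:
      - name: nvml
        mountPath: /usr/lib/nvmlpath
        readOnly: true
    volumes:
    - name: host-usr-lib
      hostPath:
        path: /usr/lib
    - name: nvml
      hostPath:
        path: /usr/lib/nvmlpath                  # the symlink created by the init container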
-- curtank
cadvisor
gpu
kubernetes
nvidia

1 Answer

10/25/2019

I am not sure whether this is the best way, but to save headaches I would build cAdvisor with nvidia-docker and also set the Docker daemon to use nvidia-container-runtime as its default runtime.

The only thing your different servers then need is the NVIDIA driver itself, which should be okay.
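In practice that usually means pointing /etc/docker/daemon.json at nvidia-container-runtime on every node and restarting the Docker daemon, roughly:

    {
      "default-runtime": "nvidia",
      "runtimes": {
        "nvidia": {
          "path": "nvidia-container-runtime",
          "runtimeArgs": []
        }
      }
    }

With that in place, the runtime injects the node's own driver libraries (including NVML) into containers that request them, so cAdvisor should no longer need a driver-version-specific hostPath mount.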

-- Matthew Yeung
Source: StackOverflow