GKE - GPU NVIDIA - CUDA drivers don't work

11/8/2019

I have set up a Kubernetes node with an NVIDIA Tesla K80 and followed this tutorial to try to run a PyTorch Docker image with the NVIDIA and CUDA drivers working.

I have managed to install the NVIDIA DaemonSets, and I can now see the following pods:

nvidia-driver-installer-gmvgt
nvidia-gpu-device-plugin-lmj84

The problem is that even while using the recommended image nvidia/cuda:10.0-runtime-ubuntu18.04, I still can't find the NVIDIA drivers inside my pod:

root@pod-name-5f6f776c77-87qgq:/app# ls /usr/local/
bin  cuda  cuda-10.0  etc  games  include  lib  man  sbin  share  src

But the tutorial mentions:

CUDA libraries and debug utilities are made available inside the container at /usr/local/nvidia/lib64 and /usr/local/nvidia/bin, respectively.

I have also tried to test whether CUDA was working through torch.cuda.is_available(), but I get False as a return value.
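For reference, the check I run inside the pod is roughly the following (the extra version and device-count lines are just diagnostics I added, not something from the tutorial):

import torch

# Quick sanity check run inside the container.
print(torch.__version__)          # 1.3.0
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # returns False in my case
print(torch.cuda.device_count())  # 0 when the driver is not visible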

Many thanks in advance for your help.

-- lifeguru42
google-kubernetes-engine
gpu
nvidia
pytorch

1 Answer

11/8/2019

OK, so I finally made the NVIDIA drivers work.

It is mandatory to set a GPU resource limit to access the NVIDIA driver, which is weird considering my pod was already scheduled on the right node with the NVIDIA drivers installed.
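For anyone hitting the same thing, the limit I mean is the nvidia.com/gpu resource on the container spec. A minimal sketch (the pod and container names are just placeholders) looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: app
    image: nvidia/cuda:10.0-runtime-ubuntu18.04
    resources:
      limits:
        nvidia.com/gpu: 1  # requesting the GPU is what mounts /usr/local/nvidia into the container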

This made the nvidia folder accessible, but I'm still unable to make the CUDA install work with PyTorch 1.3.0. [ issue here ]

-- lifeguru42
Source: StackOverflow