I have set up a Kubernetes node with an NVIDIA Tesla K80 and followed this tutorial to try to run a PyTorch Docker image with the NVIDIA and CUDA drivers working.
I have managed to install the NVIDIA DaemonSets, and I can now see the following pods:
nvidia-driver-installer-gmvgt
nvidia-gpu-device-plugin-lmj84
The problem is that even while using the recommended image nvidia/cuda:10.0-runtime-ubuntu18.04,
I still can't find the NVIDIA drivers inside my pod:
root@pod-name-5f6f776c77-87qgq:/app# ls /usr/local/
bin cuda cuda-10.0 etc games include lib man sbin share src
But the tutorial mentions:

CUDA libraries and debug utilities are made available inside the container at
/usr/local/nvidia/lib64 and /usr/local/nvidia/bin, respectively.
I have also tried to test whether CUDA was working via torch.cuda.is_available(),
but I get False as the return value.
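For reference, this is roughly the sanity check I run inside the pod (a minimal sketch, assuming a CUDA-enabled PyTorch build; the try/except only guards environments where torch isn't installed):

```python
# Check whether PyTorch can see the GPU exposed by the device plugin.
try:
    import torch
    available = torch.cuda.is_available()  # should be True once the driver is visible
except ImportError:
    available = None  # torch not installed in this environment
print(available)
```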
Thanks in advance for your help.
OK, so I finally made the NVIDIA drivers work.
It is mandatory to set a resource limit to access the NVIDIA driver, which is weird considering that either way my pod was scheduled on the right node with the NVIDIA drivers installed.
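For anyone hitting the same issue, a pod spec requesting the GPU through a resource limit looks roughly like this (pod and container names are illustrative; nvidia.com/gpu is the resource exposed by the NVIDIA device plugin):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-gpu            # illustrative name
spec:
  containers:
  - name: pytorch              # illustrative name
    image: nvidia/cuda:10.0-runtime-ubuntu18.04
    resources:
      limits:
        nvidia.com/gpu: 1      # without this limit the driver is not mounted into the pod
```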
This made the nvidia folder accessible, but I'm still unable to make the CUDA install work with PyTorch 1.3.0. [ issue here ]