GKE - Unable to make cuda work with pytorch

11/8/2019

I have set up a Kubernetes node with an NVIDIA Tesla K80 and followed this tutorial to try to run a PyTorch Docker image with the NVIDIA drivers and CUDA drivers working.

My NVIDIA drivers and CUDA libraries are all accessible inside my pod under /usr/local:

> ls /usr/local
bin  cuda  cuda-10.0  etc  games  include  lib  man  nvidia  sbin  share  src

And my GPU is also recognized by my image nvidia/cuda:10.0-runtime-ubuntu18.04:

> /usr/local/nvidia/bin/nvidia-smi
Fri Nov 8 16:24:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P8    35W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But after installing PyTorch 1.3.0, I am unable to get PyTorch to recognize my CUDA installation, even with LD_LIBRARY_PATH set to /usr/local/nvidia/lib64:/usr/local/cuda/lib64:

> python3 -c "import torch; print(torch.cuda.is_available())"
False

> python3
Python 3.6.8 (default, Oct 7 2019, 12:59:55)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print('\t\ttorch.cuda.current_device() =', torch.cuda.current_device())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 386, in current_device
    _lazy_init()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 192, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 111, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError: The NVIDIA driver on your system is too old (found version 10000).
Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
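When debugging this kind of mismatch, it can help to query the driver directly with ctypes, bypassing PyTorch entirely. A minimal sketch (the helper name is mine; it assumes libcuda.so.1 is visible to the loader inside the pod, and degrades to None when it isn't):

```python
import ctypes

def cuda_driver_version():
    """Return the CUDA API level the driver supports (e.g. 10000 for 10.0),
    or None if no NVIDIA driver library can be loaded."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # driver library not visible to this process
    version = ctypes.c_int(0)
    # cuDriverGetVersion returns 0 (CUDA_SUCCESS) on success
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    return version.value

print("driver CUDA API level:", cuda_driver_version())
```

If this prints 10000 (as the traceback's "found version 10000" suggests), the driver only exposes the CUDA 10.0 API, regardless of which CUDA toolkit the image ships.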

The error above is strange because my image's CUDA version is 10.0, and Google GKE's documentation mentions that:

The latest supported CUDA version is 10.0

Also, it is GKE's DaemonSet that automatically installs the NVIDIA drivers:

After adding GPU nodes to your cluster, you need to install NVIDIA's device drivers to the nodes.

Google provides a DaemonSet that automatically installs the drivers for you. Refer to the section below for installation instructions for Container-Optimized OS (COS) and Ubuntu nodes.

To deploy the installation DaemonSet, run the following command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
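For completeness, the GKE docs also require the pod itself to request the GPU, which is what triggers mounting the driver into the container. A minimal sketch of such a spec (the pod and container names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-gpu            # illustrative name
spec:
  containers:
  - name: trainer              # illustrative name
    image: nvidia/cuda:10.0-runtime-ubuntu18.04
    resources:
      limits:
        nvidia.com/gpu: 1      # schedules onto a GPU node and exposes the driver
```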

I have tried everything I could think of, without success...

-- lifeguru42
google-cloud-platform
google-kubernetes-engine
kubernetes
pytorch

1 Answer

11/13/2019

I have resolved my problem by downgrading my PyTorch version, building my Docker image from pytorch/pytorch:1.2-cuda10.0-cudnn7-devel.

I still don't really know why it was not working before, other than guessing that PyTorch 1.3.0 is not compatible with CUDA 10.0.
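One plausible explanation (my own guess, not confirmed by the question): the default pip wheel for torch 1.3.0 was built against CUDA 10.1, which needs a newer driver than the 410.79 the GKE DaemonSet installs; the "found version 10000" in the traceback is the driver's supported CUDA API level (10.0), below what a CUDA 10.1 build expects. A toy sketch of that comparison, with API levels encoded as major*1000 + minor*10:

```python
def api_level(major, minor):
    """Encode a CUDA version as the integer API level PyTorch reports."""
    return major * 1000 + minor * 10

def driver_supports(wheel_cuda, driver_cuda):
    """True if a driver exposing `driver_cuda` can run a build compiled
    against `wheel_cuda` (ignoring forward-compatibility packages)."""
    return driver_cuda >= wheel_cuda

driver = api_level(10, 0)             # K80 node: driver 410.79 -> API level 10000
torch_130_default = api_level(10, 1)  # assumption: default 1.3.0 wheel is cu101
torch_12_cu100 = api_level(10, 0)     # pytorch/pytorch:1.2-cuda10.0-cudnn7-devel

print(driver_supports(torch_130_default, driver))  # False -> "driver too old"
print(driver_supports(torch_12_cu100, driver))     # True  -> works after downgrade
```

Under that assumption, the downgrade works not because 1.3.0 is broken, but because the 1.2/cuda10.0 image bundles a build whose CUDA requirement the installed driver actually meets.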

-- lifeguru42
Source: StackOverflow