I have set up a Kubernetes node with an NVIDIA Tesla K80 and followed this tutorial to try to run a PyTorch Docker image with the NVIDIA drivers and CUDA drivers working.
My NVIDIA drivers and CUDA drivers are all accessible inside my pod at /usr/local:
> ls /usr/local
bin cuda cuda-10.0 etc games include lib man nvidia sbin share src
And my GPU is also recognized by my image nvidia/cuda:10.0-runtime-ubuntu18.04:
> /usr/local/nvidia/bin/nvidia-smi
Fri Nov 8 16:24:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 73C P8 35W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But after installing PyTorch 1.3.0, I'm not able to make PyTorch recognize my CUDA installation, even with LD_LIBRARY_PATH set to /usr/local/nvidia/lib64:/usr/local/cuda/lib64:
> python3 -c "import torch; print(torch.cuda.is_available())"
False
> python3
Python 3.6.8 (default, Oct 7 2019, 12:59:55)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print ('\t\ttorch.cuda.current_device() =', torch.cuda.current_device())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 386, in current_device
_lazy_init()
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 192, in _lazy_init
_check_driver()
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 111, in _check_driver
of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError:
The NVIDIA driver on your system is too old (found version 10000).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.
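For reference, a quick way to see which versions PyTorch itself reports is something like the following (the _cuda_getDriverVersion helper is the internal function shown in the traceback above, so treat it as an implementation detail that may change between releases; the LD_LIBRARY_PATH value is the one mentioned earlier):
> export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64
> python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
> python3 -c "import torch; print(torch._C._cuda_getDriverVersion())"
The 10000 in the error message comes from that last call, as the format string in the traceback shows.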
The error above is strange because the CUDA version of my image is 10.0, and Google GKE mentions that:
The latest supported CUDA version is 10.0
Also, it's GKE's DaemonSet that automatically installs the NVIDIA drivers:
After adding GPU nodes to your cluster, you need to install NVIDIA's device drivers to the nodes.
Google provides a DaemonSet that automatically installs the drivers for you. Refer to the section below for installation instructions for Container-Optimized OS (COS) and Ubuntu nodes.
To deploy the installation DaemonSet, run the following command: kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
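As a side note, a quick way to sanity-check that the driver installer actually ran on the GPU node (assuming it runs in the kube-system namespace, as in the manifest above) would be something like:
> kubectl get pods -n kube-system | grep nvidia
> kubectl describe nodes | grep nvidia.com/gpu
The second command should show nvidia.com/gpu under the node's capacity and allocatable resources once the drivers are installed.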
I have tried everything I could think of, without success...
I have resolved my problem by downgrading my PyTorch version, building my Docker images from pytorch/pytorch:1.2-cuda10.0-cudnn7-devel.
I still don't really know why it was not working before, other than guessing that PyTorch 1.3.0 is not compatible with CUDA 10.0.
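If it helps anyone, a minimal check that the downgraded image actually sees the GPU could look like this (torch.cuda.get_device_name is standard PyTorch API; the index 0 assumes a single GPU, as on my node):
> python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
> python3 -c "import torch; print(torch.cuda.get_device_name(0))"
With the pytorch/pytorch:1.2-cuda10.0-cudnn7-devel base image, the first command should now print True for torch.cuda.is_available().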