I am trying to deploy an ML app on Google Kubernetes Engine with a GPU. I created the Docker image using nvidia/cuda:9.0-runtime as the base and built my app on top of it. When I deploy the image to Kubernetes Engine I get an error saying that it could not import libcuda.so.1:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
I looked at a few solutions that were posted, but none of them seem to work.
While trying those solutions I also found that the directories in LD_LIBRARY_PATH do not seem to exist. Running
echo $LD_LIBRARY_PATH
gives
/usr/local/nvidia/lib:/usr/local/nvidia/lib64
but neither of those paths is present in the container.
There also isn't a file named libcuda.so.1 (or any other version) anywhere in the file system, while /usr/lib/cuda/lib64 does contain the shared libraries. Am I doing anything wrong here?
You are facing that issue because you have not installed the NVIDIA drivers on the cluster. Please follow the installing drivers section in this link. To verify the installation you can run this command and check the output:
kubectl logs -n kube-system ds/nvidia-driver-installer -c nvidia-driver-installer
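For reference, on GKE those drivers are installed by deploying the NVIDIA driver installer DaemonSet from the documentation. A minimal sketch of the command, assuming Container-Optimized OS nodes (double-check the exact manifest URL against the docs for your node image):
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml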
I'm assuming you went through the how-to documentation about GPUs on the Google Cloud website. It describes the whole process of creating a new cluster with GPUs, installing the drivers and configuring the Pods.
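For reference, GPUs are attached when you create the cluster or a node pool, roughly like this (the cluster name, zone and accelerator type are placeholders to adjust for your setup):
gcloud container node-pools create gpu-pool \
    --cluster my-cluster \
    --zone us-central1-a \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --num-nodes 1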
It looks like the library was installed incorrectly or it somehow got broken.
As for your image, you should use one from here.
About the CUDA libraries
CUDA® is NVIDIA's parallel computing platform and programming model for GPUs. The NVIDIA device drivers you install in your cluster include the CUDA libraries.
CUDA libraries and debug utilities are made available inside the container at /usr/local/nvidia/lib64 and /usr/local/nvidia/bin, respectively.
CUDA applications running in Pods consuming NVIDIA GPUs need to dynamically discover CUDA libraries. This requires including /usr/local/nvidia/lib64 in the LD_LIBRARY_PATH environment variable.
You should use Ubuntu-based CUDA Docker base images for CUDA applications in GKE, where LD_LIBRARY_PATH is already set appropriately. The latest supported CUDA version is 9.0.
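In practice this means your Pod has to request a GPU so that GKE mounts those driver paths into the container. A minimal sketch of such a Pod spec, applied here via a shell heredoc (the Pod name and image are placeholders for your own):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-app
spec:
  containers:
  - name: app
    image: gcr.io/my-project/my-ml-app:latest   # your CUDA-based image
    resources:
      limits:
        nvidia.com/gpu: 1   # requesting a GPU is what triggers mounting the driver libraries
EOF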
The missing libcuda.so library issue on Kubernetes is most commonly associated with using an incorrect container image to run GPU workloads. Considering that you are already using a CUDA Docker image, try changing your CUDA version to one that is compatible with your workload. I have encountered the same "library not found" error with a workload that required CUDA 10.0 but was running on a CUDA 9.0 base image.
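If you are unsure which CUDA runtime your image actually ships, you can check from inside a running Pod. A quick sketch, assuming the usual nvidia/cuda image layout (the Pod name is a placeholder):
kubectl exec my-gpu-app -- cat /usr/local/cuda/version.txt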
Most cloud providers use containerd/Docker to run CPU workloads, and nvidia-docker to provide the GPU support. nvidia-docker is a thin layer that runs on top of the NVIDIA drivers and is CUDA-agnostic; all of the CUDA library files and resources come solely from your container image.
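A quick way to confirm that the host driver is actually exposed inside your container is to run the debug utilities the installer mounts into GPU Pods. A sketch, assuming the GKE paths quoted above (the Pod name is a placeholder):
kubectl exec my-gpu-app -- /usr/local/nvidia/bin/nvidia-smi
kubectl exec my-gpu-app -- ls /usr/local/nvidia/lib64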
Hope this helps!