Kubernetes Deployment with GPU

4/11/2019

I am trying to deploy an ML app on Kubernetes Engine with a GPU. I created the Docker image from nvidia/cuda:9.0-runtime and built my app on top of it. When I deploy the image to Kubernetes Engine I get an error saying that it could not import libcuda.so.1:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

I looked at a few of the solutions posted but none of them seem to work.

While trying those solutions I also found that the paths listed in

echo $LD_LIBRARY_PATH

which gives

/usr/local/nvidia/lib:/usr/local/nvidia/lib64

do not seem to exist.

There also isn't a file named libcuda.so.1 (or with any other version number) anywhere in the file system, although /usr/lib/cuda/lib64 does contain the shared libraries. Am I doing anything wrong here?
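For reference, the checks above amount to roughly the following, run inside the container (the find is just a brute-force search of the whole file system):

echo $LD_LIBRARY_PATH                    # prints /usr/local/nvidia/lib:/usr/local/nvidia/lib64
ls /usr/local/nvidia/lib64               # No such file or directory
find / -name 'libcuda.so*' 2>/dev/null   # returns nothing
ls /usr/lib/cuda/lib64                   # this is where the shared libraries actually are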

-- Ranika Nisal
docker
kubernetes
tensorflow

3 Answers

6/6/2019

You are facing that issue because the NVIDIA drivers, which provide libcuda.so.1, have not been installed on the cluster. Please follow the installing-drivers section in this link. To verify the installation, you can run this command and check the logs:

kubectl logs -n kube-system ds/nvidia-driver-installer -c nvidia-driver-installer 
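In case it is useful, the installation in that section boils down to applying the NVIDIA driver installer DaemonSet. Assuming your node pool uses the default COS node image, the command from the GKE docs looks roughly like this (take the exact manifest URL from the current documentation, as it may change):

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml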
-- Sai Raghuram Kaligotla
Source: StackOverflow

4/11/2019

I'm assuming you went through the how-to documentation about GPUs on the Google Cloud website. It describes the whole process of creating a new cluster with GPUs, installing the drivers, and configuring the Pods.
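As a rough sketch of that process (the cluster name, zone and GPU type below are only placeholders), creating a cluster that has GPUs attached to its default node pool looks something like:

# create a cluster with one NVIDIA Tesla K80 per node
gcloud container clusters create my-gpu-cluster \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --zone us-central1-a

After that you still need to install the drivers (see the other answer) and request nvidia.com/gpu in the Pod spec.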

It looks like the library was installed incorrectly or somehow got broken.

As for your image, you should use one from here.

About the CUDA libraries

CUDA® is NVIDIA's parallel computing platform and programming model for GPUs. The NVIDIA device drivers you install in your cluster include the CUDA libraries.

CUDA libraries and debug utilities are made available inside the container at /usr/local/nvidia/lib64 and /usr/local/nvidia/bin, respectively.

CUDA applications running in Pods consuming NVIDIA GPUs need to dynamically discover CUDA libraries. This requires including /usr/local/nvidia/lib64 in the LD_LIBRARY_PATH environment variable.

You should use Ubuntu-based CUDA Docker base images for CUDA applications in GKE, where LD_LIBRARY_PATH is already set appropriately. The latest supported CUDA version is 9.0.
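To make that concrete, a minimal test Pod could look like the sketch below. The Pod name, image tag and command are only illustrative; the important parts are the nvidia.com/gpu limit, which is what makes GKE mount the driver libraries into the container at /usr/local/nvidia, and the LD_LIBRARY_PATH value, which you only need if your base image does not set it already:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: app
    image: nvidia/cuda:9.0-runtime-ubuntu16.04   # Ubuntu-based CUDA 9.0 image
    command: ["sh", "-c", "ls /usr/local/nvidia/lib64 && sleep 3600"]
    env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64:/usr/local/nvidia/lib
    resources:
      limits:
        nvidia.com/gpu: 1                        # tells the device plugin to mount the NVIDIA drivers
EOF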

-- Crou
Source: StackOverflow

4/11/2019

The missing libcuda.so library issue on Kubernetes is most commonly associated with using the wrong container image to run GPU workloads. Since you are already using a CUDA Docker image, try changing your CUDA version to one that is compatible with your workload. I have run into the same library-not-found error when a workload required CUDA 10.0 but the image was built on the CUDA 9.0 base.
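To check which CUDA versions the node's driver can actually support, you can exec into the running Pod (replace <pod-name> with your Pod; the paths are the ones GKE mounts once the drivers are installed):

kubectl exec -it <pod-name> -- ls /usr/local/nvidia/lib64          # libcuda.so.1 should show up here
kubectl exec -it <pod-name> -- /usr/local/nvidia/bin/nvidia-smi    # reports the installed driver version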

Most cloud providers use containerd/Docker to run their CPU workloads and nvidia-docker to provide the GPU support. nvidia-docker is a thin layer that runs on top of the NVIDIA drivers and is CUDA-agnostic; all of the CUDA library files and resources are contained solely in your container.

Hope this helps!

-- Frank Yucheng Gu
Source: StackOverflow