GPU deployment in GKE: tensorflow_model_server: error while loading shared libraries: /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short

7/15/2019

I'm trying to deploy a model on GKE with TensorFlow Model Serving using a GPU. I created a container with Docker and it works great on a cloud VM. I'm trying to scale using GKE, but the deployment exits with the above error.

I created the GKE cluster with only one node, with a GPU (Tesla T4), and installed the drivers according to the docs.

As far as I can tell, the installation was successful (a pod named nvidia-driver-installer-tckv4 was added to the pod list on the node, and it's running without errors).
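For completeness, this is roughly how I installed the drivers; I believe this is the COS manifest from the GKE docs (there's an Ubuntu variant of the same DaemonSet in that repo):

# Apply the NVIDIA driver installer DaemonSet from the GKE docs (COS node image)
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Check that the installer pod came up and that the node advertises the GPU resource
kubectl get pods -n kube-system | grep nvidia-driver-installer
kubectl describe nodes | grep nvidia.com/gpu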

Next I created the deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: reph-deployment
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: reph
    spec:
      containers:
      - name: reph-container
        image: gcr.io/<project-id>/reph_serving_gpu
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8500
        args:
        - "--runtime=nvidia"

Then I ran kubectl create -f d1.yaml and the container exited with the above error in the logs.
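For what it's worth, this is how I pulled that error out of the pod (the pod name is a placeholder; I took it from kubectl get pods):

# List the pods created by the deployment (it uses the app=reph label)
kubectl get pods -l app=reph

# Events and container state, then the container's log with the error
kubectl describe pod <reph-pod-name>
kubectl logs <reph-pod-name>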

I also tried switching the node OS from COS to Ubuntu and running an example from the docs.

I installed the drivers as above, this time for Ubuntu, and applied this YAML taken from the GKE docs (I only changed the number of GPUs to consume):

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:10.0-runtime-ubuntu18.04
    resources:
      limits:
        nvidia.com/gpu: 1

This time I'm getting CrashLoopBackOff without anything more in the logs.
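Since the logs are empty, I tried these generic debugging commands to get more detail; nothing GKE-specific here:

# Pod events and the container's exit code/reason
kubectl describe pod my-gpu-pod

# Logs from the previous (crashed) container instance, if there are any
kubectl logs my-gpu-pod --previous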

Any idea what's wrong? I'm a total newcomer to Kubernetes and Docker, so I may be missing something trivial, but I really tried to stick to the GKE docs.

-- RT36
deployment
docker
google-kubernetes-engine
kubernetes

1 Answer

7/15/2019

OK, I think the docs aren't clear enough on this, but it seems that what was missing was adding /usr/local/nvidia/lib64 to the LD_LIBRARY_PATH environment variable. The following YAML file runs successfully:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: reph-deployment
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: reph
    spec:
      containers:
      - name: reph-container
        env: 
        - name: LD_LIBRARY_PATH
          value: "$LD_LIBRARY_PATH:/usr/local/nvidia/lib64"
        image: gcr.io/<project-id>/reph_serving_gpu
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8500
        args:
        - "--runtime=nvidia"

Here's the relevant part in the GKE docs

-- RT36
Source: StackOverflow