I'm trying to deploy a model on GKE with TensorFlow Serving using a GPU. I created a container with Docker, and it works great on a cloud VM. I'm trying to scale using GKE, but the deployment exits with the above error.
I created the GKE cluster with only 1 node, with a GPU (Tesla T4), and installed the drivers according to the docs.
As far as I can tell this was successful: a pod named nvidia-driver-installer-tckv4
was added to the pods list on the node, and it's running without errors.
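For reference, the driver-install step I followed can be sketched as below (the DaemonSet manifest is the COS one referenced in the GKE docs; the verification commands are how I checked it, and assume kubectl access to the cluster):

```shell
# Install the NVIDIA driver DaemonSet (COS variant) from the GKE docs:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Verify: the installer pod should show up and be Running ...
kubectl get pods -n kube-system | grep nvidia-driver-installer

# ... and once the drivers are ready, the node should advertise the GPU:
kubectl describe nodes | grep -i "nvidia.com/gpu"
```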
Next I created the deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: reph-deployment
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: reph
    spec:
      containers:
      - name: reph-container
        image: gcr.io/<project-id>/reph_serving_gpu
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8500
        args:
        - "--runtime=nvidia"
Then I ran kubectl create -f d1.yaml and the container exited with the above error in the logs.
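When a container exits like this, the usual first diagnostics (assuming kubectl access; the pod name is a placeholder) are:

```shell
# Find the pod name generated for the deployment:
kubectl get pods

# Events: scheduling, image pull, resource problems, restart reasons:
kubectl describe pod <pod-name>

# Logs from the last failed container attempt:
kubectl logs <pod-name> --previous
```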
I also tried switching the node OS from COS to Ubuntu and running an example from the docs:
I installed the drivers as above, this time for Ubuntu, and applied this YAML taken from the GKE docs (only changing the number of GPUs to consume):
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:10.0-runtime-ubuntu18.04
    resources:
      limits:
        nvidia.com/gpu: 1
This time I'm getting CrashLoopBackOff, with nothing more in the logs.
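One plausible cause of the crash loop, assuming the bare nvidia/cuda runtime image's default command simply exits (it starts no long-running process): the container finishes immediately and Kubernetes keeps restarting it. A hypothetical keep-alive addition to the container spec above:

```yaml
# Hypothetical addition under my-gpu-container: keep a foreground process
# running so the container does not exit and enter CrashLoopBackOff.
command: ["sleep", "infinity"]
```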
Any idea what's wrong? I'm a total newcomer to Kubernetes and Docker, so I may be missing something trivial, but I really tried to stick to the GKE docs.
OK, I think the docs aren't clear enough on this, but it seems that what was missing was including /usr/local/nvidia/lib64
in the LD_LIBRARY_PATH
environment variable. The following YAML file runs successfully:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: reph-deployment
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: reph
    spec:
      containers:
      - name: reph-container
        env:
        - name: LD_LIBRARY_PATH
          value: "$LD_LIBRARY_PATH:/usr/local/nvidia/lib64"
        image: gcr.io/<project-id>/reph_serving_gpu
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8500
        args:
        - "--runtime=nvidia"
Here's the relevant part of the GKE docs.