I am getting two errors after deploying my object detection model for GPU-based prediction:
1. PodUnschedulable Cannot schedule pods: Insufficient nvidia
2. PodUnschedulable Cannot schedule pods: com/gpu.
I have two node pools. One of them is configured with a Tesla K80 GPU and has autoscaling enabled. The errors above appear when I deploy the serving component using a ksonnet app (as described here: https://github.com/kubeflow/examples/blob/master/object_detection/tf_serving_gpu.md#deploy-serving-component).
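(For context, an autoscaling K80 node pool like this one is typically created with something along the lines of the command below; the cluster, pool, and zone names come from the output further down, and the machine type is only a placeholder.)
# Sketch of how such a pool is usually created on GKE; adjust names and machine type
gcloud container node-pools create pool-1 \
  --cluster kuberflow-xyz \
  --zone us-central1-a \
  --accelerator type=nvidia-tesla-k80,count=1 \
  --machine-type n1-standard-4 \
  --enable-autoscaling --min-nodes 0 --max-nodes 10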
This is the output of the kubectl describe pods command:
Name: xyz-v1-5c5b57cf9c-kvjxn
Namespace: default
Node: <none>
Labels: app=xyz
pod-template-hash=1716137957
version=v1
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/xyz-v1-5c5b57cf9c
Containers:
aadhar:
Image: tensorflow/serving:1.11.1-gpu
Port: 9000/TCP
Host Port: 0/TCP
Command:
/usr/bin/tensorflow_model_server
Args:
--port=9000
--model_name=xyz
--model_base_path=gs://xyz_kuber_app-xyz-identification/export/
Limits:
cpu: 4
memory: 4Gi
nvidia.com/gpu: 1
Requests:
cpu: 1
memory: 1Gi
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
aadhar-http-proxy:
Image: gcr.io/kubeflow-images-public/tf-model-server-http-proxy:v20180606-9dfda4f2
Port: 8000/TCP
Host Port: 0/TCP
Command:
python
/usr/src/app/server.py
--port=8000
--rpc_port=9000
--rpc_timeout=10.0
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 500m
memory: 500Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-b6dpn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-b6dpn
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
nvidia.com/gpu:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20m (x5 over 21m) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable.
Warning FailedScheduling 20m (x2 over 20m) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
Warning FailedScheduling 16m (x9 over 19m) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
Normal NotTriggerScaleUp 15m (x26 over 20m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added)
Warning FailedScheduling 2m42s (x54 over 23m) default-scheduler 0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
Normal TriggeredScaleUp 13s cluster-autoscaler pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/xyz-identification/zones/us-central1-a/instanceGroups/gke-kuberflow-xyz-pool-1-9753107b-grp 1->2 (max: 10)}]
Name: mnist-deploy-gcp-b4dd579bf-sjwj7
Namespace: default
Node: gke-kuberflow-xyz-default-pool-ab1fa086-w6q3/10.128.0.8
Start Time: Thu, 14 Feb 2019 14:44:08 +0530
Labels: app=xyz-object
pod-template-hash=608813569
version=v1
Annotations: sidecar.istio.io/inject:
Status: Running
IP: 10.36.4.18
Controlled By: ReplicaSet/mnist-deploy-gcp-b4dd579bf
Containers:
xyz-object:
Container ID: docker://921717d82b547a023034e7c8be78216493beeb55dca57f4eddb5968122e36c16
Image: tensorflow/serving:1.11.1
Image ID: docker-pullable://tensorflow/serving@sha256:a01c6475c69055c583aeda185a274942ced458d178aaeb84b4b842ae6917a0bc
Ports: 9000/TCP, 8500/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/usr/bin/tensorflow_model_server
Args:
--port=9000
--rest_api_port=8500
--model_name=xyz-object
--model_base_path=gs://xyz_kuber_app-xyz-identification/export
--monitoring_config_file=/var/config/monitoring_config.txt
State: Running
Started: Thu, 14 Feb 2019 14:48:21 +0530
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 14 Feb 2019 14:45:58 +0530
Finished: Thu, 14 Feb 2019 14:48:21 +0530
Ready: True
Restart Count: 1
Limits:
cpu: 4
memory: 4Gi
Requests:
cpu: 1
memory: 1Gi
Liveness: tcp-socket :9000 delay=30s timeout=1s period=30s #success=1 #failure=3
Environment:
GOOGLE_APPLICATION_CREDENTIALS: /secret/gcp-credentials/user-gcp-sa.json
Mounts:
/secret/gcp-credentials from gcp-credentials (rw)
/var/config/ from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: mnist-deploy-gcp-config
Optional: false
gcp-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: user-gcp-sa
Optional: false
default-token-b6dpn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-b6dpn
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
The output of kubectl describe pods | grep gpu is:
Image: tensorflow/serving:1.11.1-gpu
nvidia.com/gpu: 1
nvidia.com/gpu: 1
nvidia.com/gpu:NoSchedule
Warning FailedScheduling 28m (x5 over 29m) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable.
Warning FailedScheduling 28m (x2 over 28m) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
Warning FailedScheduling 24m (x9 over 27m) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
Warning FailedScheduling 11m (x54 over 31m) default-scheduler 0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
Warning FailedScheduling 48s (x23 over 6m57s) default-scheduler 0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
I am new to Kubernetes and am not able to understand what is going wrong here.
Update: I did have an extra pod running that I was experimenting with earlier. I shut it down after @Paul Annett pointed it out, but I still get the same error.
Name: aadhar-v1-5c5b57cf9c-q8cd8
Namespace: default
Node: <none>
Labels: app=aadhar
pod-template-hash=1716137957
version=v1
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/aadhar-v1-5c5b57cf9c
Containers:
aadhar:
Image: tensorflow/serving:1.11.1-gpu
Port: 9000/TCP
Host Port: 0/TCP
Command:
/usr/bin/tensorflow_model_server
Args:
--port=9000
--model_name=aadhar
--model_base_path=gs://xyz_kuber_app-xyz-identification/export/
Limits:
cpu: 4
memory: 4Gi
nvidia.com/gpu: 1
Requests:
cpu: 1
memory: 1Gi
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
aadhar-http-proxy:
Image: gcr.io/kubeflow-images-public/tf-model-server-http-proxy:v20180606-9dfda4f2
Port: 8000/TCP
Host Port: 0/TCP
Command:
python
/usr/src/app/server.py
--port=8000
--rpc_port=9000
--rpc_timeout=10.0
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 500m
memory: 500Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-b6dpn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-b6dpn
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
nvidia.com/gpu:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal TriggeredScaleUp 3m3s cluster-autoscaler pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/xyz-identification/zones/us-central1-a/instanceGroups/gke-kuberflow-xyz-pool-1-9753107b-grp 0->1 (max: 10)}]
Warning FailedScheduling 2m42s (x2 over 2m42s) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were not ready, 1 node(s) were out of disk space.
Warning FailedScheduling 42s (x10 over 3m45s) default-scheduler 0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
Update 2: I haven't used nvidia-docker. However, the kubectl get pods -n=kube-system command gives me:
NAME READY STATUS RESTARTS AGE
event-exporter-v0.2.3-54f94754f4-vd9l5 2/2 Running 0 16h
fluentd-gcp-scaler-6d7bbc67c5-m8gt6 1/1 Running 0 16h
fluentd-gcp-v3.1.0-4wnv9 2/2 Running 0 16h
fluentd-gcp-v3.1.0-r6bd5 2/2 Running 0 51m
heapster-v1.5.3-75bdcc556f-8z4x8 3/3 Running 0 41m
kube-dns-788979dc8f-59ftr 4/4 Running 0 16h
kube-dns-788979dc8f-zrswj 4/4 Running 0 51m
kube-dns-autoscaler-79b4b844b9-9xg69 1/1 Running 0 16h
kube-proxy-gke-kuberflow-aadhaar-pool-1-57d75875-8f88 1/1 Running 0 16h
kube-proxy-gke-kuberflow-aadhaar-pool-2-10d7e787-66n3 1/1 Running 0 51m
l7-default-backend-75f847b979-2plm4 1/1 Running 0 16h
metrics-server-v0.2.1-7486f5bd67-mj99g 2/2 Running 0 16h
nvidia-device-plugin-daemonset-wkcqt 1/1 Running 0 16h
nvidia-device-plugin-daemonset-zvzlb 1/1 Running 0 51m
nvidia-driver-installer-p8qqj 0/1 Init:CrashLoopBackOff 13 51m
nvidia-gpu-device-plugin-nnpx7 1/1 Running 0 51m
This looks like an issue with the NVIDIA driver installer. If the installer never finishes, the device plugin cannot advertise nvidia.com/gpu on the node, which would explain the Insufficient nvidia.com/gpu scheduling errors.
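(One way to confirm this is to check whether any node actually advertises the GPU resource; if the command below prints nothing, the driver/device plugin setup has not completed on any node.)
# Look for nvidia.com/gpu under each node's Capacity/Allocatable
kubectl describe nodes | grep -i "nvidia.com/gpu"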
Update 3: Added the NVIDIA driver installer log. Describing the pod with kubectl describe pods nvidia-driver-installer-p8qqj -n=kube-system:
Name: nvidia-driver-installer-p8qqj
Namespace: kube-system
Node: gke-kuberflow-aadhaar-pool-2-10d7e787-66n3/10.128.0.30
Start Time: Fri, 15 Feb 2019 11:22:42 +0530
Labels: controller-revision-hash=1137413470
k8s-app=nvidia-driver-installer
name=nvidia-driver-installer
pod-template-generation=1
Annotations: <none>
Status: Pending
IP: 10.36.5.4
Controlled By: DaemonSet/nvidia-driver-installer
Init Containers:
nvidia-driver-installer:
Container ID: docker://a0b18bc13dad0d470b601ad2cafdf558a192b3a5d9ace264fd22d5b3e6130241
Image: gke-nvidia-installer:fixed
Image ID: docker-pullable://gcr.io/cos-cloud/cos-gpu-installer@sha256:e7bf3b4c77ef0d43fedaf4a244bd6009e8f524d0af4828a0996559b7f5dca091
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 32
Started: Fri, 15 Feb 2019 13:06:04 +0530
Finished: Fri, 15 Feb 2019 13:06:33 +0530
Ready: False
Restart Count: 23
Requests:
cpu: 150m
Environment: <none>
Mounts:
/boot from boot (rw)
/dev from dev (rw)
/root from root-mount (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-n5t8z (ro)
Containers:
pause:
Container ID:
Image: gcr.io/google-containers/pause:2.0
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-n5t8z (ro)
Conditions:
Type Status
Initialized False
Ready False
PodScheduled True
Volumes:
dev:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
boot:
Type: HostPath (bare host directory volume)
Path: /boot
HostPathType:
root-mount:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
default-token-n5t8z:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-n5t8z
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations:
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m36s (x437 over 107m) kubelet, gke-kuberflow-aadhaar-pool-2-10d7e787-66n3 Back-off restarting failed container
Error log from the pod via kubectl logs nvidia-driver-installer-p8qqj -n=kube-system:
Error from server (BadRequest): container "pause" in pod "nvidia-driver-installer-p8qqj" is waiting to start: PodInitializing
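(That message shows up because, without -c, the logs command targets the pod's only regular container, pause, which never starts. The failing init container's log can be fetched explicitly, for example:)
# Read the init container's log instead of the default "pause" container
kubectl logs nvidia-driver-installer-p8qqj -n kube-system -c nvidia-driver-installer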
It got fixed after I deleted all of the NVIDIA pods, deleted and recreated the node, and installed the NVIDIA drivers and device plugin again. It didn't work on the first try, though.
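(For reference, the driver install step on GKE with COS nodes is usually the documented driver-installer DaemonSet, applied roughly like this:)
# Reapply the NVIDIA driver installer DaemonSet for COS node images on GKE
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml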
The issue seems to be that the resources needed to run the pod are not available. The pod contains two containers that together request a minimum of 1.5Gi memory and 1.5 CPU, with limits of 5Gi memory and 5 CPU.
The scheduler is not able to find a node that meets these resource requirements, so the pod is not getting scheduled.
See if you can reduce the requests and limits so they fit on one of the nodes (a reduced example is sketched after the quoted resources below). I also see from the events that one of the nodes is out of disk space. Check the issues reported by kubectl describe po and take action on those items.
Limits:
cpu: 4
memory: 4Gi
nvidia.com/gpu: 1
Requests:
cpu: 1
memory: 1Gi
nvidia.com/gpu: 1
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 500m
memory: 500Mi
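A minimal sketch of a reduced resource block for the serving container (the CPU/memory values are only illustrative; keep nvidia.com/gpu at 1 if the model must run on the GPU):
resources:
  requests:
    cpu: 500m
    memory: 1Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 2
    memory: 2Gi
    nvidia.com/gpu: 1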
I see the pod is using node affinity:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-accelerator
          operator: Exists
Can you check whether the node where the pod is supposed to be deployed has the label below?
cloud.google.com/gke-accelerator
Alternatively, remove the nodeAffinity section and see if the pod gets deployed and shows as running.
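A quick way to check (the -L flag just prints the label's value as an extra column; nodes in a GKE GPU pool normally carry it automatically):
# An empty value in the added column means the node does not carry the label
kubectl get nodes -L cloud.google.com/gke-accelerator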