Pods unschedulable error while deploying TensorFlow Serving model to Kubernetes using GPUs

2/14/2019

I am getting two errors after deploying my object detection model for prediction using GPUs:

1. PodUnschedulable Cannot schedule pods: Insufficient nvidia

2. PodUnschedulable Cannot schedule pods: com/gpu

(These appear to be one message, the resource name nvidia.com/gpu, split across two lines by the dashboard.)

I have two node pools. One of them is configured with a Tesla K80 GPU and has autoscaling enabled. I get these errors when I deploy the serving component using a ksonnet app (as described here: https://github.com/kubeflow/examples/blob/master/object_detection/tf_serving_gpu.md#deploy-serving-component).
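
For reference, one way to check whether the GPU nodes are actually advertising the nvidia.com/gpu resource (standard kubectl, not part of the linked guide) is:

    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"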

This is the output of the kubectl describe pods command:

  Name:           xyz-v1-5c5b57cf9c-kvjxn
  Namespace:      default
  Node:           <none>
  Labels:         app=xyz
                  pod-template-hash=1716137957
                  version=v1
  Annotations:    <none>
  Status:         Pending
  IP:             
  Controlled By:  ReplicaSet/xyz-v1-5c5b57cf9c
  Containers:
    aadhar:
      Image:      tensorflow/serving:1.11.1-gpu
      Port:       9000/TCP
      Host Port:  0/TCP
      Command:
        /usr/bin/tensorflow_model_server
      Args:
        --port=9000
        --model_name=xyz
        --model_base_path=gs://xyz_kuber_app-xyz-identification/export/
      Limits:
        cpu:             4
        memory:          4Gi
        nvidia.com/gpu:  1
      Requests:
        cpu:             1
        memory:          1Gi
        nvidia.com/gpu:  1
      Environment:       <none>
      Mounts:
        /var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
    aadhar-http-proxy:
      Image:      gcr.io/kubeflow-images-public/tf-model-server-http-proxy:v20180606-9dfda4f2
      Port:       8000/TCP
      Host Port:  0/TCP
      Command:
        python
        /usr/src/app/server.py
        --port=8000
        --rpc_port=9000
        --rpc_timeout=10.0
      Limits:
        cpu:     1
        memory:  1Gi
      Requests:
        cpu:        500m
        memory:     500Mi
      Environment:  <none>
      Mounts:
        /var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
  Conditions:
    Type           Status
    PodScheduled   False 
  Volumes:
    default-token-b6dpn:
      Type:        Secret (a volume populated by a Secret)
      SecretName:  default-token-b6dpn
      Optional:    false
  QoS Class:       Burstable
  Node-Selectors:  <none>
  Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                   node.kubernetes.io/unreachable:NoExecute for 300s
                   nvidia.com/gpu:NoSchedule
  Events:
    Type     Reason             Age                   From                Message
    ----     ------             ----                  ----                -------
    Warning  FailedScheduling   20m (x5 over 21m)     default-scheduler   0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable.
    Warning  FailedScheduling   20m (x2 over 20m)     default-scheduler   0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
    Warning  FailedScheduling   16m (x9 over 19m)     default-scheduler   0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
    Normal   NotTriggerScaleUp  15m (x26 over 20m)    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added)
    Warning  FailedScheduling   2m42s (x54 over 23m)  default-scheduler   0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
    Normal   TriggeredScaleUp   13s                   cluster-autoscaler  pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/xyz-identification/zones/us-central1-a/instanceGroups/gke-kuberflow-xyz-pool-1-9753107b-grp 1->2 (max: 10)}]


  Name:           mnist-deploy-gcp-b4dd579bf-sjwj7
  Namespace:      default
  Node:           gke-kuberflow-xyz-default-pool-ab1fa086-w6q3/10.128.0.8
  Start Time:     Thu, 14 Feb 2019 14:44:08 +0530
  Labels:         app=xyz-object
                  pod-template-hash=608813569
                  version=v1
  Annotations:    sidecar.istio.io/inject: 
  Status:         Running
  IP:             10.36.4.18
  Controlled By:  ReplicaSet/mnist-deploy-gcp-b4dd579bf
  Containers:
    xyz-object:
      Container ID:  docker://921717d82b547a023034e7c8be78216493beeb55dca57f4eddb5968122e36c16
      Image:         tensorflow/serving:1.11.1
      Image ID:      docker-pullable://tensorflow/serving@sha256:a01c6475c69055c583aeda185a274942ced458d178aaeb84b4b842ae6917a0bc
      Ports:         9000/TCP, 8500/TCP
      Host Ports:    0/TCP, 0/TCP
      Command:
        /usr/bin/tensorflow_model_server
      Args:
        --port=9000
        --rest_api_port=8500
        --model_name=xyz-object
        --model_base_path=gs://xyz_kuber_app-xyz-identification/export
        --monitoring_config_file=/var/config/monitoring_config.txt
      State:          Running
        Started:      Thu, 14 Feb 2019 14:48:21 +0530
      Last State:     Terminated
        Reason:       Error
        Exit Code:    137
        Started:      Thu, 14 Feb 2019 14:45:58 +0530
        Finished:     Thu, 14 Feb 2019 14:48:21 +0530
      Ready:          True
      Restart Count:  1
      Limits:
        cpu:     4
        memory:  4Gi
      Requests:
        cpu:     1
        memory:  1Gi
      Liveness:  tcp-socket :9000 delay=30s timeout=1s period=30s #success=1 #failure=3
      Environment:
        GOOGLE_APPLICATION_CREDENTIALS:  /secret/gcp-credentials/user-gcp-sa.json
      Mounts:
        /secret/gcp-credentials from gcp-credentials (rw)
        /var/config/ from config-volume (rw)
        /var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
  Conditions:
    Type           Status
    Initialized    True 
    Ready          True 
    PodScheduled   True 
  Volumes:
    config-volume:
      Type:      ConfigMap (a volume populated by a ConfigMap)
      Name:      mnist-deploy-gcp-config
      Optional:  false
    gcp-credentials:
      Type:        Secret (a volume populated by a Secret)
      SecretName:  user-gcp-sa
      Optional:    false
    default-token-b6dpn:
      Type:        Secret (a volume populated by a Secret)
      SecretName:  default-token-b6dpn
      Optional:    false
  QoS Class:       Burstable
  Node-Selectors:  <none>
  Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                   node.kubernetes.io/unreachable:NoExecute for 300s
  Events:          <none>

The output of kubectl describe pods | grep gpu is:

    Image:      tensorflow/serving:1.11.1-gpu
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
                 nvidia.com/gpu:NoSchedule
  Warning  FailedScheduling   28m (x5 over 29m)     default-scheduler   0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable.
  Warning  FailedScheduling   28m (x2 over 28m)     default-scheduler   0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
  Warning  FailedScheduling   24m (x9 over 27m)     default-scheduler   0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling   11m (x54 over 31m)    default-scheduler   0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling   48s (x23 over 6m57s)  default-scheduler   0/3 nodes are available: 3 Insufficient nvidia.com/gpu.

I am new to Kubernetes and am not able to understand what is going wrong here.

Update: I did have an extra pod running that I was experimenting with earlier. I shut it down after @Paul Annett pointed it out, but I still get the same error.

Name:           aadhar-v1-5c5b57cf9c-q8cd8
Namespace:      default
Node:           <none>
Labels:         app=aadhar
                pod-template-hash=1716137957
                version=v1
Annotations:    <none>
Status:         Pending
IP:             
Controlled By:  ReplicaSet/aadhar-v1-5c5b57cf9c
Containers:
  aadhar:
    Image:      tensorflow/serving:1.11.1-gpu
    Port:       9000/TCP
    Host Port:  0/TCP
    Command:
      /usr/bin/tensorflow_model_server
    Args:
      --port=9000
      --model_name=aadhar
      --model_base_path=gs://xyz_kuber_app-xyz-identification/export/
    Limits:
      cpu:             4
      memory:          4Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1Gi
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
  aadhar-http-proxy:
    Image:      gcr.io/kubeflow-images-public/tf-model-server-http-proxy:v20180606-9dfda4f2
    Port:       8000/TCP
    Host Port:  0/TCP
    Command:
      python
      /usr/src/app/server.py
      --port=8000
      --rpc_port=9000
      --rpc_timeout=10.0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        500m
      memory:     500Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-b6dpn (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-b6dpn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-b6dpn
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason            Age                    From                Message
  ----     ------            ----                   ----                -------
  Normal   TriggeredScaleUp  3m3s                   cluster-autoscaler  pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/xyz-identification/zones/us-central1-a/instanceGroups/gke-kuberflow-xyz-pool-1-9753107b-grp 0->1 (max: 10)}]
  Warning  FailedScheduling  2m42s (x2 over 2m42s)  default-scheduler   0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were not ready, 1 node(s) were out of disk space.
  Warning  FailedScheduling  42s (x10 over 3m45s)   default-scheduler   0/2 nodes are available: 2 Insufficient nvidia.com/gpu.

Update 2: I haven't used nvidia-docker. However, the kubectl get pods -n=kube-system command gives me:

NAME                                                    READY   STATUS                  RESTARTS   AGE
event-exporter-v0.2.3-54f94754f4-vd9l5                  2/2     Running                 0          16h
fluentd-gcp-scaler-6d7bbc67c5-m8gt6                     1/1     Running                 0          16h
fluentd-gcp-v3.1.0-4wnv9                                2/2     Running                 0          16h
fluentd-gcp-v3.1.0-r6bd5                                2/2     Running                 0          51m
heapster-v1.5.3-75bdcc556f-8z4x8                        3/3     Running                 0          41m
kube-dns-788979dc8f-59ftr                               4/4     Running                 0          16h
kube-dns-788979dc8f-zrswj                               4/4     Running                 0          51m
kube-dns-autoscaler-79b4b844b9-9xg69                    1/1     Running                 0          16h
kube-proxy-gke-kuberflow-aadhaar-pool-1-57d75875-8f88   1/1     Running                 0          16h
kube-proxy-gke-kuberflow-aadhaar-pool-2-10d7e787-66n3   1/1     Running                 0          51m
l7-default-backend-75f847b979-2plm4                     1/1     Running                 0          16h
metrics-server-v0.2.1-7486f5bd67-mj99g                  2/2     Running                 0          16h
nvidia-device-plugin-daemonset-wkcqt                    1/1     Running                 0          16h
nvidia-device-plugin-daemonset-zvzlb                    1/1     Running                 0          51m
nvidia-driver-installer-p8qqj                           0/1     Init:CrashLoopBackOff   13         51m
nvidia-gpu-device-plugin-nnpx7                          1/1     Running                 0          51m

This looks like an issue with the NVIDIA driver installer.
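
To see whether the two NVIDIA DaemonSets are healthy overall (a generic check):

    kubectl get daemonset -n kube-system | grep nvidia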

Update 3: Added the NVIDIA driver installer log. Describing the pod with kubectl describe pods nvidia-driver-installer-p8qqj -n=kube-system:

Name:           nvidia-driver-installer-p8qqj
Namespace:      kube-system
Node:           gke-kuberflow-aadhaar-pool-2-10d7e787-66n3/10.128.0.30
Start Time:     Fri, 15 Feb 2019 11:22:42 +0530
Labels:         controller-revision-hash=1137413470
                k8s-app=nvidia-driver-installer
                name=nvidia-driver-installer
                pod-template-generation=1
Annotations:    <none>
Status:         Pending
IP:             10.36.5.4
Controlled By:  DaemonSet/nvidia-driver-installer
Init Containers:
  nvidia-driver-installer:
    Container ID:   docker://a0b18bc13dad0d470b601ad2cafdf558a192b3a5d9ace264fd22d5b3e6130241
    Image:          gke-nvidia-installer:fixed
    Image ID:       docker-pullable://gcr.io/cos-cloud/cos-gpu-installer@sha256:e7bf3b4c77ef0d43fedaf4a244bd6009e8f524d0af4828a0996559b7f5dca091
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    32
      Started:      Fri, 15 Feb 2019 13:06:04 +0530
      Finished:     Fri, 15 Feb 2019 13:06:33 +0530
    Ready:          False
    Restart Count:  23
    Requests:
      cpu:        150m
    Environment:  <none>
    Mounts:
      /boot from boot (rw)
      /dev from dev (rw)
      /root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-n5t8z (ro)
Containers:
  pause:
    Container ID:   
    Image:          gcr.io/google-containers/pause:2.0
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-n5t8z (ro)
Conditions:
  Type           Status
  Initialized    False 
  Ready          False 
  PodScheduled   True 
Volumes:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  
  boot:
    Type:          HostPath (bare host directory volume)
    Path:          /boot
    HostPathType:  
  root-mount:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  default-token-n5t8z:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-n5t8z
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason   Age                     From                                                 Message
  ----     ------   ----                    ----                                                 -------
  Warning  BackOff  3m36s (x437 over 107m)  kubelet, gke-kuberflow-aadhaar-pool-2-10d7e787-66n3  Back-off restarting failed container

Error output from kubectl logs nvidia-driver-installer-p8qqj -n=kube-system:

Error from server (BadRequest): container "pause" in pod "nvidia-driver-installer-p8qqj" is waiting to start: PodInitializing
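
(That error is expected here: kubectl logs defaults to the pause container, while the installer runs as an init container, so its log has to be requested explicitly. A hedged aside, using the pod name above:)

    kubectl logs nvidia-driver-installer-p8qqj -n kube-system -c nvidia-driver-installer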
-- zinngg
google-cloud-platform
google-kubernetes-engine
kubernetes
nvidia
tensorflow

2 Answers

2/18/2019

It got fixed after I deleted all of the nvidia pods, deleted and recreated the node, and installed the NVIDIA drivers and device plugin again. It didn't work on the first try, though.
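
For anyone hitting the same thing, a rough sketch of that sequence (using the node and pod names from this question; recreating the node itself goes through gcloud or the console and varies by setup):

    # Delete the stuck installer pod; its DaemonSet recreates it automatically.
    kubectl delete pod nvidia-driver-installer-p8qqj -n kube-system

    # Drain the broken node before recreating it (deleting the underlying VM
    # lets the autoscaled node pool replace it).
    kubectl drain gke-kuberflow-aadhaar-pool-2-10d7e787-66n3 --ignore-daemonsets

    # Re-apply the GKE NVIDIA driver installer DaemonSet once the node is back.
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml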

-- zinngg
Source: StackOverflow

2/14/2019

The issue seems to be that the resources needed to run the pod are not available. The pod contains two containers that together request a minimum of 1.5Gi of memory and 1.5 CPUs, with limits totalling 5Gi of memory and 5 CPUs.

The scheduler is not able to find a node that meets the resource requirements for running the pod, so it is not getting scheduled.
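
To compare those numbers against what a node can still offer, kubectl describe node lists the allocatable resources and current allocations (a generic check; substitute your node name):

    kubectl describe node <node-name> | grep -A 8 "Allocated resources"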

See if you can reduce the resource requests and limits to something one of the nodes can satisfy. I also see from the logs that one of the nodes is out of disk space; check the issues reported by kubectl describe po and take action on those items.

    Limits:
      cpu:             4
      memory:          4Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1Gi
      nvidia.com/gpu:  1
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        500m
      memory:     500Mi
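
For example, a trimmed-down spec for the serving container might look like this (a sketch only; pick values that fit your GPU nodes' allocatable capacity, and note that nvidia.com/gpu requests and limits must match):

    resources:
      limits:
        cpu: "2"
        memory: 2Gi
        nvidia.com/gpu: 1
      requests:
        cpu: 500m
        memory: 1Gi
        nvidia.com/gpu: 1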

I see the pod is using node affinity.

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists

Can you check whether the node where the pod is to be deployed has the below label:

cloud.google.com/gke-accelerator
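
One quick way to check (standard kubectl; -L prints the label value as a column):

    kubectl get nodes -L cloud.google.com/gke-accelerator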

Alternatively, remove the nodeAffinity section and see if the pod gets deployed and shows Running.

-- P Ekambaram
Source: StackOverflow