Using GPUs with Kubernetes on GKE and node auto-provisioning

3/22/2021

I'm trying to do something fairly simple: run a GPU machine in a GKE cluster using node auto-provisioning. When I deploy a Pod with a limits: nvidia.com/gpu specification, auto-provisioning correctly creates a node pool and scales up an appropriate node. However, the Pod stays Pending with the following message:

Warning FailedScheduling 59s (x5 over 2m46s) default-scheduler 0/10 nodes are available: 10 Insufficient nvidia.com/gpu.

It seems like the taints and tolerations are added correctly by GKE. It just doesn't scale up.

I've followed the instructions here: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers

To reproduce: 1. Create a new cluster in a zone with auto-provisioning that includes GPUs (I have replaced my own project name with MYPROJECT). This is the command the console generates when those options are selected:

gcloud beta container --project "MYPROJECT" clusters create "cluster-2" --zone "europe-west4-a" --no-enable-basic-auth --cluster-version "1.18.12-gke.1210" --release-channel "regular" --machine-type "e2-medium" --image-type "COS" --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/MYPROJECT/global/networks/default" --subnetwork "projects/MYPROJECT/regions/europe-west4/subnetworks/default" --default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-autoprovisioning --min-cpu 1 --max-cpu 20 --min-memory 1 --max-memory 50 --max-accelerator type="nvidia-tesla-p100",count=1 --enable-autoprovisioning-autorepair --enable-autoprovisioning-autoupgrade --autoprovisioning-max-surge-upgrade 1 --autoprovisioning-max-unavailable-upgrade 0 --enable-vertical-pod-autoscaling --enable-shielded-nodes --node-locations "europe-west4-a"
  2. Install the NVIDIA drivers by applying the driver-installer DaemonSet: kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

  3. Deploy a Pod that requests a GPU:

my-gpu-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0-runtime-ubuntu18.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1

kubectl apply -f my-gpu-pod.yaml
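
As a sanity check, once the driver installer and device plugin have run, the GPU should show up in the node's allocatable resources. A quick way to look (using the auto-provisioned node name from the output below; the jsonpath escaping is the usual way to query a dotted resource name):

kubectl get node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'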

Help would be really appreciated as I've spent quite some time on this now :)

Edit: Here are the Pod and Node specifications (the node is the one that was auto-provisioned):

Name:         my-gpu-pod
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  my-gpu-container:
    Image:      nvidia/cuda:11.0-runtime-ubuntu18.04
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
      --
    Args:
      while true; do sleep 600; done;
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9rvjz (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-9rvjz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9rvjz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason             Age                  From                Message
  ----     ------             ----                 ----                -------
  Normal   NotTriggerScaleUp  11m                  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added):
  Warning  FailedScheduling   5m54s (x6 over 11m)  default-scheduler   0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling   54s (x7 over 5m37s)  default-scheduler   0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
Name:               gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n1-standard-1
                    beta.kubernetes.io/os=linux
                    cloud.google.com/gke-accelerator=nvidia-tesla-p100
                    cloud.google.com/gke-boot-disk=pd-standard
                    cloud.google.com/gke-nodepool=nap-n1-standard-1-gpu1-18jc7z9w
                    cloud.google.com/gke-os-distribution=cos
                    cloud.google.com/machine-family=n1
                    failure-domain.beta.kubernetes.io/region=europe-west4
                    failure-domain.beta.kubernetes.io/zone=europe-west4-a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=n1-standard-1
                    topology.gke.io/zone=europe-west4-a
                    topology.kubernetes.io/region=europe-west4
                    topology.kubernetes.io/zone=europe-west4-a
Annotations:        container.googleapis.com/instance_id: 7877226485154959129
                    csi.volume.kubernetes.io/nodeid:
                      {"pd.csi.storage.gke.io":"projects/exor-arctic/zones/europe-west4-a/instances/gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2"}
                    node.alpha.kubernetes.io/ttl: 0
                    node.gke.io/last-applied-node-labels:
                      cloud.google.com/gke-accelerator=nvidia-tesla-p100,cloud.google.com/gke-boot-disk=pd-standard,cloud.google.com/gke-nodepool=nap-n1-standar...
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 22 Mar 2021 11:32:17 +0100
Taints:             nvidia.com/gpu=present:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
  AcquireTime:     <unset>
  RenewTime:       Mon, 22 Mar 2021 11:38:58 +0100
Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  KernelDeadlock                False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   KernelHasNoDeadlock             kernel has no deadlock
  ReadonlyFilesystem            False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   FilesystemIsNotReadOnly         Filesystem is not read-only
  CorruptDockerOverlay2         False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   NoCorruptDockerOverlay2         docker overlay2 is functioning properly
  FrequentUnregisterNetDevice   False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   NoFrequentUnregisterNetDevice   node is functioning properly
  FrequentKubeletRestart        False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   NoFrequentKubeletRestart        kubelet is functioning properly
  FrequentDockerRestart         False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   NoFrequentDockerRestart         docker is functioning properly
  FrequentContainerdRestart     False   Mon, 22 Mar 2021 11:37:25 +0100   Mon, 22 Mar 2021 11:32:23 +0100   NoFrequentContainerdRestart     containerd is functioning properly
  NetworkUnavailable            False   Mon, 22 Mar 2021 11:32:18 +0100   Mon, 22 Mar 2021 11:32:18 +0100   RouteCreated                    NodeController create implicit route
  MemoryPressure                False   Mon, 22 Mar 2021 11:37:49 +0100   Mon, 22 Mar 2021 11:32:17 +0100   KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure                  False   Mon, 22 Mar 2021 11:37:49 +0100   Mon, 22 Mar 2021 11:32:17 +0100   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Mon, 22 Mar 2021 11:37:49 +0100   Mon, 22 Mar 2021 11:32:17 +0100   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         True    Mon, 22 Mar 2021 11:37:49 +0100   Mon, 22 Mar 2021 11:32:19 +0100   KubeletReady                    kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:   10.164.0.16
  ExternalIP:   35.204.55.105
  InternalDNS:  gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2.c.exor-arctic.internal
  Hostname:     gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2.c.exor-arctic.internal
Capacity:
  attachable-volumes-gce-pd:  127
  cpu:                        1
  ephemeral-storage:          98868448Ki
  hugepages-2Mi:              0
  memory:                     3776196Ki
  pods:                       110
Allocatable:
  attachable-volumes-gce-pd:  127
  cpu:                        940m
  ephemeral-storage:          47093746742
  hugepages-2Mi:              0
  memory:                     2690756Ki
  pods:                       110
System Info:
  Machine ID:                 307671eefc01914a7bfacf17a48e087e
  System UUID:                307671ee-fc01-914a-7bfa-cf17a48e087e
  Boot ID:                    acd58f3b-1659-494c-b83d-427f834d23a6
  Kernel Version:             5.4.49+
  OS Image:                   Container-Optimized OS from Google
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.9
  Kubelet Version:            v1.18.12-gke.1210
  Kube-Proxy Version:         v1.18.12-gke.1210
PodCIDR:                      10.100.1.0/24
PodCIDRs:                     10.100.1.0/24
ProviderID:                   gce://exor-arctic/europe-west4-a/gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
Non-terminated Pods:          (6 in total)
  Namespace                   Name                                                              CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                              ------------  ----------  ---------------  -------------  ---
  kube-system                 fluentbit-gke-k22gv                                               100m (10%)    0 (0%)      200Mi (7%)       500Mi (19%)    6m46s
  kube-system                 gke-metrics-agent-5fblx                                           3m (0%)       0 (0%)      50Mi (1%)        50Mi (1%)      6m47s
  kube-system                 kube-proxy-gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2    100m (10%)    0 (0%)      0 (0%)           0 (0%)         6m44s
  kube-system                 nvidia-driver-installer-vmw8r                                     150m (15%)    0 (0%)      0 (0%)           0 (0%)         6m45s
  kube-system                 nvidia-gpu-device-plugin-8vqsl                                    50m (5%)      50m (5%)    10Mi (0%)        10Mi (0%)      6m45s
  kube-system                 pdcsi-node-k9brg                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         6m47s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests    Limits
  --------                   --------    ------
  cpu                        403m (42%)  50m (5%)
  memory                     260Mi (9%)  560Mi (21%)
  ephemeral-storage          0 (0%)      0 (0%)
  hugepages-2Mi              0 (0%)      0 (0%)
  attachable-volumes-gce-pd  0           0
Events:
  Type     Reason                   Age                    From             Message
  ----     ------                   ----                   ----             -------
  Normal   Starting                 6m47s                  kubelet          Starting kubelet.
  Normal   NodeAllocatableEnforced  6m47s                  kubelet          Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  6m46s (x4 over 6m47s)  kubelet          Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    6m46s (x4 over 6m47s)  kubelet          Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     6m46s (x4 over 6m47s)  kubelet          Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasSufficientPID
  Normal   NodeReady                6m45s                  kubelet          Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeReady
  Normal   Starting                 6m44s                  kube-proxy       Starting kube-proxy.
  Warning  NodeSysctlChange         6m41s                  sysctl-monitor
  Warning  ContainerdStart          6m41s                  systemd-monitor  Starting containerd container runtime...
  Warning  DockerStart              6m41s (x2 over 6m41s)  systemd-monitor  Starting Docker Application Container Engine...
  Warning  KubeletStart             6m41s                  systemd-monitor  Started Kubernetes kubelet.
-- Nemis
google-kubernetes-engine
gpu
kubernetes

3 Answers

3/26/2021

As per the Kubernetes documentation (https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#nvidia-gpu-device-plugin-used-by-gce), we are supposed to use https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml.

So can you run the following:

kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
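After that, a quick way to confirm the device plugin is running and has registered the GPU on the node (node name taken from your describe output) is something like:

kubectl get pods -n kube-system | grep nvidia

kubectl describe node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 | grep nvidia.com/gpu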
-- Sagar Velankar
Source: StackOverflow

4/1/2021

A common issue with GKE is project quotas limiting resources: this can prevent nodes from auto-provisioning or scaling up because the required resources cannot be assigned.

Maybe your project quotas for GPUs (or specifically for nvidia-tesla-p100) are set to 0 or to a number far below the requested one.

In this link there's more information about how to check your quota and how to request an increase.
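
For example, something along these lines should show the regional GPU quota (a sketch assuming gcloud is set up for your project; the region is taken from your setup and NVIDIA_P100_GPUS is the quota metric for P100 GPUs):

gcloud compute regions describe europe-west4 --flatten="quotas[]" --filter="quotas.metric:NVIDIA_P100_GPUS" --format="table(quotas.metric,quotas.limit,quotas.usage)"

If the limit shows 0, you'll need to request an increase before auto-provisioning can create the GPU node.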

Also, I see that you're using shared-core E2 instances, which are not compatible with accelerators. It shouldn't be an issue, as GKE should automatically change the machine type to N1 if it detects that the workload contains a GPU (as seen in this link), but you could still try running the cluster with other machine types such as N1.

-- verdier
Source: StackOverflow

3/21/2022

You might be having a problem with the scopes.

When using node auto-provisioning with GPUs, the auto-provisioned node pools by default do not have sufficient scopes to run the driver-installation DaemonSet. You need to manually change the default auto-provisioning scopes to enable that.

In this case the documented scopes that are required at the time of writing are:

[ "https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/compute"
]

This article mentions this very issue: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#using_node_auto-provisioning_with_gpus

You might just have to expand them and retry. Doing it manually works because the node pool then already has the necessary scopes.
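
For example, a rough sketch of the update (cluster name, zone and resource limits copied from your create command; --autoprovisioning-scopes is the flag for gcloud container clusters update, and the auto-provisioning resource limits have to be passed again when re-enabling it):

gcloud container clusters update cluster-2 --zone europe-west4-a --enable-autoprovisioning --min-cpu 1 --max-cpu 20 --min-memory 1 --max-memory 50 --max-accelerator type="nvidia-tesla-p100",count=1 --autoprovisioning-scopes=https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring,https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/compute

As far as I know, the new scopes only apply to node pools created after the change, so the existing auto-provisioned GPU pool may need to be deleted so it gets recreated.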

-- bjornaer
Source: StackOverflow