I'm trying to do something fairly simple: run a GPU machine in a k8s cluster using node auto-provisioning. When deploying a Pod with a limits: nvidia.com/gpu specification, auto-provisioning correctly creates a node pool and scales up an appropriate node. However, the Pod stays in Pending with the following message:
Warning FailedScheduling 59s (x5 over 2m46s) default-scheduler 0/10 nodes are available: 10 Insufficient nvidia.com/gpu.
It seems like taints and tolerations are added correctly by GKE. It just doesn't scale up.
I've followed the instructions here: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers
To reproduce: 1. Create a new cluster in a zone with auto-provisioning that includes GPUs (I have replaced my own project name with MYPROJECT). This is the command the console produces for these settings:
gcloud beta container --project "MYPROJECT" clusters create "cluster-2" --zone "europe-west4-a" --no-enable-basic-auth --cluster-version "1.18.12-gke.1210" --release-channel "regular" --machine-type "e2-medium" --image-type "COS" --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/MYPROJECT/global/networks/default" --subnetwork "projects/MYPROJECT/regions/europe-west4/subnetworks/default" --default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-autoprovisioning --min-cpu 1 --max-cpu 20 --min-memory 1 --max-memory 50 --max-accelerator type="nvidia-tesla-p100",count=1 --enable-autoprovisioning-autorepair --enable-autoprovisioning-autoupgrade --autoprovisioning-max-surge-upgrade 1 --autoprovisioning-max-unavailable-upgrade 0 --enable-vertical-pod-autoscaling --enable-shielded-nodes --node-locations "europe-west4-a"
Install the NVIDIA drivers by deploying the driver-installer DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Deploy a Pod that requests a GPU:
my-gpu-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0-runtime-ubuntu18.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1
kubectl apply -f my-gpu-pod.yaml
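For reference, the GKE GPU docs also show pinning a Pod to a particular accelerator via a node selector, which tells auto-provisioning which GPU type to create. A sketch of the same Pod with that selector added (the label value nvidia-tesla-p100 matches the accelerator type requested at cluster creation; verify the label against your nodes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  # Ask the scheduler (and auto-provisioning) for a node
  # carrying this specific accelerator type.
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-p100
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0-runtime-ubuntu18.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1
```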
Help would be really appreciated, as I've spent quite some time on this now :)
Edit: Here are the running Pod and Node specifications (for the node that was auto-scaled):
Name: my-gpu-pod
Namespace: default
Priority: 0
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
my-gpu-container:
Image: nvidia/cuda:11.0-runtime-ubuntu18.04
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
--
Args:
while true; do sleep 600; done;
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-9rvjz (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-9rvjz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-9rvjz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 11m cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added):
Warning FailedScheduling 5m54s (x6 over 11m) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
Warning FailedScheduling 54s (x7 over 5m37s) default-scheduler 0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
Name: gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=n1-standard-1
beta.kubernetes.io/os=linux
cloud.google.com/gke-accelerator=nvidia-tesla-p100
cloud.google.com/gke-boot-disk=pd-standard
cloud.google.com/gke-nodepool=nap-n1-standard-1-gpu1-18jc7z9w
cloud.google.com/gke-os-distribution=cos
cloud.google.com/machine-family=n1
failure-domain.beta.kubernetes.io/region=europe-west4
failure-domain.beta.kubernetes.io/zone=europe-west4-a
kubernetes.io/arch=amd64
kubernetes.io/hostname=gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
kubernetes.io/os=linux
node.kubernetes.io/instance-type=n1-standard-1
topology.gke.io/zone=europe-west4-a
topology.kubernetes.io/region=europe-west4
topology.kubernetes.io/zone=europe-west4-a
Annotations: container.googleapis.com/instance_id: 7877226485154959129
csi.volume.kubernetes.io/nodeid:
{"pd.csi.storage.gke.io":"projects/exor-arctic/zones/europe-west4-a/instances/gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2"}
node.alpha.kubernetes.io/ttl: 0
node.gke.io/last-applied-node-labels:
cloud.google.com/gke-accelerator=nvidia-tesla-p100,cloud.google.com/gke-boot-disk=pd-standard,cloud.google.com/gke-nodepool=nap-n1-standar...
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 22 Mar 2021 11:32:17 +0100
Taints: nvidia.com/gpu=present:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
AcquireTime: <unset>
RenewTime: Mon, 22 Mar 2021 11:38:58 +0100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
KernelDeadlock False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 FilesystemIsNotReadOnly Filesystem is not read-only
CorruptDockerOverlay2 False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
FrequentUnregisterNetDevice False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 NoFrequentUnregisterNetDevice node is functioning properly
FrequentKubeletRestart False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Mon, 22 Mar 2021 11:37:25 +0100 Mon, 22 Mar 2021 11:32:23 +0100 NoFrequentContainerdRestart containerd is functioning properly
NetworkUnavailable False Mon, 22 Mar 2021 11:32:18 +0100 Mon, 22 Mar 2021 11:32:18 +0100 RouteCreated NodeController create implicit route
MemoryPressure False Mon, 22 Mar 2021 11:37:49 +0100 Mon, 22 Mar 2021 11:32:17 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 22 Mar 2021 11:37:49 +0100 Mon, 22 Mar 2021 11:32:17 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 22 Mar 2021 11:37:49 +0100 Mon, 22 Mar 2021 11:32:17 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 22 Mar 2021 11:37:49 +0100 Mon, 22 Mar 2021 11:32:19 +0100 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.164.0.16
ExternalIP: 35.204.55.105
InternalDNS: gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2.c.exor-arctic.internal
Hostname: gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2.c.exor-arctic.internal
Capacity:
attachable-volumes-gce-pd: 127
cpu: 1
ephemeral-storage: 98868448Ki
hugepages-2Mi: 0
memory: 3776196Ki
pods: 110
Allocatable:
attachable-volumes-gce-pd: 127
cpu: 940m
ephemeral-storage: 47093746742
hugepages-2Mi: 0
memory: 2690756Ki
pods: 110
System Info:
Machine ID: 307671eefc01914a7bfacf17a48e087e
System UUID: 307671ee-fc01-914a-7bfa-cf17a48e087e
Boot ID: acd58f3b-1659-494c-b83d-427f834d23a6
Kernel Version: 5.4.49+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.9
Kubelet Version: v1.18.12-gke.1210
Kube-Proxy Version: v1.18.12-gke.1210
PodCIDR: 10.100.1.0/24
PodCIDRs: 10.100.1.0/24
ProviderID: gce://exor-arctic/europe-west4-a/gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system fluentbit-gke-k22gv 100m (10%) 0 (0%) 200Mi (7%) 500Mi (19%) 6m46s
kube-system gke-metrics-agent-5fblx 3m (0%) 0 (0%) 50Mi (1%) 50Mi (1%) 6m47s
kube-system kube-proxy-gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 100m (10%) 0 (0%) 0 (0%) 0 (0%) 6m44s
kube-system nvidia-driver-installer-vmw8r 150m (15%) 0 (0%) 0 (0%) 0 (0%) 6m45s
kube-system nvidia-gpu-device-plugin-8vqsl 50m (5%) 50m (5%) 10Mi (0%) 10Mi (0%) 6m45s
kube-system pdcsi-node-k9brg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 6m47s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 403m (42%) 50m (5%)
memory 260Mi (9%) 560Mi (21%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-gce-pd 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 6m47s kubelet Starting kubelet.
Normal NodeAllocatableEnforced 6m47s kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 6m46s (x4 over 6m47s) kubelet Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 6m46s (x4 over 6m47s) kubelet Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 6m46s (x4 over 6m47s) kubelet Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeHasSufficientPID
Normal NodeReady 6m45s kubelet Node gke-cluster-1-nap-n1-standard-1-gpu1--39fe3143-s8x2 status is now: NodeReady
Normal Starting 6m44s kube-proxy Starting kube-proxy.
Warning NodeSysctlChange 6m41s sysctl-monitor
Warning ContainerdStart 6m41s systemd-monitor Starting containerd container runtime...
Warning DockerStart 6m41s (x2 over 6m41s) systemd-monitor Starting Docker Application Container Engine...
Warning KubeletStart 6m41s systemd-monitor Started Kubernetes kubelet.
As per the Kubernetes Documentation https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#nvidia-gpu-device-plugin-used-by-gce, we are supposed to use https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml.
So can you run:
kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
A common error with GKE involves project quotas limiting resources; this can prevent nodes from auto-provisioning or scaling up because the requested resources cannot be assigned.
Your project quota for GPUs (or specifically for nvidia-tesla-p100) may be set to 0 or to a number well below what is being requested.
This link has more information about how to check your quota and how to request an increase.
Also, I see that you're using shared-core E2 instances, which are not compatible with accelerators. This shouldn't be an issue, since GKE should automatically change the machine type to N1 if it detects that the workload contains a GPU (as described in this link), but it may still be worth trying the cluster with other machine types such as N1.
You might be having a problem with the scopes.
When using node auto-provisioning with GPUs, the auto-provisioned node pools by default do not have sufficient scopes to run the driver-installation DaemonSet. You need to change the default auto-provisioning scopes manually to enable that.
The documented scopes required at the time of writing are:
[ "https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/compute"
]
This section of the documentation mentions this very issue: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#using_node_auto-provisioning_with_gpus
You might just have to expand the scopes and retry. Manually created node pools work because they get the necessary scopes.
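One way to apply those scopes is via an auto-provisioning config file; a minimal sketch, assuming a file name of nap-config.yaml (hypothetical) and the resource limits from the cluster-creation command above — verify the exact file format against the GKE auto-provisioning docs for your version:

```yaml
# nap-config.yaml (hypothetical name) -- sketch of a node
# auto-provisioning config granting the scopes that the
# driver-installer DaemonSet needs, per the docs linked above.
resourceLimits:
- resourceType: cpu
  minimum: 1
  maximum: 20
- resourceType: memory
  minimum: 1
  maximum: 50
- resourceType: nvidia-tesla-p100
  maximum: 1
scopes:
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/compute
```

It would then be applied with something like `gcloud container clusters update cluster-2 --zone europe-west4-a --enable-autoprovisioning --autoprovisioning-config-file nap-config.yaml` (check the flag names against your gcloud version), after which newly auto-provisioned nodes should carry the compute scope the installer requires.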