GCP Kubernetes nodes with GPUs get preempted too soon

11/6/2019

I've got a Kubeflow k8s cluster with a custom GPU-powered preemptible node pool in us-central1-a.

I run a Kubeflow notebook server on these GPU nodes. For some mysterious reason, the nodes get a compute.instances.preempted message very soon after they start (within 5-10 minutes).

Why is this happening?

-- orkenstein
google-cloud-platform
google-kubernetes-engine
kubeflow
kubernetes

1 Answer

11/6/2019

Since you have created a pool of preemptible nodes, this is pretty much expected behavior. GCE can terminate preemptible instances at any time. The only real guarantees are that you won't be charged for an instance that runs for less than a minute (although you will still be charged for any premium OS you requested, and COS is not one of those), and that an instance will always be preempted after 24 hours.

GPU nodes are likely to be in high demand, so preemptible GPU capacity can be reclaimed quickly. As with other preemptible instances, how often this happens depends on the particular zone and the time of day. If you need the instances to stay available, you should use full-price (non-preemptible) instances. On GKE, you can autoscale GPU node pools to help control costs.
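As a rough illustration of the autoscaling suggestion, here is a minimal sketch that builds a `gcloud container node-pools create` command for a non-preemptible, autoscaled GPU pool. The cluster name, pool name, GPU type and node counts are placeholders I made up (the question does not state them), and the helper `gpu_pool_command` is hypothetical; the flags themselves are standard gcloud flags.

```python
import shlex


def gpu_pool_command(cluster, pool, zone, gpu_type, gpu_count=1,
                     initial_nodes=1, min_nodes=0, max_nodes=3,
                     preemptible=False):
    """Build the gcloud argument list for an autoscaled GPU node pool.

    All argument values are illustrative assumptions, not taken from
    the original question.
    """
    cmd = [
        "gcloud", "container", "node-pools", "create", pool,
        "--cluster", cluster,
        "--zone", zone,
        # Attach GPUs to every node in the pool.
        "--accelerator", f"type={gpu_type},count={gpu_count}",
        "--num-nodes", str(initial_nodes),
        # Let the cluster autoscaler shrink the pool when GPUs are idle.
        "--enable-autoscaling",
        "--min-nodes", str(min_nodes),
        "--max-nodes", str(max_nodes),
    ]
    if preemptible:
        # Cheaper, but nodes can be reclaimed at any time.
        cmd.append("--preemptible")
    return cmd


if __name__ == "__main__":
    # Placeholder names; replace with your own cluster, pool and GPU type.
    cmd = gpu_pool_command("my-kubeflow-cluster", "gpu-on-demand",
                           "us-central1-a", "nvidia-tesla-k80")
    # Print the command for review; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

With `min_nodes=0` the pool can scale down to zero when no pod requests a GPU, so you pay full price only while the notebook server is actually scheduled.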

-- robsiemb
Source: StackOverflow