Using gcloud container clusters resize
I can easily scale a cluster up and down. However, I find no way to target a specific Compute Engine VM for removal when resizing down.
Scenario: Our Compute Engine logs indicate that one instance suffers from a failure to unmount a volume left behind by a Kubernetes pod that is long since gone. The cluster is appropriately sized, and the malfunctioning node serves containers properly, but it runs at maximum CPU load.
Obviously I'd want a new Kubernetes node to be ready before I kill off the old one. Is it safe to simply resize up and then delete the instance using gcloud compute, or is there some container-aware way to do this?
I'm not sure that it's guaranteed, but both times I tried it, the scale-down removed the drained nodes. So to replace a node, I scaled up, drained the bad node, and then scaled down.
We use multi-zone clusters now, which means I needed a new way to get the instance group name. Current shell commands:
BAD_INSTANCE=[your node name from kubectl get nodes]
kubectl cordon $BAD_INSTANCE
kubectl drain $BAD_INSTANCE
gcloud compute instances describe --format='value[](metadata.items.created-by)' $BAD_INSTANCE
gcloud compute instance-groups managed delete-instances --instances=$BAD_INSTANCE --zone=[from describe output] [grp from describe output]
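The last two steps can be wired together. The created-by metadata value is a resource path that contains both the zone and the managed instance group name, so a little string parsing avoids copying them by hand. A minimal sketch, assuming the value has the usual shape `projects/<num>/zones/<zone>/instanceGroupManagers/<group>` (the example URL below is hypothetical):

```shell
#!/usr/bin/env bash
# Parse a created-by resource path into the zone and managed instance
# group name that delete-instances needs. Pure string manipulation;
# feed in the real output of:
#   gcloud compute instances describe "$BAD_INSTANCE" \
#     --format='value[](metadata.items.created-by)'
parse_created_by() {
  local url="$1" zone grp
  zone="${url#*zones/}"          # drop everything up to and including "zones/"
  zone="${zone%%/*}"             # keep only the zone segment
  grp="${url##*/}"               # last path component is the group name
  printf '%s %s\n' "$zone" "$grp"
}

# Hypothetical example value:
parse_created_by "projects/1234/zones/us-central1-a/instanceGroupManagers/gke-pool-1234-grp"
```

With the parsed values in hand, the final command becomes `gcloud compute instance-groups managed delete-instances --instances=$BAD_INSTANCE --zone=$ZONE $GRP`.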
However I find no way to target a specific compute instance vm for removal when resizing down.
There isn't a way to specify which VM to remove using the GKE API, but you can use the managed instance groups API to delete individual instances from the group. (This will shrink your number of nodes by the number of instances you delete, so if you want to replace the nodes, you will then want to scale your cluster up to compensate.) You can find the instance group name by running:
$ gcloud container clusters describe CLUSTER | grep instanceGroupManagers
Is it safe to simply resize up and then delete the instance using gcloud compute, or is there some container-aware way to do this?
If you delete an instance, the managed instance group will replace it with a new one (so this will leave you with an extra node if you scale up by one, then delete the troublesome instance). If you were not concerned about the temporary loss of capacity, you could just delete the VM and let it get recreated.
Before removing an instance, you can run kubectl drain to remove the workload from the instance. This results in faster rescheduling of pods than if you simply delete the instance and wait for the controllers to notice that it is gone.
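Putting the answer together, the safe ordering is: add capacity, drain, then delete from the instance group. A sketch of the sequence, with the commands echoed rather than executed so the order is easy to review; the node, zone, group, and cluster names below are placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical names; substitute your own.
NODE=gke-pool-bad-node ZONE=us-central1-a GROUP=gke-pool-grp CLUSTER=my-cluster

steps=(
  # 1. Add capacity first so pods have somewhere to go.
  "gcloud container clusters resize $CLUSTER --zone $ZONE --num-nodes 4"
  # 2. Drain moves workloads off the bad node (and cordons it).
  "kubectl drain $NODE --ignore-daemonsets"
  # 3. Delete the instance from its MIG; this also shrinks the group by one.
  "gcloud compute instance-groups managed delete-instances $GROUP --instances=$NODE --zone=$ZONE"
)
printf '%s\n' "${steps[@]}"
```

Step 3 shrinks the group rather than triggering a replacement, which is why the resize in step 1 comes first; if a brief capacity dip is acceptable, you can skip step 1 and let the group recreate the VM instead.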