We've experienced 4 AUTO_REPAIR_NODES events (revealed by the command `gcloud container operations list`) on our GKE cluster during the past month. The consequence of node auto-repair is that the node gets recreated and is attached to a new external IP, and the new external IP, which was not whitelisted by third-party services, eventually caused the failure of services running on the new node.
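For reference, this is how we surface just those events (assuming the standard gcloud `--filter` syntax; the zone is a placeholder for our actual zone):

```sh
# List only node auto-repair operations on the cluster
gcloud container operations list \
  --filter="operationType=AUTO_REPAIR_NODES" \
  --zone=<zone>
```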
I noticed that we have "Automatic node repair" enabled in our GKE cluster and was tempted to disable it, but before I do that, I need to know more about the situation.
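For context, this is a sketch of how the setting can be inspected and toggled per node pool (pool, cluster, and zone names are placeholders):

```sh
# Check whether auto-repair is currently enabled on a node pool
gcloud container node-pools describe <pool-name> \
  --cluster=<cluster-name> --zone=<zone> \
  --format="value(management.autoRepair)"

# Disable auto-repair for that node pool (reversible with --enable-autorepair)
gcloud container node-pools update <pool-name> \
  --cluster=<cluster-name> --zone=<zone> \
  --no-enable-autorepair
```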
My questions are:
The confusion here is that there are 'Ready' and 'NotReady' states shown when you run `kubectl get nodes`, which are reported by the kube-apiserver. But these are independent, and it's unclear from the docs how they relate to the kubelet states described here. You can also see the kubelet states (in events) when you run `kubectl describe nodes`.
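To make that concrete, a few inspection commands (the node name is a placeholder; the jsonpath query assumes the usual location of conditions in the node status):

```sh
# High-level Ready/NotReady view, as reported by the kube-apiserver
kubectl get nodes

# Full node detail, including conditions and recent kubelet events
kubectl describe node <node-name>

# Just the condition types (Ready, MemoryPressure, DiskPressure, PIDPressure, ...)
kubectl get node <node-name> -o jsonpath='{.status.conditions[*].type}'
```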
To answer some parts of the questions:
- The 'Ready'/'NotReady' states: as reported by the kube-apiserver.
- The kubelet states: for these, the kubelet will start evicting or not scheduling pods, except for Ready (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/). It's unclear from the docs how these get reported by the kube-apiserver.
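As an illustration of that eviction behavior, here is a minimal KubeletConfiguration sketch with hard eviction thresholds (the values are arbitrary examples, not recommended settings):

```yaml
# kubelet-config.yaml -- passed to the kubelet via --config
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  # When available memory drops below this, the kubelet sets the
  # MemoryPressure condition and starts evicting pods; the scheduler
  # stops placing new pods on the node.
  memory.available: "100Mi"
  # Likewise for disk: below this, DiskPressure triggers evictions.
  nodefs.available: "10%"
```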
Hope it helps!