Kubernetes Engine: Node keeps getting unhealthy and rebooted for no apparent reason

6/5/2019

My Kubernetes Engine cluster keeps rebooting one of my nodes, even though all pods on the node are "well-behaved". I've looked through the cluster's Stackdriver logs but could not find a reason. After a while, the continuous reboots usually stop, only to start again a few hours or days later.
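
In case it helps, this is roughly how I've been pulling the affected node's conditions and events outside of Stackdriver (a minimal sketch using the official Kubernetes Python client; the node name is a placeholder):

```python
# Sketch: dump conditions and recent events for one node via the Kubernetes API.
# "gke-cluster-default-pool-12345678-abcd" is a placeholder node name.
from kubernetes import client, config

config.load_kube_config()  # uses the current kubectl context
v1 = client.CoreV1Api()

node_name = "gke-cluster-default-pool-12345678-abcd"

# Node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure, ...)
node = v1.read_node(node_name)
for cond in node.status.conditions:
    print(f"{cond.type}: {cond.status} ({cond.reason}) - {cond.message}")

# Events that reference this node (NodeNotReady, Rebooted, kubelet restarts, ...)
events = v1.list_event_for_all_namespaces(
    field_selector=f"involvedObject.kind=Node,involvedObject.name={node_name}"
)
for ev in events.items:
    print(f"{ev.last_timestamp} {ev.reason}: {ev.message}")
```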

Usually only a single node is affected while the others are fine, but deleting that node and creating a new one in its place only helps temporarily.

I have already disabled node auto-repair to see if that makes a difference (it was turned on before). If I recall correctly, this started after upgrading my cluster to Kubernetes 1.13 (specifically version 1.13.5-gke), and the issue has persisted after upgrading to 1.13.6-gke.0. Even creating a new node pool and migrating to it had no effect.
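
For completeness, this is roughly how auto-repair was disabled (a hedged sketch with the google-cloud-container Python client; project, location, cluster, and pool names are made up, and the Cloud Console does the same job):

```python
# Hedged sketch: turn off node auto-repair for one node pool via the GKE API.
# All resource names below are placeholders.
from google.cloud import container_v1

gke = container_v1.ClusterManagerClient()

gke.set_node_pool_management(
    request={
        # Format: projects/PROJECT/locations/LOCATION/clusters/CLUSTER/nodePools/POOL
        "name": "projects/my-project/locations/europe-west1-b/"
                "clusters/my-cluster/nodePools/default-pool",
        # Note: this sets the pool's whole management block, so include
        # auto_upgrade explicitly if it should stay enabled.
        "management": {"auto_repair": False, "auto_upgrade": False},
    }
)
```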

The cluster consists of four nodes with 1 CPU and 3 GB RAM each. I know that's small for a k8s cluster, but this has worked fine in the past.
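
Since the nodes are so small, one thing I've been double-checking is whether they are simply packed too tightly. A rough sketch that compares each node's allocatable resources with the requests of the pods scheduled on it (again using the Python client; it only tells you something if the pods actually set requests):

```python
# Sketch: show each node's allocatable resources and the CPU/memory requests of
# the pods scheduled on it, to spot nodes that are packed too tightly.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Group pods by the node they are scheduled on.
pods_by_node = defaultdict(list)
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.spec.node_name:
        pods_by_node[pod.spec.node_name].append(pod)

for node in v1.list_node().items:
    alloc = node.status.allocatable
    print(f"\n{node.metadata.name}: allocatable cpu={alloc['cpu']} memory={alloc['memory']}")
    for pod in pods_by_node[node.metadata.name]:
        for c in pod.spec.containers:
            req = (c.resources.requests or {}) if c.resources else {}
            print(f"  {pod.metadata.namespace}/{pod.metadata.name}/{c.name}: "
                  f"cpu={req.get('cpu', '-')} memory={req.get('memory', '-')}")
```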

I am using the new Stackdriver Kubernetes Monitoring as well as Istio on GKE.

Any pointers as to what could be the reason, or where to look for possible causes, would be appreciated.

Screenshots of the Node event list (happy to provide other logs; couldn't find anything meaningful in Stackdriver Logging yet):

[Screenshot: node event list]

-- MrMage
google-kubernetes-engine
istio
kubernetes

0 Answers