How to debug node health errors on GKE?

8/9/2019

I run a small GKE cluster with a couple of node pools (2-8 nodes in each, some preemptible). I am beginning to see a lot of health issues with the nodes themselves, and pod operations are taking a very long time (30+ minutes): terminating pods, starting pods, starting initContainers, starting main containers, and so on. Examples below. The cluster runs some NodeJS, PHP and Nginx containers, a single Elastic, Redis and NFS pod, and a few PHP-based CronJobs. Together they make up a website which sits behind a CDN.

  • My question is: How do I go about debugging this on GKE, and what could be causing it?

I've tried to SSH into the VM instances backing the nodes to check logs, but my SSH connection always times out; I'm not sure whether that is expected.
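
For reference, this is roughly what I'm attempting (the instance name is one of the nodes listed further down, and <zone> stands in for the pool's actual zone). I assume the serial console output would be another way to get at boot/kernel logs without a working SSH session, so I've included that too:

$ gcloud compute ssh gke-cluster-preemptible-0f81875-krqk --zone <zone>
$ gcloud compute instances get-serial-port-output gke-cluster-preemptible-0f81875-krqk --zone <zone>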

Symptom: Nodes flapping between Ready and NotReady:

$ kubectl get nodes
NAME                                    STATUS     ROLES    AGE     VERSION
gke-cluster-default-pool-4fa127c-l3xt   Ready      <none>   62d     v1.13.6-gke.13
gke-cluster-default-pool-791e6c2-7b01   NotReady   <none>   45d     v1.13.6-gke.13
gke-cluster-preemptible-0f81875-cc5q    Ready      <none>   3h40m   v1.13.6-gke.13
gke-cluster-preemptible-0f81875-krqk    NotReady   <none>   22h     v1.13.6-gke.13
gke-cluster-preemptible-0f81875-mb05    Ready      <none>   5h42m   v1.13.6-gke.13
gke-cluster-preemptible-2453785-1c4v    Ready      <none>   22h     v1.13.6-gke.13
gke-cluster-preemptible-2453785-nv9q    Ready      <none>   134m    v1.13.6-gke.13
gke-cluster-preemptible-2453785-s7r2    NotReady   <none>   22h     v1.13.6-gke.13
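
To see which condition is actually driving the NotReady status, the obvious next step (as far as I can tell) is describing the affected nodes and pulling their events, e.g. (the node name below is just one of the flapping ones; any of them works):

$ kubectl describe node gke-cluster-preemptible-0f81875-krqk
$ kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=gke-cluster-preemptible-0f81875-krqk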

Symptom: Nodes are sometimes rebooted:

2019-08-09 14:23:54.000 CEST
Node gke-cluster-preemptible-0f81875-mb05 has been rebooted, boot id: e601f182-2eab-46b0-a953-7787f95d438

Symptom: Cluster is unhealthy:

2019-08-09T11:29:03Z Cluster is unhealthy 
2019-08-09T11:33:25Z Cluster is unhealthy 
2019-08-09T11:41:08Z Cluster is unhealthy 
2019-08-09T11:45:10Z Cluster is unhealthy 
2019-08-09T11:49:11Z Cluster is unhealthy 
2019-08-09T11:53:23Z Cluster is unhealthy 

Symptom: Various PLEG health errors in Node logs (there are many, many, many entries of this type):

12:53:10.573176 1315163 kubelet.go:1854] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m26.30454685s ago; threshold is 3m0s] 
12:53:18.126428 1036 setters.go:520] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-08-09 12:53:18.126363615 +0000 UTC m=+3924434.187952856 LastTransitionTime:2019-08-09 12:53:18.126363615 +0000 UTC m=+3924434.187952856 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m5.837134315s ago; threshold is 3m0s}
12:53:38.627284 1036 kubelet.go:1854] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m26.338024015s ago; threshold is 3m0s]

Symptom: Pods are issuing 'Network not ready' errors:

2019-08-09T12:42:45Z network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized] 
2019-08-09T12:42:47Z network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized] 
2019-08-09T12:42:49Z network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized] 

Symptom: Pods complaining about "context deadline exceeded":

2019-08-09T08:04:07Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded 
2019-08-09T08:04:15Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded 
2019-08-09T08:04:20Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded 
2019-08-09T08:04:26Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded 

There is obviously something particularly odd going on, but with fairly trivial IOPS, ingress traffic and CPU/memory saturation, I would expect some symptom that pointed me in a direction where I could debug this further. Instead, these errors seem to be all over the place.
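
In case it helps narrow things down, one thing I can do is dump recent events across all namespaces, sorted by time (plain kubectl, nothing GKE-specific):

$ kubectl get events --all-namespaces --sort-by='.metadata.creationTimestamp'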

-- Achton
google-kubernetes-engine
kubernetes

1 Answer

8/9/2019

Given that GKE is a managed solution and there are many systems involved in its operation, I think it might be best for you to reach out to the GCP support team.

They have specific tools to locate issues on the nodes (if any) and can dig a bit deeper into logging to determine the root cause of this.

As of now, the logs you've shown may point to this old PLEG issue, apparently related to Docker, and also to the CNI not being ready, which prevents the nodes from reporting to the master and causes them to be marked NotReady.
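
As a quick sanity check in the meantime, you could also list the kube-system pods scheduled on one of the affected nodes and confirm whether the networking-related ones are actually Running (the node name here is taken from your output above; substitute any NotReady node):

$ kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=gke-cluster-preemptible-0f81875-krqk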

Please consider this mere speculation; the support team will be able to dig deeper and provide more accurate advice.

-- yyyyahir
Source: StackOverflow