I run a small GKE cluster with a couple of node pools (2-8 nodes in each, some preemptible). I am beginning to see a lot of health issues with the nodes themselves, and pod operations are taking a very long time (30+ minutes): terminating pods, starting pods, starting initContainers, starting main containers, and so on. Examples below. The cluster runs some Node.js, PHP and Nginx containers, plus a single Elastic, Redis and NFS pod each, and a few PHP-based CronJobs. Together they make up a website that sits behind a CDN.
I've tried to SSH into the VM instances backing the nodes to check the logs, but the SSH connection always times out; I'm not sure whether that is normal.
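As a sanity check, the serial console output of a node should still be reachable through the GCE API even when SSH is not, along the lines of the following (the zone is a placeholder, and the node name is just one of the NotReady nodes from the listing below):
$ gcloud compute instances get-serial-port-output gke-cluster-default-pool-791e6c2-7b01 --zone <node-zone>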
Symptom: Nodes flapping between Ready and NotReady:
$ kubectl get nodes
NAME                                    STATUS     ROLES    AGE     VERSION
gke-cluster-default-pool-4fa127c-l3xt   Ready      <none>   62d     v1.13.6-gke.13
gke-cluster-default-pool-791e6c2-7b01   NotReady   <none>   45d     v1.13.6-gke.13
gke-cluster-preemptible-0f81875-cc5q    Ready      <none>   3h40m   v1.13.6-gke.13
gke-cluster-preemptible-0f81875-krqk    NotReady   <none>   22h     v1.13.6-gke.13
gke-cluster-preemptible-0f81875-mb05    Ready      <none>   5h42m   v1.13.6-gke.13
gke-cluster-preemptible-2453785-1c4v    Ready      <none>   22h     v1.13.6-gke.13
gke-cluster-preemptible-2453785-nv9q    Ready      <none>   134m    v1.13.6-gke.13
gke-cluster-preemptible-2453785-s7r2    NotReady   <none>   22h     v1.13.6-gke.13
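The condition details for one of the NotReady nodes, including the kubelet's stated reason, can be pulled with something like:
$ kubectl describe node gke-cluster-default-pool-791e6c2-7b01
$ kubectl get node gke-cluster-default-pool-791e6c2-7b01 -o jsonpath='{.status.conditions}'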
Symptom: Nodes are sometimes rebooted:
2019-08-09 14:23:54.000 CEST
Node gke-cluster-preemptible-0f81875-mb05 has been rebooted, boot id: e601f182-2eab-46b0-a953-7787f95d438
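This particular node is in one of the preemptible pools, so the reboot might simply be a preemption (preemptible VMs are reclaimed at least every 24 hours). As I understand it, that can be confirmed with:
$ gcloud compute operations list --filter="operationType=compute.instances.preempted"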
Symptom: Cluster is unhealthy:
2019-08-09T11:29:03Z Cluster is unhealthy
2019-08-09T11:33:25Z Cluster is unhealthy
2019-08-09T11:41:08Z Cluster is unhealthy
2019-08-09T11:45:10Z Cluster is unhealthy
2019-08-09T11:49:11Z Cluster is unhealthy
2019-08-09T11:53:23Z Cluster is unhealthy
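The message doesn't say what exactly is unhealthy; the overall cluster and control-plane status can at least be cross-checked with, e.g.:
$ gcloud container clusters list
$ kubectl get componentstatuses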
Symptom: Various PLEG health errors in Node logs (there are many, many, many entries of this type):
12:53:10.573176 1315163 kubelet.go:1854] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m26.30454685s ago; threshold is 3m0s]
12:53:18.126428 1036 setters.go:520] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-08-09 12:53:18.126363615 +0000 UTC m=+3924434.187952856 LastTransitionTime:2019-08-09 12:53:18.126363615 +0000 UTC m=+3924434.187952856 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m5.837134315s ago; threshold is 3m0s}
12:53:38.627284 1036 kubelet.go:1854] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m26.338024015s ago; threshold is 3m0s]
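As I understand it, PLEG (the kubelet's Pod Lifecycle Event Generator) is reported unhealthy when the periodic relist against the container runtime takes longer than the 3-minute threshold, which suggests Docker itself is hanging. To rule out sheer pod/container count on an affected node:
$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=gke-cluster-preemptible-0f81875-krqk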
Symptom: Pods are issuing 'Network not ready' errors:
2019-08-09T12:42:45Z network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
2019-08-09T12:42:47Z network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
2019-08-09T12:42:49Z network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
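The node's network-related condition and the state of the networking DaemonSets can presumably be checked with:
$ kubectl get node gke-cluster-preemptible-0f81875-krqk -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")]}'
$ kubectl get daemonsets -n kube-system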
Symptom: Pods complaining about "context deadline exceeded":
2019-08-09T08:04:07Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2019-08-09T08:04:15Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2019-08-09T08:04:20Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2019-08-09T08:04:26Z error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded
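These look like gRPC timeouts from the container runtime. Whether GKE's node auto-repair is already cycling these nodes should be visible in the node events and the cluster operations, e.g.:
$ kubectl get events --all-namespaces --field-selector involvedObject.kind=Node --sort-by=.metadata.creationTimestamp
$ gcloud container operations list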
There is obviously something particularly odd going on, but with fairly trivial IOPS, ingress traffic and CPU/memory saturation, I would expect symptoms that point me in some direction where I could debug this further. Instead, these errors seem to be all over the place.
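For reference, node and pod saturation can be double-checked with:
$ kubectl top nodes
$ kubectl top pods --all-namespaces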
Given that GKE is a managed solution and there are many systems involved in its operation, I think it might be best for you to reach out to the GCP support team.
They have specific tools to locate issues on the nodes (if any) and can dig a bit deeper into logging to determine the root cause of this.
As of now, the logs you've shown may point to this old issue, apparently related to Docker, and also to the CNI not being ready, which prevents the nodes from reporting to the master and therefore marks them as NotReady.
Please consider this mere speculation, as the support team will be able to dig deeper and provide more accurate advice.
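In the meantime, and purely as a generic stopgap rather than a confirmed fix for your case, you could drain an affected node and delete its backing instance so that the node pool's managed instance group recreates it (the zone is a placeholder, and the node name is taken from your listing):
$ kubectl drain gke-cluster-default-pool-791e6c2-7b01 --ignore-daemonsets --delete-local-data
$ gcloud compute instances delete gke-cluster-default-pool-791e6c2-7b01 --zone <node-zone>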