Why do Kubernetes worker nodes become NodeNotReady?

2/20/2017

Worker nodes were unexpectedly dropped from the cluster by the master, for an unknown reason.

The cluster has the following setup:

  • AWS
  • Multi-az configured
  • Clustered masters (across AZs)
  • Flannel networking
  • Provisioned using CoreOS's kube-aws

An incident of unknown origin occurred in which, within a span of seconds, all worker nodes were dropped by the master. The only relevant log entry we could find was from kube-controller-manager:

I0217 14:19:11.432691 1 event.go:217] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-XX-XX-XX-XX.ec2.internal", UID:"XXX", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node ip-XX-XX-XX-XX.ec2.internal status is now: NodeNotReady
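
For context, the node controller marks a node NotReady when the kubelet's status heartbeat goes stale, so the first thing worth comparing against that event is the Ready condition's heartbeat and transition timestamps. A rough sketch of how to pull those out with the official Python kubernetes client (the kubeconfig loading and the event reason filter here are illustrative assumptions, not something taken from our setup):

# Sketch: inspect each node's Ready condition and recent NodeNotReady events.
# Assumes the official "kubernetes" Python client and a reachable kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when run in a pod
v1 = client.CoreV1Api()

# The node controller flips a node to NotReady when the kubelet's status
# heartbeat goes stale, so these timestamps are the first thing to check.
for node in v1.list_node().items:
    for cond in node.status.conditions:
        if cond.type == "Ready":
            print(node.metadata.name, cond.status,
                  "lastHeartbeat:", cond.last_heartbeat_time,
                  "lastTransition:", cond.last_transition_time)

# NodeNotReady events recorded by the controller manager, like the one above.
for ev in v1.list_event_for_all_namespaces(field_selector="reason=NodeNotReady").items:
    print(ev.last_timestamp, ev.involved_object.name, ev.message)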

The nodes returned to "ready" approximately 10 minutes later.

We have yet to locate the cause of why the node transitioned to NodeNotReady.

We have so far looked through the logs of various system components, including:

  • flannel
  • kubelet
  • etcd
  • controller-manager

One potentially noteworthy item is that the active master of the cluster currently resides in a different AZ from the nodes. This should be fine, but it could be a source of network connectivity problems. That said, we have seen no indication in our logs or monitoring of inter-AZ connection problems.
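
For what it's worth, the simplest way we can think of to rule inter-AZ connectivity in or out is a plain TCP probe from a worker node to the master's secure port, left running so a transient outage shows up. A minimal sketch (the master hostname and port 443 are placeholders for the actual endpoint):

# Sketch: crude periodic reachability probe from a worker node to the apiserver.
# MASTER_HOST and MASTER_PORT are placeholders for the actual master endpoint.
import socket
import time

MASTER_HOST = "master.internal.example"  # placeholder
MASTER_PORT = 443

while True:
    start = time.time()
    try:
        with socket.create_connection((MASTER_HOST, MASTER_PORT), timeout=5):
            print("ok", round(time.time() - start, 3), "s")
    except OSError as exc:
        # a "no route to host" like the one the kubelet reported would surface here
        print("FAIL", exc)
    time.sleep(10)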

Checking the kubelet logs, there was no clear log event of the nodes changing their state to "not ready" or otherwise, and no clear indication of any fatal events either.

One item that could be noteworthy is that all kubelets logged the following after the outage:

Error updating node status, will retry: error getting node "ip-XX-XX-XX-XX.ec2.internal": Get https://master/api/v1/nodes?fieldSelector=metadata.name%3Dip-XX-XX-XX-XX.ec2.internal&resourceVersion=0: read tcp 10.X.X.X:52534->10.Y.Y.Y:443: read: no route to host.

Again, please note that these log messages were logged after the nodes had rejoined the cluster (there was a clear ~10 minute window between the cluster collapse and the nodes rejoining).
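
To double-check the apiserver side, the same lookup the kubelet was failing on can be issued by hand; a rough equivalent using the Python client is below (the node name is a placeholder, and resource_version="0" just mirrors the kubelet's query):

# Sketch: issue the same node lookup the kubelet performs when updating its
# status. NODE_NAME is a placeholder for the worker's registered node name.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NODE_NAME = "ip-XX-XX-XX-XX.ec2.internal"  # placeholder

nodes = v1.list_node(field_selector="metadata.name=" + NODE_NAME,
                     resource_version="0")
if not nodes.items:
    print("node", NODE_NAME, "not found")
else:
    ready = next(c for c in nodes.items[0].status.conditions if c.type == "Ready")
    print(NODE_NAME, "Ready:", ready.status, "heartbeat:", ready.last_heartbeat_time)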

-- Chris Willmore
amazon-web-services
coreos
kubernetes

0 Answers