Minions can't rejoin cluster on reboot of AWS instance

8/10/2016

The Kubernetes cluster (v1.3.4) starts a master and 2 minions.

The cluster starts fine, and pods can be started and controlled without issue.

As soon as one of the minions is rebooted, or any of the dependent services such as the kubelet is restarted, the minion will not rejoin the cluster.

The error from the kubelet service is of the form:

Aug 08 08:21:15 ip-10-16-1-20 kubelet[911]: E0808 08:21:15.955309     911 kubelet.go:2875] Error updating node status, will retry: error getting node "ip-10-16-1-20.us-west-2.compute.internal": nodes "ip-10-16-1-20.us-west-2.compute.internal" not found
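
A quick way to confirm this from the master, assuming kubectl is configured there (node name taken from the log above), is to check whether the Node object still exists while the kubelet keeps retrying:

kubectl get nodes
kubectl describe node ip-10-16-1-20.us-west-2.compute.internal
journalctl -u kubelet -f    # run on the affected minion to watch the retry loop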

The only way that we can see to rectify this issue at the moment is to tear down the whole cluster and rebuild it.

UPDATE: I had a look at the controller manager log and saw the following:

W0815 13:36:39.087991       1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
W0815 13:37:39.123811       1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
E0815 13:37:39.133045       1 nodecontroller.go:434] pods "kube-proxy-ip-10-16-1-25.us-west-2.compute.internal" not found
-- Kevin Taylor
kubernetes

1 Answer

8/25/2016

This is actually a CoreOS issue, although it is difficult to ascertain what the problem actually is. It is more than likely the low-level OS host-resolution code being called from the AWS Go layers, but that is purely a guess. Upgrading the CoreOS AMI to a later version solved many of the issues we were facing.
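
If relaunching the minions from a newer AMI is not immediately possible, a rough in-place alternative, assuming a standard CoreOS Container Linux image with update_engine running, is to bump the release on each node and reboot:

cat /etc/os-release                  # current Container Linux version
sudo update_engine_client -update    # fetch and apply the latest release for the configured channel
sudo reboot

Since the guess above points at host resolution, it is also worth comparing the name AWS reports for the instance with what is registered in the API server:

curl -s http://169.254.169.254/latest/meta-data/local-hostname
kubectl get nodes -o name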

-- Kevin Taylor
Source: StackOverflow