The kubernetes cluster using v1.3.4 starts a master and 2 minions
The cluster starts fine and pods can be started and controlled without issue
As soon as one of the minions is rebooted, or any of the dependent services, such as kubelet is restarted, the minions will not rejoin the cluster
The error from the kubelet service is of the form:
Aug 08 08:21:15 ip-10-16-1-20 kubelet[911]: E0808 08:21:15.955309 911 kubelet.go:2875] Error updating node status, will retry: error getting node "ip-10-16-1-20.us-west-2.compute.internal": nodes "ip-10-16-1-20.us-west-2.compute.internal" not found
The only way, that we can see to rectify this issue at the moment is to tear down the whole cluster and rebuild it
UPDATE: I had a look at the controller manager log and got the following
W0815 13:36:39.087991 1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
W0815 13:37:39.123811 1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
E0815 13:37:39.133045 1 nodecontroller.go:434] pods "kube-proxy-ip-10-16-1-25.us-west-2.compute.internal" not found
This is actually a coreos issue, although it is difficult to ascertain what the problem actually is. It is more than likely the low level os host resolution code being called from the aws go layers, but that is purely a guess. By upgrading the coreos ami to a later version solved many of the issues we were facing.