Kubernetes node fails (CoreOS/AWS/Kubernetes stack)

7/22/2016

We have a small testing Kubernetes cluster running on AWS, using CoreOS, as per the instructions here. Currently this consists of only a master and a worker node. In the past couple of weeks we've been running this cluster we've noticed that the worker instance occasionally fails. The first time this happened the instance was subsequently killed and restarted by the auto-scaling group it is in. Today the same thing happened, but we were able to login to the instance before it was shut down and retrieve some information, but it remains unclear to me exactly what has caused this problem.

The node failure seems to happen on an irregular basis, and there is no evidence that there is anything abnormal happening which would precipitate this (external load etc).

Subsquent to the failure (kubernetes node status Not Ready) the instance was still running, but had inactive kubelet and docker services (start failed with result 'dependency'). The flanneld service was running, but with a restart time after the time the node failure was seen.

Logs from around the time of the node failure don't seem to show anything clearly pointing to a cause of the failure. There's a couple of kubelet-wrapper errors at about the time the failure was seen:

`Jul 22 07:25:33 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:33.121506    1204 kubelet.go:2745] Error updating node status, will retry: nodes "ip-10-0-0-92.ec2.internal" cannot be updated: the object has been modified; please apply your changes to the latest version and try again`

`Jul 22 07:25:34 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:34.557047    1204 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"ip-10-0-0-92.ec2.internal.1462693ef85b56d8", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"4687622", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-0-92.ec2.internal", UID:"ip-10-0-0-92.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientDisk", Message:"Node ip-10-0-0-92.ec2.internal status is now: NodeHasSufficientDisk", Source:api.EventSource{Component:"kubelet", Host:"ip-10-0-0-92.ec2.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63604448947, nsec:0, loc:(*time.Location)(0x3b1a5c0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63604769134, nsec:388015022, loc:(*time.Location)(0x3b1a5c0)}}, Count:2, Type:"Normal"}': 'events "ip-10-0-0-92.ec2.internal.1462693ef85b56d8" not found' (will not retry!)
Jul 22 07:25:34 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:34.560636    1204 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"ip-10-0-0-92.ec2.internal.14626941554cc358", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"4687645", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-0-92.ec2.internal", UID:"ip-10-0-0-92.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeReady", Message:"Node ip-10-0-0-92.ec2.internal status is now: NodeReady", Source:api.EventSource{Component:"kubelet", Host:"ip-10-0-0-92.ec2.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63604448957, nsec:0, loc:(*time.Location)(0x3b1a5c0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63604769134, nsec:388022975, loc:(*time.Location)(0x3b1a5c0)}}, Count:2, Type:"Normal"}': 'events "ip-10-0-0-92.ec2.internal.14626941554cc358" not found' (will not retry!)`

followed by what looks like some etcd errors:

`Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [WARNING][1305/140149086452400] calico.etcddriver.driver 810: etcd watch returned bad HTTP status topoll on index 5237916: 400
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [ERROR][1305/140149086452400] calico.etcddriver.driver 852: Error from etcd for index 5237916: {u'errorCode': 401, u'index': 5239005, u'message': u'The event in requested index is outdated and cleared', u'cause': u'the requested history has been cleared [5238006/5237916]'}; triggering a resync.
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 916: STAT: Final watcher etcd response time: 0 in 630.6s (0.000/s) min=0.000ms mean=0.000ms max=0.000ms
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 916: STAT: Final watcher processing time: 7 in 630.6s (0.011/s) min=90066.312ms mean=90078.569ms max=90092.505ms
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 919: Watcher thread finished. Signalled to resync thread. Was at index 5237916.  Queue length is 1.
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,743 [WARNING][1305/140149192694448] calico.etcddriver.driver 291: Watcher died; resyncing.`

and a few minutes later a large number of failed connections to the master (10.0.0.50):

`Jul 22 07:36:41 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:36:37,641 [WARNING][1305/140149086452400] urllib3.connectionpool 647: Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7700b85b90>: Failed to establish a new connection: [Errno 113] Host is unreachable',)': http://10.0.0.50:2379/v2/keys/calico/v1?waitIndex=5239006&recursive=true&wait=true
Jul 22 07:36:41 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:36:37,641 [INFO][1305/140149086452400] urllib3.connectionpool 213: Starting new HTTP connection (2): 10.0.0.50`

Although these errors are presumably related to the node/instance failure, these don't really mean a lot to me, and certainly don't seem to suggest the underlying cause - but if anyone can see anything here that would suggest a possible cause of the node/instance failure (and how we can go about rectifying this) that would be greatly appreciated!

-- Michael
amazon-web-services
coreos
kubernetes

1 Answer

4/13/2017

Something in your description and log confuse me, you said that you use docker runtime which there is rkt in your log; you said that you use flannel in your cluster which there is calico in your log...

Anyway, from the log you provide, it's more like your etcd is down... which makes kubelet and calico can't update their state, and apiserver will regard they are down. There is not enough information here, I could only suggest that you need to backup etcd's log next time you see this...

Another suggestion is that better not use the same etcd for both kubenetes cluster and calico...

-- Crazykev
Source: StackOverflow