K8s node fails into 'Not Ready' state

6/8/2018

For the past two days I've noticed a node failing and going into the NotReady state.

I have a 3-node k8s cluster.

dc2-k8s-01   Ready     master,node   2d        v1.10.2
dc2-k8s-02   Ready     master,node   2d        v1.10.2
dc2-k8s-03   Ready     node          2d        v1.10.2

The dc2-k8s-02 node is the one that has been failing.

When the node fails, I see many pods in the Evicted and Unknown states. Can anybody help me find the reason why this has been happening? Below I'm attaching some syslog output from the failed node. I would add more logs, but my reputation is too low.

kubectl get pod --all-namespaces
NAMESPACE       NAME                                    READY     STATUS     RESTARTS   AGE
client-ci         rabbitmq-client-ci-01-0                  1/1       Unknown    0          18h
client-dev       rabbitmq-client-dev-01-0                 1/1       Unknown    4          18h
client-sandbox   rabbitmq-client-sandbox-01-0             1/1       Unknown    4          18h
client-test      rabbitmq-client-test-01-0                1/1       Unknown    5          18h
default         es-master-6c6d546445-2hn2h              0/1       Evicted    0          7h
default         es-master-6c6d546445-55pjj              1/1       Running    0          18h
default         es-master-6c6d546445-7cv66              0/1       Evicted    0          18h
default         es-master-6c6d546445-b4hr2              0/1       Evicted    0          7h
default         es-master-6c6d546445-dk765              1/1       Running    0          18h
default         es-master-6c6d546445-dljwk              0/1       Evicted    0          7h
default         es-master-6c6d546445-dmks2              0/1       Evicted    0          7h
default         es-master-6c6d546445-gftj9              1/1       Unknown    0          18h
default         es-master-6c6d546445-gpx2t              0/1       Evicted    0          18h
default         es-master-6c6d546445-hqd65              0/1       Evicted    0          18h
default         es-master-6c6d546445-j4m7c              0/1       Evicted    0          7h
default         es-master-6c6d546445-jtg6m              0/1       Evicted    0          18h
default         es-master-6c6d546445-jthb5              0/1       Evicted    0          7h
default         es-master-6c6d546445-jz5h9              0/1       Evicted    0          7h
default         es-master-6c6d546445-k22tq              0/1       Evicted    0          18h
default         es-master-6c6d546445-k59d9              0/1       Evicted    0          18h
default         es-master-6c6d546445-njfq6              0/1       Evicted    0          18h
default         es-master-6c6d546445-njr2r              0/1       Evicted    0          7h
default         es-master-6c6d546445-p528v              0/1       Evicted    0          7h
default         es-master-6c6d546445-p8t9w              0/1       Evicted    0          7h
default         es-master-6c6d546445-qg56b              0/1       Evicted    0          7h
default         es-master-6c6d546445-r7rmv              0/1       Evicted    0          7h
default         es-master-6c6d546445-twf76              0/1       Evicted    0          7h
default         es-master-6c6d546445-v7cp7              0/1       Evicted    0          7h
default         es-master-6c6d546445-x6sk7              0/1       Evicted    0          7h
default         es-master-6c6d546445-x7n6k              0/1       Evicted    0          7h
default         es-master-6c6d546445-xqtm7              0/1       Evicted    0          7h
default         es-master-6c6d546445-z4mwr              1/1       Running    0          7h
kube-system     calico-node-2hg88                       1/1       Running    0          7h
kube-system     calico-node-48cm9                       1/1       NodeLost   1          2d
kube-system     calico-node-j8hcj                       1/1       Running    2          2d
kube-system     kube-apiserver-dc2-k8s-01               1/1       Running    2          2d
kube-system     kube-apiserver-dc2-k8s-02               1/1       Unknown    0          18h
kube-system     kube-controller-manager-dc2-k8s-01      1/1       Running    2          2d
kube-system     kube-controller-manager-dc2-k8s-02      1/1       Unknown    1          18h
kube-system     kube-dns-7bd4d5fbb6-bhfh4               3/3       Running    3          2d
kube-system     kube-dns-7bd4d5fbb6-fz6xd               3/3       Running    3          2d
kube-system     kube-proxy-dc2-k8s-01                   1/1       Running    2          2d
kube-system     kube-proxy-dc2-k8s-02                   1/1       Unknown    1          18h
kube-system     kube-proxy-dc2-k8s-03                   1/1       Running    1          2d
kube-system     kube-scheduler-dc2-k8s-01               1/1       Running    2          2d
kube-system     kube-scheduler-dc2-k8s-02               1/1       Unknown    1          18h
kube-system     kubedns-autoscaler-679b8b455-ltkrd      1/1       Running    0          19h
kube-system     kubernetes-dashboard-55fdfd74b4-57lrn   1/1       Running    0          19h
kube-system     nginx-proxy-dc2-k8s-03                  1/1       Running    1          2d
kube-system     tiller-deploy-5c688d5f9b-fwnfc          1/1       Running    0          1d

Jun  8 01:31:08 dc2-k8s-02 kubelet[12922]: I0608 01:31:08.501110   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:31:18 dc2-k8s-02 kubelet[12922]: I0608 01:31:18.513356   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:31:19 dc2-k8s-02 systemd[1]: Starting Cleanup of Temporary Directories...
Jun  8 01:31:19 dc2-k8s-02 systemd-tmpfiles[55561]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Jun  8 01:31:19 dc2-k8s-02 systemd[1]: Started Cleanup of Temporary Directories.
Jun  8 01:31:28 dc2-k8s-02 kubelet[12922]: I0608 01:31:28.522696   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:31:33 dc2-k8s-02 kubelet[12922]: I0608 01:31:33.753754   12922 container_manager_linux.go:427] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jun  8 01:31:38 dc2-k8s-02 kubelet[12922]: I0608 01:31:38.536068   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:31:48 dc2-k8s-02 kubelet[12922]: I0608 01:31:48.546811   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:31:58 dc2-k8s-02 kubelet[12922]: I0608 01:31:58.565057   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:32:08 dc2-k8s-02 kubelet[12922]: I0608 01:32:08.592797   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:32:18 dc2-k8s-02 kubelet[12922]: I0608 01:32:18.603399   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:32:28 dc2-k8s-02 kubelet[12922]: I0608 01:32:28.613465   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:32:38 dc2-k8s-02 kubelet[12922]: I0608 01:32:38.623522   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:32:48 dc2-k8s-02 etcd[1844]: 2018-06-07 23:32:48.285141 I | wal: segmented wal file /var/lib/etcd/member/wal/0000000000000006-000000000005a524.wal is created
Jun  8 01:32:48 dc2-k8s-02 kubelet[12922]: I0608 01:32:48.632236   12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: W0608 01:32:53.296114   12922 eviction_manager.go:343] eviction manager: attempting to reclaim imagefs
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.296145   12922 container_gc.go:85] attempting to delete unused containers
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.303862   12922 image_gc_manager.go:317] attempting to delete unused images
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.311032   12922 image_gc_manager.go:371] [imageGCManager]: Removing image "sha256:1b3607c7eda67632301ee13b353b416a91e62ae72df7ac617e84c8a586915772" to free 37293884 bytes
Jun  8 01:32:53 dc2-k8s-02 dockerd[1443]: time="2018-06-08T01:32:53.312129151+02:00" level=error msg="Handler for DELETE /v1.27/images/sha256:1b3607c7eda67632301ee13b353b416a91e62ae72df7ac617e84c8a586915772 returned error: conflict: unable to delete 1b3607c7eda6 (cannot be forced) - image is being used by running container 45a067628306"
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.312300   12922 remote_image.go:130] RemoveImage "sha256:1b3607c7eda67632301ee13b353b416a91e62ae72df7ac617e84c8a586915772" from image service failed: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 1b3607c7eda6 (cannot be forced) - image is being used by running container 45a067628306
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.312332   12922 kuberuntime_image.go:122] Remove image "sha256:1b3607c7eda67632301ee13b353b416a91e62ae72df7ac617e84c8a586915772" failed: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 1b3607c7eda6 (cannot be forced) - image is being used by running container 45a067628306
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: W0608 01:32:53.312347   12922 eviction_manager.go:445] eviction manager: unexpected error when attempting to reduce imagefs pressure: wanted to free 9223372036854775807 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 1b3607c7eda6 (cannot be forced) - image is being used by running container 45a067628306
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.319390   12922 eviction_manager.go:357] eviction manager: must evict pod(s) to reclaim imagefs
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.319705   12922 eviction_manager.go:375] eviction manager: pods ranked for eviction: kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7), es-master-6c6d546445-gftj9_default(f9fc1fd8-6a55-11e8-9eb0-005056011c88), calico-node-48cm9_kube-system(96896bf6-690e-11e8-bfc3-005056011c89), rabbitmq-client-test-01-0_client-test(771cf50c-6a4f-11e8-9eb0-005056011c88), rabbitmq-client-ci-01-0_client-ci(0289b86d-6a53-11e8-9eb0-005056011c88), rabbitmq-client-sandbox-01-0_client-sandbox(77d424e2-6a4f-11e8-9eb0-005056011c88), rabbitmq-client-dev-01-0_client-dev(76912724-6a4f-11e8-9eb0-005056011c88), kube-controller-manager-dc2-k8s-02_kube-system(2a11d956c9572dd61b88505193e7645b), kube-scheduler-dc2-k8s-02_kube-system(ecf642f13297d7d4ed8452208442bdf1), kube-proxy-dc2-k8s-02_kube-system(df3867f1089ebd71c0ffd15708a31f0a)
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.319821   12922 kuberuntime_container.go:547] Killing container "docker://488f9e9f3a84d66be96dd324f73eb84445e0086f1a0ee54c0885e76e5d87e931" with 30 second grace period
Jun  8 01:32:53 dc2-k8s-02 dockerd[1443]: time="2018-06-08T01:32:53.320576513+02:00" level=info msg="Container 488f9e9f3a84d66be96dd324f73eb84445e0086f1a0ee54c0885e76e5d87e931 failed to exit within 0 seconds of signal 15 - using the force"
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.319821   12922 kuberuntime_container.go:547] Killing container "docker://488f9e9f3a84d66be96dd324f73eb84445e0086f1a0ee54c0885e76e5d87e931" with 30 second grace period
Jun  8 01:32:53 dc2-k8s-02 dockerd[1443]: time="2018-06-08T01:32:53.320576513+02:00" level=info msg="Container 488f9e9f3a84d66be96dd324f73eb84445e0086f1a0ee54c0885e76e5d87e931 failed to exit within 0 seconds of signal 15 - using the force"
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.368565   12922 kubelet_pods.go:1121] Killing unwanted pod "kube-apiserver-dc2-k8s-02"
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.371237   12922 kuberuntime_container.go:547] Killing container "docker://488f9e9f3a84d66be96dd324f73eb84445e0086f1a0ee54c0885e76e5d87e931" with 30 second grace period
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: W0608 01:32:53.385514   12922 status_manager.go:461] Failed to get status for pod "kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7)": Get https://127.0.0.1:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-dc2-k8s-02: unexpected EOF
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.385546   12922 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.386504   12922 reflector.go:322] k8s.io/kubernetes/pkg/kubelet/kubelet.go:460: Failed to watch *v1.Node: Get https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3Ddc2-k8s-02&resourceVersion=290288&timeoutSeconds=507&watch=true: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.386540   12922 event.go:209] Unable to write event: 'Post https://127.0.0.1:6443/api/v1/namespaces/kube-system/events: unexpected EOF; some request body already written' (may retry after sleeping)
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.386553   12922 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.386681   12922 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.386920   12922 reflector.go:322] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to watch *v1.Pod: Get https://127.0.0.1:6443/api/v1/pods?fieldSelector=spec.nodeName%3Ddc2-k8s-02&resourceVersion=234367&timeoutSeconds=471&watch=true: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.386955   12922 reflector.go:322] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to watch *v1.Service: Get https://127.0.0.1:6443/api/v1/services?resourceVersion=229986&timeoutSeconds=586&watch=true: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.426374   12922 kuberuntime_container.go:65] Can't make a ref to pod "kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7)", container kube-apiserver: selfLink was empty, can't make reference
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.521670   12922 eviction_manager.go:157] eviction manager: pods kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7) evicted, waiting for pod to be cleaned up
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.818261   12922 kubelet.go:1901] SyncLoop (PLEG): "kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7)", event: &pleg.PodLifecycleEvent{ID:"724c3c291ba23d6b4714f3172ed25cd7", Type:"ContainerDied", Data:"488f9e9f3a84d66be96dd324f73eb84445e0086f1a0ee54c0885e76e5d87e931"}
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.818336   12922 kubelet.go:1901] SyncLoop (PLEG): "kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7)", event: &pleg.PodLifecycleEvent{ID:"724c3c291ba23d6b4714f3172ed25cd7", Type:"ContainerDied", Data:"5b03e2687193cd797045fbb212d32373e1017eac2d8a984aedad243da1f4012c"}
Jun  8 01:32:53 dc2-k8s-02 kubelet[12922]: W0608 01:32:53.818365   12922 pod_container_deletor.go:77] Container "5b03e2687193cd797045fbb212d32373e1017eac2d8a984aedad243da1f4012c" not found in pod's containers
Jun  8 01:32:54 dc2-k8s-02 kubelet[12922]: E0608 01:32:54.386844   12922 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:460: Failed to list *v1.Node: Get https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3Ddc2-k8s-02&limit=500&resourceVersion=0: dial tcp 
-- kris
kubernetes

1 Answer

6/8/2018

Can anybody help me find the reason why this has been happening?

You definitely have a failing node. There can be several reasons; based on your question, here are some of the most common causes we've encountered so far:

  • CPU/memory/disk pressure on the node, so the kubelet's eviction manager starts evicting pods (see the checks sketched after this list)
  • Swap file issues (I assume swap is disabled, as the kubelet requires): the node freezes when overloaded and then reboots
  • A network issue that makes the node unreachable for a certain period of time
  • Misconfiguration of the HA cluster / taints (you have two masters)
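
A quick way to check the first two points; this is only a minimal sketch, assuming default tooling (the jsonpath query below is just one way to read the node conditions):

# Node conditions (MemoryPressure, DiskPressure, Ready, ...) as last reported by the kubelet
kubectl get node dc2-k8s-02 -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# Recent events that mention the node (evictions and NodeNotReady show up here)
kubectl get events --all-namespaces | grep dc2-k8s-02

# On dc2-k8s-02 itself: swap should be off, since the kubelet refuses to run with it by default (--fail-swap-on)
swapon --show
free -m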

To be more specific than this, you would need to provide more information about your system, such as taints, node sizes, and the output of kubectl describe node dc2-k8s-02. The scheduler logs may also shed more light on your issue: kubectl -n kube-system logs kube-scheduler-dc2-k8s-02 (and kubectl -n kube-system logs kube-scheduler-dc2-k8s-01 as well).
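
For what it's worth, the syslog you posted already points at disk pressure on the image filesystem ("eviction manager: attempting to reclaim imagefs", right before kube-apiserver-dc2-k8s-02 is evicted), so checking disk usage on that node would be my first step. A rough sketch of what to run on dc2-k8s-02 itself, assuming a default Docker setup under /var/lib/docker:

# Kubelet's own view of the evictions around the failure time
journalctl -u kubelet --since "2018-06-08 01:30" | grep -iE 'evict|pressure'

# Disk usage of the filesystem backing images and containers
df -h /var/lib/docker
docker system df

If that filesystem really is close to full, freeing space (or adjusting the kubelet's eviction thresholds) should stop the eviction storms that take the node to NotReady.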

-- Const
Source: StackOverflow