For the past two days I've noticed a node failing and going into the NotReady state.
I have a 3-node Kubernetes cluster:
NAME STATUS ROLES AGE VERSION
dc2-k8s-01 Ready master,node 2d v1.10.2
dc2-k8s-02 Ready master,node 2d v1.10.2
dc2-k8s-03 Ready node 2d v1.10.2
The dc2-k8s-02 node is the one that has been failing.
At the time the node fails I see many pods go into the Evicted and Unknown states. Below I'm attaching the pod list and some syslog from the failed node. I would add more logs, but my reputation is too low.
kubectl get pod --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
client-ci rabbitmq-client-ci-01-0 1/1 Unknown 0 18h
client-dev rabbitmq-client-dev-01-0 1/1 Unknown 4 18h
client-sandbox rabbitmq-client-sandbox-01-0 1/1 Unknown 4 18h
client-test rabbitmq-client-test-01-0 1/1 Unknown 5 18h
default es-master-6c6d546445-2hn2h 0/1 Evicted 0 7h
default es-master-6c6d546445-55pjj 1/1 Running 0 18h
default es-master-6c6d546445-7cv66 0/1 Evicted 0 18h
default es-master-6c6d546445-b4hr2 0/1 Evicted 0 7h
default es-master-6c6d546445-dk765 1/1 Running 0 18h
default es-master-6c6d546445-dljwk 0/1 Evicted 0 7h
default es-master-6c6d546445-dmks2 0/1 Evicted 0 7h
default es-master-6c6d546445-gftj9 1/1 Unknown 0 18h
default es-master-6c6d546445-gpx2t 0/1 Evicted 0 18h
default es-master-6c6d546445-hqd65 0/1 Evicted 0 18h
default es-master-6c6d546445-j4m7c 0/1 Evicted 0 7h
default es-master-6c6d546445-jtg6m 0/1 Evicted 0 18h
default es-master-6c6d546445-jthb5 0/1 Evicted 0 7h
default es-master-6c6d546445-jz5h9 0/1 Evicted 0 7h
default es-master-6c6d546445-k22tq 0/1 Evicted 0 18h
default es-master-6c6d546445-k59d9 0/1 Evicted 0 18h
default es-master-6c6d546445-njfq6 0/1 Evicted 0 18h
default es-master-6c6d546445-njr2r 0/1 Evicted 0 7h
default es-master-6c6d546445-p528v 0/1 Evicted 0 7h
default es-master-6c6d546445-p8t9w 0/1 Evicted 0 7h
default es-master-6c6d546445-qg56b 0/1 Evicted 0 7h
default es-master-6c6d546445-r7rmv 0/1 Evicted 0 7h
default es-master-6c6d546445-twf76 0/1 Evicted 0 7h
default es-master-6c6d546445-v7cp7 0/1 Evicted 0 7h
default es-master-6c6d546445-x6sk7 0/1 Evicted 0 7h
default es-master-6c6d546445-x7n6k 0/1 Evicted 0 7h
default es-master-6c6d546445-xqtm7 0/1 Evicted 0 7h
default es-master-6c6d546445-z4mwr 1/1 Running 0 7h
kube-system calico-node-2hg88 1/1 Running 0 7h
kube-system calico-node-48cm9 1/1 NodeLost 1 2d
kube-system calico-node-j8hcj 1/1 Running 2 2d
kube-system kube-apiserver-dc2-k8s-01 1/1 Running 2 2d
kube-system kube-apiserver-dc2-k8s-02 1/1 Unknown 0 18h
kube-system kube-controller-manager-dc2-k8s-01 1/1 Running 2 2d
kube-system kube-controller-manager-dc2-k8s-02 1/1 Unknown 1 18h
kube-system kube-dns-7bd4d5fbb6-bhfh4 3/3 Running 3 2d
kube-system kube-dns-7bd4d5fbb6-fz6xd 3/3 Running 3 2d
kube-system kube-proxy-dc2-k8s-01 1/1 Running 2 2d
kube-system kube-proxy-dc2-k8s-02 1/1 Unknown 1 18h
kube-system kube-proxy-dc2-k8s-03 1/1 Running 1 2d
kube-system kube-scheduler-dc2-k8s-01 1/1 Running 2 2d
kube-system kube-scheduler-dc2-k8s-02 1/1 Unknown 1 18h
kube-system kubedns-autoscaler-679b8b455-ltkrd 1/1 Running 0 19h
kube-system kubernetes-dashboard-55fdfd74b4-57lrn 1/1 Running 0 19h
kube-system nginx-proxy-dc2-k8s-03 1/1 Running 1 2d
kube-system tiller-deploy-5c688d5f9b-fwnfc 1/1 Running 0 1d
Jun 8 01:31:08 dc2-k8s-02 kubelet[12922]: I0608 01:31:08.501110 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:31:18 dc2-k8s-02 kubelet[12922]: I0608 01:31:18.513356 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:31:19 dc2-k8s-02 systemd[1]: Starting Cleanup of Temporary Directories...
Jun 8 01:31:19 dc2-k8s-02 systemd-tmpfiles[55561]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Jun 8 01:31:19 dc2-k8s-02 systemd[1]: Started Cleanup of Temporary Directories.
Jun 8 01:31:28 dc2-k8s-02 kubelet[12922]: I0608 01:31:28.522696 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:31:33 dc2-k8s-02 kubelet[12922]: I0608 01:31:33.753754 12922 container_manager_linux.go:427] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jun 8 01:31:38 dc2-k8s-02 kubelet[12922]: I0608 01:31:38.536068 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:31:48 dc2-k8s-02 kubelet[12922]: I0608 01:31:48.546811 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:31:58 dc2-k8s-02 kubelet[12922]: I0608 01:31:58.565057 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:32:08 dc2-k8s-02 kubelet[12922]: I0608 01:32:08.592797 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:32:18 dc2-k8s-02 kubelet[12922]: I0608 01:32:18.603399 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:32:28 dc2-k8s-02 kubelet[12922]: I0608 01:32:28.613465 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:32:38 dc2-k8s-02 kubelet[12922]: I0608 01:32:38.623522 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:32:48 dc2-k8s-02 etcd[1844]: 2018-06-07 23:32:48.285141 I | wal: segmented wal file /var/lib/etcd/member/wal/0000000000000006-000000000005a524.wal is created
Jun 8 01:32:48 dc2-k8s-02 kubelet[12922]: I0608 01:32:48.632236 12922 kubelet_node_status.go:434] Using node IP: "10.19.10.81"
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: W0608 01:32:53.296114 12922 eviction_manager.go:343] eviction manager: attempting to reclaim imagefs
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.296145 12922 container_gc.go:85] attempting to delete unused containers
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.303862 12922 image_gc_manager.go:317] attempting to delete unused images
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.311032 12922 image_gc_manager.go:371] [imageGCManager]: Removing image "sha256:1b3607c7eda67632301ee13b353b416a91e62ae72df7ac617e84c8a586915772" to free 37293884 bytes
Jun 8 01:32:53 dc2-k8s-02 dockerd[1443]: time="2018-06-08T01:32:53.312129151+02:00" level=error msg="Handler for DELETE /v1.27/images/sha256:1b3607c7eda67632301ee13b353b416a91e62ae72df7ac617e84c8a586915772 returned error: conflict: unable to delete 1b3607c7eda6 (cannot be forced) - image is being used by running container 45a067628306"
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.312300 12922 remote_image.go:130] RemoveImage "sha256:1b3607c7eda67632301ee13b353b416a91e62ae72df7ac617e84c8a586915772" from image service failed: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 1b3607c7eda6 (cannot be forced) - image is being used by running container 45a067628306
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.312332 12922 kuberuntime_image.go:122] Remove image "sha256:1b3607c7eda67632301ee13b353b416a91e62ae72df7ac617e84c8a586915772" failed: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 1b3607c7eda6 (cannot be forced) - image is being used by running container 45a067628306
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: W0608 01:32:53.312347 12922 eviction_manager.go:445] eviction manager: unexpected error when attempting to reduce imagefs pressure: wanted to free 9223372036854775807 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 1b3607c7eda6 (cannot be forced) - image is being used by running container 45a067628306
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.319390 12922 eviction_manager.go:357] eviction manager: must evict pod(s) to reclaim imagefs
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.319705 12922 eviction_manager.go:375] eviction manager: pods ranked for eviction: kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7), es-master-6c6d546445-gftj9_default(f9fc1fd8-6a55-11e8-9eb0-005056011c88), calico-node-48cm9_kube-system(96896bf6-690e-11e8-bfc3-005056011c89), rabbitmq-client-test-01-0_client-test(771cf50c-6a4f-11e8-9eb0-005056011c88), rabbitmq-client-ci-01-0_client-ci(0289b86d-6a53-11e8-9eb0-005056011c88), rabbitmq-client-sandbox-01-0_client-sandbox(77d424e2-6a4f-11e8-9eb0-005056011c88), rabbitmq-client-dev-01-0_client-dev(76912724-6a4f-11e8-9eb0-005056011c88), kube-controller-manager-dc2-k8s-02_kube-system(2a11d956c9572dd61b88505193e7645b), kube-scheduler-dc2-k8s-02_kube-system(ecf642f13297d7d4ed8452208442bdf1), kube-proxy-dc2-k8s-02_kube-system(df3867f1089ebd71c0ffd15708a31f0a)
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.319821 12922 kuberuntime_container.go:547] Killing container "docker://488f9e9f3a84d66be96dd324f73eb84445e0086f1a0ee54c0885e76e5d87e931" with 30 second grace period
Jun 8 01:32:53 dc2-k8s-02 dockerd[1443]: time="2018-06-08T01:32:53.320576513+02:00" level=info msg="Container 488f9e9f3a84d66be96dd324f73eb84445e0086f1a0ee54c0885e76e5d87e931 failed to exit within 0 seconds of signal 15 - using the force"
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.368565 12922 kubelet_pods.go:1121] Killing unwanted pod "kube-apiserver-dc2-k8s-02"
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.371237 12922 kuberuntime_container.go:547] Killing container "docker://488f9e9f3a84d66be96dd324f73eb84445e0086f1a0ee54c0885e76e5d87e931" with 30 second grace period
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: W0608 01:32:53.385514 12922 status_manager.go:461] Failed to get status for pod "kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7)": Get https://127.0.0.1:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-dc2-k8s-02: unexpected EOF
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.385546 12922 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.386504 12922 reflector.go:322] k8s.io/kubernetes/pkg/kubelet/kubelet.go:460: Failed to watch *v1.Node: Get https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3Ddc2-k8s-02&resourceVersion=290288&timeoutSeconds=507&watch=true: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.386540 12922 event.go:209] Unable to write event: 'Post https://127.0.0.1:6443/api/v1/namespaces/kube-system/events: unexpected EOF; some request body already written' (may retry after sleeping)
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.386553 12922 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.386681 12922 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.386920 12922 reflector.go:322] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to watch *v1.Pod: Get https://127.0.0.1:6443/api/v1/pods?fieldSelector=spec.nodeName%3Ddc2-k8s-02&resourceVersion=234367&timeoutSeconds=471&watch=true: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.386955 12922 reflector.go:322] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to watch *v1.Service: Get https://127.0.0.1:6443/api/v1/services?resourceVersion=229986&timeoutSeconds=586&watch=true: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: E0608 01:32:53.426374 12922 kuberuntime_container.go:65] Can't make a ref to pod "kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7)", container kube-apiserver: selfLink was empty, can't make reference
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.521670 12922 eviction_manager.go:157] eviction manager: pods kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7) evicted, waiting for pod to be cleaned up
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.818261 12922 kubelet.go:1901] SyncLoop (PLEG): "kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7)", event: &pleg.PodLifecycleEvent{ID:"724c3c291ba23d6b4714f3172ed25cd7", Type:"ContainerDied", Data:"488f9e9f3a84d66be96dd324f73eb84445e0086f1a0ee54c0885e76e5d87e931"}
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: I0608 01:32:53.818336 12922 kubelet.go:1901] SyncLoop (PLEG): "kube-apiserver-dc2-k8s-02_kube-system(724c3c291ba23d6b4714f3172ed25cd7)", event: &pleg.PodLifecycleEvent{ID:"724c3c291ba23d6b4714f3172ed25cd7", Type:"ContainerDied", Data:"5b03e2687193cd797045fbb212d32373e1017eac2d8a984aedad243da1f4012c"}
Jun 8 01:32:53 dc2-k8s-02 kubelet[12922]: W0608 01:32:53.818365 12922 pod_container_deletor.go:77] Container "5b03e2687193cd797045fbb212d32373e1017eac2d8a984aedad243da1f4012c" not found in pod's containers
Jun 8 01:32:54 dc2-k8s-02 kubelet[12922]: E0608 01:32:54.386844 12922 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:460: Failed to list *v1.Node: Get https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3Ddc2-k8s-02&limit=500&resourceVersion=0: dial tcp
Can anybody help me find out why this is happening?
You definitely have a failing node. There can be several reasons, but based on the logs in your question the first pick is disk pressure on that node: the kubelet's eviction manager reports imagefs pressure, fails to free space through image garbage collection (the image it tries to remove is still in use by a running container), and then starts evicting pods to reclaim space. Its eviction ranking puts kube-apiserver-dc2-k8s-02 first, and as soon as that pod is killed the kubelet loses its connection to https://127.0.0.1:6443, can no longer report node status, and the node goes NotReady with its remaining pods showing up as Unknown.
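A minimal first check you can run on dc2-k8s-02 itself, assuming you have SSH access to the node, Docker as the container runtime (which the dockerd entries in your syslog suggest), and the default /var/lib/docker data root:

# Check filesystem usage for the root filesystem and the Docker data directory
df -h / /var/lib/docker

# See how much space images, containers and volumes are consuming
docker system df

# Reclaim space from unused images and stopped containers
docker image prune -f
docker container prune -f

If the partition backing /var/lib/docker is close to full, that alone would explain the imagefs evictions.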
To be more specific than this, you would need to share more information about your system, such as taints, the size of the nodes, and ideally the output of kubectl describe node dc2-k8s-02. The scheduler logs may also shed more light on the issue: kubectl -n kube-system logs kube-scheduler-dc2-k8s-02 (and kubectl -n kube-system logs kube-scheduler-dc2-k8s-01 as well).
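If it helps, here is a small set of commands to gather that information in one go; the journalctl call assumes the kubelet runs under systemd on the node, which your syslog suggests:

# Node conditions (DiskPressure, MemoryPressure, ...), capacity, allocated resources and taints
kubectl describe node dc2-k8s-02

# Recent cluster events, sorted by creation time
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp

# Scheduler logs from both masters
kubectl -n kube-system logs kube-scheduler-dc2-k8s-02
kubectl -n kube-system logs kube-scheduler-dc2-k8s-01

# Kubelet logs on the failing node itself (run on dc2-k8s-02)
journalctl -u kubelet --since "1 hour ago"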