Troubleshooting a NotReady node

5/6/2019

I have one node that is giving me some trouble at the moment. I have not found a solution yet, which might be down to my skill level, Google coming up empty, or my having hit a genuinely unsolvable issue. The latter is highly unlikely.

kubectl version v1.8.5
docker version 1.12.6

While doing some routine maintenance on my nodes, I noticed the following in the kubectl get nodes output:

NAME                            STATUS   ROLES     AGE       VERSION
ip-192-168-4-14.ourdomain.pro   Ready    master    213d      v1.8.5
ip-192-168-4-143.ourdomain.pro  Ready    master    213d      v1.8.5
ip-192-168-4-174.ourdomain.pro  Ready    <none>    213d      v1.8.5
ip-192-168-4-182.ourdomain.pro  Ready    <none>    46d       v1.8.5
ip-192-168-4-221.ourdomain.pro  Ready    <none>    213d      v1.8.5
ip-192-168-4-249.ourdomain.pro  Ready    master    213d      v1.8.5
ip-192-168-4-251.ourdomain.pro  NotReady <none>    206d      v1.8.5

On the NotReady node, I am unable to attach or exec into anything, which seems normal for a node in a NotReady state unless I am misreading it. I am not able to look at any pod logs on that node for the same reason.
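
For reference, this is roughly the sort of thing that fails for pods scheduled on that node (the pod name here is just an example taken from the listings further down):

    # Both of these hang or error out against the NotReady node
    kubectl logs fluentd-es-pjp9w
    kubectl exec -it fluentd-es-pjp9w -- /bin/sh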

At this point, I restarted kubelet and attached myself to the logs simultaneously to see if anything out of the ordinary would appear.
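
Concretely, that looked roughly like this on the node itself:

    # Restart the kubelet and follow its journal at the same time
    systemctl restart kubelet
    journalctl -u kubelet -f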

Below are the errors I spent a day Googling, though I cannot confirm whether they are actually connected to the problem.

ERROR 1

unable to connect to Rkt api service

We are not using rkt, so I put this on the ignore list.

ERROR 2

unable to connect to CRI-O api service

We are not using CRI-O, so I put this on the ignore list.

ERROR 3

Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /

I have not been able to exclude this as a potential culprit, but what I have found so far does not seem to relate to the version I am running.
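
In case it points someone in the right direction, a rough way to check whether the image filesystem stats are even available on that node would be something like this (/var/lib/docker is the default Docker data directory and just an assumption here):

    # Check the Docker storage driver and the disk backing the image filesystem
    docker info | grep -i -E 'storage|space'
    df -h /var/lib/docker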

ERROR 4

skipping pod synchronization - [container runtime is down PLEG is not healthy

I do not have an answer for this one, other than that the garbage collection error above appears a second time after this message.
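
Since PLEG health depends on the container runtime responding, a quick sanity check on the node would be whether Docker itself still answers (the timeout is only an arbitrary guard in case the daemon is hung):

    # If the Docker daemon is hung, these stall or error out as well,
    # which would line up with the "container runtime is down" condition
    systemctl status docker
    timeout 30 docker ps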

ERROR 5

Registration of the rkt container factory failed

We are not using rkt, so this is expected to fail unless I am mistaken.

ERROR 6

Registration of the crio container factory failed

We are not using CRI-O either, so this is also expected to fail unless, again, I am mistaken.

ERROR 7

28087 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-545bc4bfd4-rt7qp_kube-system": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container

I found a GitHub issue for this one, but it appears to have been fixed, so I am not sure how it relates.
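
If it is relevant, the terminated container the CNI plugin complains about should show up on the node as an exited sandbox container, so something like this (run on the node; the name filter is only a guess) could confirm whether it is still lying around:

    # Look for exited containers belonging to the kube-dns pod named in the error
    docker ps -a --filter status=exited | grep kube-dns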

ERROR 8

28087 kubelet_node_status.go:791] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-05-06 05:00:40.664331773 +0000 UTC LastTransitionTime:2019-05-06 05:00:40.664331773 +0000 UTC Reason:KubeletNotReady Message:container runtime is down}

And here the node goes into NotReady.
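
The same condition is visible from the API side; the node's Ready condition should carry the reason and message from that log line:

    # The Conditions section should show Ready=False with
    # Reason KubeletNotReady and Message "container runtime is down"
    kubectl describe node ip-192-168-4-251.ourdomain.pro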

Last log messages and status

    systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Mon 2019-05-06 05:00:39 UTC; 1h 58min ago
     Docs: http://kubernetes.io/docs/
 Main PID: 28087 (kubelet)
    Tasks: 21
   Memory: 42.3M
   CGroup: /system.slice/kubelet.service
           └─28087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manife...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310305   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310330   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310359   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "varl...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310385   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "cali...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310408   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "kube...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310435   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310456   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310480   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "ca-c...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310504   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "k8s-...
May 06 05:14:29 kube-master-1 kubelet[28087]: E0506 05:14:29.848530   28087 helpers.go:468] PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs

Here is the kubectl get po -o wide output.

NAME                                              READY     STATUS     RESTARTS   AGE       IP               NODE
docker-image-prune-fhjkl                          1/1       Running    4          213d      100.96.67.87     ip-192-168-4-249
docker-image-prune-ltfpf                          1/1       Running    4          213d      100.96.152.74    ip-192-168-4-143
docker-image-prune-nmg29                          1/1       Running    3          213d      100.96.22.236    ip-192-168-4-221
docker-image-prune-pdw5h                          1/1       Running    7          213d      100.96.90.116    ip-192-168-4-174
docker-image-prune-swbhc                          1/1       Running    0          46d       100.96.191.129   ip-192-168-4-182
docker-image-prune-vtsr4                          1/1       NodeLost   1          206d      100.96.182.197   ip-192-168-4-251
fluentd-es-4bgdz                                  1/1       Running    6          213d      192.168.4.249    ip-192-168-4-249
fluentd-es-fb4gw                                  1/1       Running    7          213d      192.168.4.14     ip-192-168-4-14
fluentd-es-fs8gp                                  1/1       Running    6          213d      192.168.4.143    ip-192-168-4-143
fluentd-es-k572w                                  1/1       Running    0          46d       192.168.4.182    ip-192-168-4-182
fluentd-es-lpxhn                                  1/1       Running    5          213d      192.168.4.174    ip-192-168-4-174
fluentd-es-pjp9w                                  1/1       Unknown    2          206d      192.168.4.251    ip-192-168-4-251
fluentd-es-wbwkp                                  1/1       Running    4          213d      192.168.4.221    ip-192-168-4-221
grafana-76c7dbb678-p8hzb                          1/1       Running    3          213d      100.96.90.115    ip-192-168-4-174
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-g8xmp   2/2       Running    2          101d      100.96.22.234    ip-192-168-4-221
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-tvp4m   2/2       Running    2          101d      100.96.22.235    ip-192-168-4-221
prometheus-65b4b68d97-82vr7                       1/1       Running    3          213d      100.96.90.87     ip-192-168-4-174
pushgateway-79f575d754-75l6r                      1/1       Running    3          213d      100.96.90.83     ip-192-168-4-174
rabbitmq-cluster-58db9b6978-g6ssb                 2/2       Running    4          181d      100.96.90.117    ip-192-168-4-174
replicator-56x7v                                  1/1       Running    3          213d      100.96.90.84     ip-192-168-4-174
traefik-ingress-6dc9779596-6ghwv                  1/1       Running    3          213d      100.96.90.85     ip-192-168-4-174
traefik-ingress-6dc9779596-ckzbk                  1/1       Running    4          213d      100.96.152.73    ip-192-168-4-143
traefik-ingress-6dc9779596-sbt4n                  1/1       Running    3          213d      100.96.22.232    ip-192-168-4-221

Output of kubectl get po -n kube-system -o wide

NAME                                       READY     STATUS     RESTARTS   AGE       IP          
calico-kube-controllers-78f554c7bb-s7tmj   1/1       Running    4          213d      192.168.4.14
calico-node-5cgc6                          2/2       Running    9          213d      192.168.4.249
calico-node-bbwtm                          2/2       Running    8          213d      192.168.4.14
calico-node-clwqk                          2/2       NodeLost   4          206d      192.168.4.251
calico-node-d2zqz                          2/2       Running    0          46d       192.168.4.182
calico-node-m4x2t                          2/2       Running    6          213d      192.168.4.221
calico-node-m8xwk                          2/2       Running    9          213d      192.168.4.143
calico-node-q7r7g                          2/2       Running    8          213d      192.168.4.174
cluster-autoscaler-65d6d7f544-dpbfk        1/1       Running    10         207d      100.96.67.88
kube-apiserver-ip-192-168-4-14             1/1       Running    6          213d      192.168.4.14
kube-apiserver-ip-192-168-4-143            1/1       Running    6          213d      192.168.4.143
kube-apiserver-ip-192-168-4-249            1/1       Running    6          213d      192.168.4.249
kube-controller-manager-ip-192-168-4-14    1/1       Running    5          213d      192.168.4.14
kube-controller-manager-ip-192-168-4-143   1/1       Running    6          213d      192.168.4.143
kube-controller-manager-ip-192-168-4-249   1/1       Running    6          213d      192.168.4.249
kube-dns-545bc4bfd4-rt7qp                  3/3       Running    13         213d      100.96.19.197
kube-proxy-2bn42                           1/1       Running    0          46d       192.168.4.182
kube-proxy-95cvh                           1/1       Running    4          213d      192.168.4.174
kube-proxy-bqrhw                           1/1       NodeLost   2          206d      192.168.4.251
kube-proxy-cqh67                           1/1       Running    6          213d      192.168.4.14
kube-proxy-fbdvx                           1/1       Running    4          213d      192.168.4.221
kube-proxy-gcjxg                           1/1       Running    5          213d      192.168.4.249
kube-proxy-mt62x                           1/1       Running    4          213d      192.168.4.143
kube-scheduler-ip-192-168-4-14             1/1       Running    6          213d      192.168.4.14
kube-scheduler-ip-192-168-4-143            1/1       Running    6          213d      192.168.4.143
kube-scheduler-ip-192-168-4-249            1/1       Running    6          213d      192.168.4.249
kubernetes-dashboard-7c5d596d8c-q6sf2      1/1       Running    5          213d      100.96.22.230
tiller-deploy-6d9f596465-svpql             1/1       Running    3          213d      100.96.22.231

I am a bit lost at this point as to where to go from here. Any suggestions are welcome.

-- Petter
kubernetes

1 Answer

5/6/2019

Most likely the kubelet is down.

Share the output from the command below:

journalctl -u kubelet

Also share the output from the command below:

kubectl get po -n kube-system -owide

It appears the node is not able to communicate with the control plane. You can try the steps below (commands sketched after the list):

  1. Detach the node from the cluster (cordon the node, drain it, and finally delete it)
  2. Reset the node
  3. Rejoin the node to the cluster as a fresh node
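
A rough sketch of those steps, assuming the cluster was set up with kubeadm (the join token and CA cert hash must come from your own control plane):

    # 1. Detach the node from the cluster (run from a machine with cluster access)
    kubectl cordon ip-192-168-4-251.ourdomain.pro
    kubectl drain ip-192-168-4-251.ourdomain.pro --ignore-daemonsets --delete-local-data
    kubectl delete node ip-192-168-4-251.ourdomain.pro

    # 2. Reset the node (run on the node itself)
    kubeadm reset

    # 3. Rejoin it as a fresh node using your cluster's join command
    kubeadm join <api-server-host>:<port> --token <token> --discovery-token-ca-cert-hash sha256:<hash>
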
-- P Ekambaram
Source: StackOverflow