I have one node that is giving me some trouble at the moment. I have not found a solution yet, but that might be a skill-level problem, Google coming up empty, or I may have found some unsolvable issue. The latter is highly unlikely.
kubectl version v1.8.5
docker version 1.12.6
While doing some normal maintenance on my nodes, I noticed the following:
NAME STATUS ROLES AGE VERSION
ip-192-168-4-14.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-143.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-174.ourdomain.pro Ready <none> 213d v1.8.5
ip-192-168-4-182.ourdomain.pro Ready <none> 46d v1.8.5
ip-192-168-4-221.ourdomain.pro Ready <none> 213d v1.8.5
ip-192-168-4-249.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-251.ourdomain.pro NotReady <none> 206d v1.8.5
On the NotReady node, I am unable to attach or exec into anything, which seems normal for a NotReady state unless I am misreading it. I am not able to look at any pod-specific logs on that node for the same reason.
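What I can still do is describe the node from a master and, over SSH, read the kubelet journal on the node itself; roughly (node name as listed above, direct SSH access assumed):
kubectl describe node ip-192-168-4-251.ourdomain.pro      # node conditions and recent events
journalctl -u kubelet --no-pager --since "1 hour ago"     # run on the node itself, over SSH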
At this point, I restarted kubelet and attached myself to the logs simultaneously to see if anything out of the ordinary would appear.
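For completeness, the restart and log tail were essentially this (assuming systemd manages the kubelet, as the unit status further down shows):
systemctl restart kubelet
journalctl -u kubelet -f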
I have attached the errors I spent a day Googling, but I cannot confirm whether they are actually connected to the problem.
ERROR 1
unable to connect to Rkt api service
We are not using rkt, so I put this on the ignore list.
ERROR 2
unable to connect to CRI-O api service
We are not using CRI-O, so I put this on the ignore list.
ERROR 3
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
I have not been able to exclude this as a potential pitfall, but the things I have found thus far do not seem to relate to the version I am running.
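One sanity check I can run on the node, assuming the default Docker data root of /var/lib/docker, is whether the image filesystem this message refers to is actually there and has space:
docker info              # storage driver and Docker root dir
df -h /var/lib/docker    # free space on the image filesystem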
ERROR 4
skipping pod synchronization - [container runtime is down PLEG is not healthy
I do not have an answer for this one except for the fact that the garbage collection error above appears a second time after this message.
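My understanding is that PLEG health boils down to whether the kubelet's periodic relisting of containers completes in time, so a quick check on the node would be whether Docker itself still responds, and how slowly:
systemctl status docker
time docker ps           # if this hangs or is very slow, that would explain the unhealthy PLEG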
ERROR 5
Registration of the rkt container factory failed
We are not using rkt, so it should fail, unless I am mistaken.
ERROR 6
Registration of the crio container factory failed
We are not using CRI-O, so it should fail, unless, again, I am mistaken.
ERROR 7
28087 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-545bc4bfd4-rt7qp_kube-system": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container
I found a GitHub issue for this one, but it seems to be fixed, so I am not sure how it relates.
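If it is relevant, the terminated sandbox it complains about should show up as an exited pause container on the node; something like this lists them (assuming the standard pause image used by dockershim):
docker ps -a --filter status=exited | grep -i pause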
ERROR 8
28087 kubelet_node_status.go:791] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-05-06 05:00:40.664331773 +0000 UTC LastTransitionTime:2019-05-06 05:00:40.664331773 +0000 UTC Reason:KubeletNotReady Message:container runtime is down}
And here the node goes into NotReady.
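The same reason is also visible from a master in the node's Ready condition, for example:
kubectl get node ip-192-168-4-251.ourdomain.pro -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'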
Last log messages and status
systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2019-05-06 05:00:39 UTC; 1h 58min ago
Docs: http://kubernetes.io/docs/
Main PID: 28087 (kubelet)
Tasks: 21
Memory: 42.3M
CGroup: /system.slice/kubelet.service
└─28087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manife...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310305 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310330 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310359 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "varl...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310385 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "cali...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310408 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "kube...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310435 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310456 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310480 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "ca-c...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310504 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "k8s-...
May 06 05:14:29 kube-master-1 kubelet[28087]: E0506 05:14:29.848530 28087 helpers.go:468] PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs
Here is the kubectl get po -o wide output.
NAME READY STATUS RESTARTS AGE IP NODE
docker-image-prune-fhjkl 1/1 Running 4 213d 100.96.67.87 ip-192-168-4-249
docker-image-prune-ltfpf 1/1 Running 4 213d 100.96.152.74 ip-192-168-4-143
docker-image-prune-nmg29 1/1 Running 3 213d 100.96.22.236 ip-192-168-4-221
docker-image-prune-pdw5h 1/1 Running 7 213d 100.96.90.116 ip-192-168-4-174
docker-image-prune-swbhc 1/1 Running 0 46d 100.96.191.129 ip-192-168-4-182
docker-image-prune-vtsr4 1/1 NodeLost 1 206d 100.96.182.197 ip-192-168-4-251
fluentd-es-4bgdz 1/1 Running 6 213d 192.168.4.249 ip-192-168-4-249
fluentd-es-fb4gw 1/1 Running 7 213d 192.168.4.14 ip-192-168-4-14
fluentd-es-fs8gp 1/1 Running 6 213d 192.168.4.143 ip-192-168-4-143
fluentd-es-k572w 1/1 Running 0 46d 192.168.4.182 ip-192-168-4-182
fluentd-es-lpxhn 1/1 Running 5 213d 192.168.4.174 ip-192-168-4-174
fluentd-es-pjp9w 1/1 Unknown 2 206d 192.168.4.251 ip-192-168-4-251
fluentd-es-wbwkp 1/1 Running 4 213d 192.168.4.221 ip-192-168-4-221
grafana-76c7dbb678-p8hzb 1/1 Running 3 213d 100.96.90.115 ip-192-168-4-174
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-g8xmp 2/2 Running 2 101d 100.96.22.234 ip-192-168-4-221
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-tvp4m 2/2 Running 2 101d 100.96.22.235 ip-192-168-4-221
prometheus-65b4b68d97-82vr7 1/1 Running 3 213d 100.96.90.87 ip-192-168-4-174
pushgateway-79f575d754-75l6r 1/1 Running 3 213d 100.96.90.83 ip-192-168-4-174
rabbitmq-cluster-58db9b6978-g6ssb 2/2 Running 4 181d 100.96.90.117 ip-192-168-4-174
replicator-56x7v 1/1 Running 3 213d 100.96.90.84 ip-192-168-4-174
traefik-ingress-6dc9779596-6ghwv 1/1 Running 3 213d 100.96.90.85 ip-192-168-4-174
traefik-ingress-6dc9779596-ckzbk 1/1 Running 4 213d 100.96.152.73 ip-192-168-4-143
traefik-ingress-6dc9779596-sbt4n 1/1 Running 3 213d 100.96.22.232 ip-192-168-4-221
Output of kubectl get po -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP
calico-kube-controllers-78f554c7bb-s7tmj 1/1 Running 4 213d 192.168.4.14
calico-node-5cgc6 2/2 Running 9 213d 192.168.4.249
calico-node-bbwtm 2/2 Running 8 213d 192.168.4.14
calico-node-clwqk 2/2 NodeLost 4 206d 192.168.4.251
calico-node-d2zqz 2/2 Running 0 46d 192.168.4.182
calico-node-m4x2t 2/2 Running 6 213d 192.168.4.221
calico-node-m8xwk 2/2 Running 9 213d 192.168.4.143
calico-node-q7r7g 2/2 Running 8 213d 192.168.4.174
cluster-autoscaler-65d6d7f544-dpbfk 1/1 Running 10 207d 100.96.67.88
kube-apiserver-ip-192-168-4-14 1/1 Running 6 213d 192.168.4.14
kube-apiserver-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-apiserver-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kube-controller-manager-ip-192-168-4-14 1/1 Running 5 213d 192.168.4.14
kube-controller-manager-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-controller-manager-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kube-dns-545bc4bfd4-rt7qp 3/3 Running 13 213d 100.96.19.197
kube-proxy-2bn42 1/1 Running 0 46d 192.168.4.182
kube-proxy-95cvh 1/1 Running 4 213d 192.168.4.174
kube-proxy-bqrhw 1/1 NodeLost 2 206d 192.168.4.251
kube-proxy-cqh67 1/1 Running 6 213d 192.168.4.14
kube-proxy-fbdvx 1/1 Running 4 213d 192.168.4.221
kube-proxy-gcjxg 1/1 Running 5 213d 192.168.4.249
kube-proxy-mt62x 1/1 Running 4 213d 192.168.4.143
kube-scheduler-ip-192-168-4-14 1/1 Running 6 213d 192.168.4.14
kube-scheduler-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-scheduler-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kubernetes-dashboard-7c5d596d8c-q6sf2 1/1 Running 5 213d 100.96.22.230
tiller-deploy-6d9f596465-svpql 1/1 Running 3 213d 100.96.22.231
I am a bit lost at this point as to where to go from here. Any suggestions are welcome.
Most likely the kubelet is down. Share the output from the below command:
journalctl -u kubelet
Share the output from the below command:
kubectl get po -n kube-system -owide
It appears the node is not able to communicate with the control plane. You can try the steps below.
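A rough sketch of such steps, assuming Docker is the container runtime on that node and systemd manages both services:
systemctl restart docker                         # restart the container runtime first
systemctl restart kubelet
journalctl -u kubelet -f                         # watch whether the "container runtime is down" message clears
grep 'server:' /etc/kubernetes/kubelet.conf      # the API server URL the kubelet talks to
curl -k <server-url-from-kubelet.conf>/healthz   # substitute the URL printed above; should return ok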