Kubernetes version : v1.6.7
Network plugin : weave
I recently noticed that my entire cluster of 3 nodes went down. Doing my initial level of troubleshooting revealed that /var
on all nodes was 100%
.
Doing further into the logs revealed the logs to be flooded by kubelet
stating
Jan 15 19:09:43 test-master kubelet[1220]: E0115 19:09:43.636001 1220 kuberuntime_gc.go:138] Failed to stop sandbox "fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "<TROUBLING_POD>-1545236220-ds0v1_kube-system" network: CNI failed to retrieve network namespace path: Error: No such container: fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211
Jan 15 19:09:43 test-master kubelet[1220]: E0115 19:09:43.637690 1220 docker_sandbox.go:205] Failed to stop sandbox "fea94c9f46923806c177e4a158ffe3494fe17638198f30498a024c3e8237f648": Error response from daemon: {"message":"No such container: fea94c9f46923806c177e4a158ffe3494fe17638198f30498a024c3e8237f648"}
The <TROUBLING_POD>-1545236220-ds0v1
was being initiated due to a cronjob and due to some misconfigurations, there were errors occurring during the running of those pods and more pods were being spun up.
So I deleted all the jobs and their related pods. So I had a cluster that had no jobs/pods running related to my cronjob and still see the same ERROR messages flooding the logs.
I did :
1) Restart docker and kubelet on all nodes.
2) Restart the entire control plane
and also 3) Reboot all nodes.
But still the logs are being flooded with the same error messages even though no such pods are even being spun up.
So I dont know how can I stop kubelet from throwing out the errors.
Is there a way for me to reset the network plugin I am using ? Or do something else ?
Check if the pod directory exists under /var/lib/kubelet
You're on a very old version of Kubernetes, upgrading will fix this issue.