I've been running a Kubernetes cluster for a while now, but I haven't been able to keep it stable. The cluster consists of four nodes, two masters and two workers, all VMs on the same physical server, which runs VMware vSphere 6.5. Each node runs CoreOS stable (1353.7.0) with Kubernetes/Hyperkube v1.6.4, using Calico for networking. I followed the steps in this guide.
What happens is that the cluster will run without a hitch for a few hours or days. Then, all of a sudden (for no reason I can discern), all my pods go to status "Pending" and stay that way. Any hosted services are then no longer reachable. After a while (usually 5 to 10 minutes) the cluster seems to restore itself: it starts recreating all my pods and trying (but failing) to shut down the old ones. Some of the newly created pods come up, but initially have no connection to the internet.
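For reference, this is roughly the read-only triage I run when the pods get stuck in Pending (a sketch, assuming kubectl is configured against the affected cluster; it's guarded so it just reports nothing useful when the API server is unreachable):

```shell
# Count pods stuck in Pending across all namespaces (0 if kubectl/cluster unavailable).
pending=$(kubectl get pods --all-namespaces 2>/dev/null | grep -c Pending || true)
echo "pods currently Pending: $pending"
# Node health: a NotReady node would explain unschedulable pods.
kubectl get nodes -o wide 2>/dev/null || echo "kubectl unavailable or cluster unreachable"
# The most recent events usually name the real failure (scheduling, CNI, volumes).
kubectl get events --all-namespaces 2>/dev/null | tail -n 20
```

During the "restore" phase the events list in particular fills up with sandbox teardown errors like the ones quoted further down.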
For a couple of weeks now I've had this issue intermittently, and it's been preventing me from using Kubernetes in production. I'd really like to figure out what's been causing this!
Weirdly enough, when I tried to diagnose the problem by inspecting the logs, I noticed that the journald logs on both of my worker nodes had become corrupted! On the master nodes the logs are still readable, but not very informative.
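To confirm the corruption on each worker I used journald's built-in integrity check (a sketch; needs to run as root on the node, and it's guarded so it just reports 0 where journalctl isn't available):

```shell
# journalctl --verify checks every journal file and prints PASS/FAIL per file;
# count the failures. systemd renames corrupt files to *.journal~ rather than
# repairing them, so old data may still be partially readable.
corrupt=$(journalctl --verify 2>&1 | grep -c FAIL || true)
echo "journal files failing verification: $corrupt"
```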
Even while the cluster is running normally, kubelet constantly emits errors to its logs. On all nodes, something like this is posted about once a minute:
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.012890 24228 cni.go:275] Error deleting network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.014762 24228 remote_runtime.go:109] StopPodSandbox "3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logstash-s3498_default" network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.014818 24228 kuberuntime_gc.go:138] Failed to stop sandbox "3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logstash-s3498_default" network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:38:07 kube-master1 kubelet-wrapper[24228]: I0526 09:38:07.422341 24228 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/9a378211-3597-11e7-a7ec-000c2958a0d7-default-token-0p3gf" (spec.Name: "default-token-0p3gf") pod "9a378211-3597-11e7-a7ec-000c2958a0d7" (UID: "9a378211-3597-11e7-a7ec-000c2958a0d7").
May 26 09:38:14 kube-master1 kubelet-wrapper[24228]: W0526 09:38:14.037553 24228 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "logstash-s3498_default": Unexpected command output nsenter: cannot open : No such file or directory
May 26 09:38:14 kube-master1 kubelet-wrapper[24228]: with error: exit status 1
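The teardown errors refer to per-sandbox state files that flannel's CNI plugin writes under /var/lib/cni/flannel (the file names are sandbox IDs). A quick look at that directory on each node shows whether the files kubelet keeps asking for are simply gone (a read-only sketch; adjust the path if your CNI config puts state elsewhere):

```shell
# Count the flannel CNI state files still present on this node.
state_dir=/var/lib/cni/flannel
if [ -d "$state_dir" ]; then
  file_count=$(ls "$state_dir" | wc -l)
else
  file_count=0
fi
echo "CNI flannel state files present in $state_dir: $file_count"
```

In my case the directory exists but the specific IDs from the log lines above are missing, which is why every GC pass re-logs the same failures.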
I've googled this error and came across this issue, but it has been closed; people there indicate that using v1.6.0 or later should resolve it, but it definitely hasn't in my case...
Can anybody point me in the right direction?!
Thanks!
I've seen this as well. The problem seems to go away if you downgrade CoreOS to an older version that ships Docker 1.12.3.
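Before and after the downgrade, it's worth checking which Docker the node is actually running, since the CoreOS release notes are the only other place this shows up (a one-liner sketch; prints "unknown" if the daemon isn't reachable):

```shell
# Ask the Docker daemon for its server version; empty/unknown if it's not running.
docker_ver=$(docker version --format '{{.Server.Version}}' 2>/dev/null || true)
echo "docker server version: ${docker_ver:-unknown}"
```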
Docker is a nightmare with regressions in every single version they release :(