kube-apiserver unable to create storage backend

6/14/2018

I set up a high-availability Kubernetes cluster by following the official Creating HA clusters with kubeadm guide. It is an experimental cluster for exploring the feasibility of an on-premises high-availability deployment, and as such I created the cluster on six CentOS 7 virtual machines hosted on VMware Workstation - three master nodes and three worker nodes.

It was running fine after initial setup, but after I shut down everything last night and restarted all the VMs this morning, kube-apiserver is no longer starting on any of the master nodes. It is failing on all nodes with a message stating that it is "unable to create storage backend (context deadline exceeded)":

F0614 20:18:43.297064       1 storage_decorator.go:57] Unable to create storage backend: config (&{ /registry [https://192.168.56.10.localdomain:2379 https://192.168.56.11.localdomain:2379 https://192.168.56.12.localdomain:2379] /etc/pki/tls/private/client-key.pem /etc/pki/tls/certs/client.pem /etc/pki/ca-trust/source/anchors/ca.pem true false 1000 0xc42047e100 <nil> 5m0s 1m0s}), err (context deadline exceeded)

That suggests a problem with etcd, but the etcd cluster reports healthy, and I can successfully use it to set and query values using the same certs provided to kube-apiserver.
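
For reference, the health check I ran looked roughly like this (the endpoints and cert paths are copied from the error message above; note that etcd 3.1's etcdctl defaults to the v2 API, and the v3 flag names differ):

```shell
# v2 API (etcdctl's default in etcd 3.1): check cluster health using
# the same client certs kube-apiserver is configured with
etcdctl \
  --endpoints=https://192.168.56.10.localdomain:2379,https://192.168.56.11.localdomain:2379,https://192.168.56.12.localdomain:2379 \
  --cert-file=/etc/pki/tls/certs/client.pem \
  --key-file=/etc/pki/tls/private/client-key.pem \
  --ca-file=/etc/pki/ca-trust/source/anchors/ca.pem \
  cluster-health

# v3 API equivalent (different flag names)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.56.10.localdomain:2379 \
  --cert=/etc/pki/tls/certs/client.pem \
  --key=/etc/pki/tls/private/client-key.pem \
  --cacert=/etc/pki/ca-trust/source/anchors/ca.pem \
  endpoint health
```

Both reported the cluster healthy, which is what made the kube-apiserver failure so puzzling.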

My versions are:

CentOS 7.5.1804
Kubernetes - 1.10.4
Docker - 18.03.1-ce
etcd - 3.1.17
keepalived - 1.3.5

And though these all worked fine together last night, in an effort to rule out version conflicts, I tried adding --storage-backend=etcd3 to the kube-apiserver.yaml manifest file and downgrading Docker to 17.03.2-ce. Neither helped.
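
(For anyone repeating this: the flag goes into the static pod manifest that kubelet watches. The path below is where a stock kubeadm install puts it; yours may differ.)

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --storage-backend=etcd3
    # ...existing flags unchanged...
```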

I also disabled firewalld to ensure it wasn't blocking any etcd traffic. Again, that did not help (nor did I see any evidence of dropped connections).

I don't know how to dig any deeper to discover why the kube-apiserver can't create its storage backend. So far my experiment with high-availability is a failure.

-- seefer
kubernetes

2 Answers

10/15/2018

I've run into this problem and solved it by deleting the /etc/kubernetes directory on the host OS and reinstalling k8s. (Using Rancher)
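
I was using Rancher, so the exact steps may differ for you; on a kubeadm-built cluster, the rough equivalent would be something like this (destructive: it wipes the node's control-plane state):

```shell
# DESTRUCTIVE: removes this node's Kubernetes state (kubeadm clusters)
kubeadm reset -f
rm -rf /etc/kubernetes
# then re-run kubeadm init (or kubeadm join) to rebuild the node
```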

-- Geert Schuring
Source: StackOverflow

6/14/2018

The details at the end of the error message (context deadline exceeded) suggest a timeout (Go's context package is used for handling timeouts). But I wasn't seeing any slowness when I accessed the etcd cluster directly via etcdctl, so I set up a tcpdump capture to see if it would tell me anything more about what was happening between kube-apiserver and etcd. I filtered on port 2379, which is etcd's client request port:

tcpdump -i any port 2379

I did not see any activity at first, so I forced activity by querying etcd directly via etcdctl. That worked, and it showed the expected traffic to port 2379.

At this point I was still stuck, because it appeared that kube-apiserver simply wasn't calling etcd. But then a few mysterious entries appeared in tcpdump's output:

18:04:30.912541 IP master0.34480 > unallocated.barefruit.co.uk.2379: Flags [S], seq 1974036339, win 29200, options [mss 1460,sackOK,TS val 4294906938 ecr 0,nop,wscale 7], length 0
18:04:32.902298 IP master0.34476 > unallocated.barefruit.co.uk.2379: Flags [S], seq 3960458101, win 29200, options [mss 1460,sackOK,TS val 4294908928 ecr 0,nop,wscale 7], length 0
18:04:32.910289 IP master0.34478 > unallocated.barefruit.co.uk.2379: Flags [S], seq 2100196833, win 29200, options [mss 1460,sackOK,TS val 4294908936 ecr 0,nop,wscale 7], length 0

What is unallocated.barefruit.co.uk and why is a process on my master node trying to make an etcd client request to it?

A quick Google search reveals that unallocated.barefruit.co.uk is part of a DNS "enhancing" service that redirects failed DNS lookups.

My nodes aren't registered in DNS because this is just an experimental cluster. I have entries for them in /etc/hosts, but that's it. Apparently something in kube-apiserver is attempting to resolve my etcd node names (e.g. master0.localdomain) and is querying DNS before /etc/hosts (I always thought /etc/hosts took priority). And rather than rejecting the invalid names, my ISP (Verizon FIOS) is using this "enhanced" DNS service that redirects to unallocated.barefruit.co.uk which, surprisingly enough, isn't running an etcd cluster for me.
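
A quick way to compare the libc lookup order against a direct DNS query (master0.localdomain is one of my node names; substitute your own):

```shell
# "hosts: files dns" means /etc/hosts should win over DNS for libc callers
grep '^hosts:' /etc/nsswitch.conf

# getent resolves through nsswitch (so it should honor /etc/hosts),
# while dig queries the configured DNS server directly
getent hosts master0.localdomain
dig +short master0.localdomain
```

With the "enhanced" DNS in play, the dig query comes back with an answer for a name that shouldn't resolve at all.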

I edited the network configuration on my master nodes to prove out my hypothesis, adding explicit DNS settings pointing to Google's public servers 8.8.8.8 and 8.8.4.4, which are not "enhanced". I then rebooted, and the cluster came right up.
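
On CentOS 7 that amounts to an edit like the following in the interface config (the interface name will vary), followed by systemctl restart network:

```
# /etc/sysconfig/network-scripts/ifcfg-ens33 (excerpt; interface name varies)
PEERDNS=no    # keep dhclient from overwriting resolv.conf with the ISP's DNS
DNS1=8.8.8.8
DNS2=8.8.4.4
```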

So what really changed between last night and today? My experimental cluster is running on my laptop, and yesterday I was working in the office (no FIOS), while today I was working at home (connected to FIOS). Ugh. Thanks Verizon!

I'm still not sure why kube-apiserver seems to be prioritizing DNS over /etc/hosts. But I guess the lesson is to either make sure your node names have valid DNS entries or specify everything by IP address. Anyone have any thoughts as to which is best practice?
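
If you go the IP route, the change is in the same static pod manifest, e.g. (this assumes the etcd server certificates include the node IPs as SANs, otherwise TLS verification will fail):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
- --etcd-servers=https://192.168.56.10:2379,https://192.168.56.11:2379,https://192.168.56.12:2379
```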

-- seefer
Source: StackOverflow