I have a Kubernetes cluster with 3-master stacked control plane - so each master also has its own etcd instance running locally. The problem I am trying solve is this: "If one master dies such that it cannot be restarted, how do I replace it?"
Currently, when I try to add the replacement master into the cluster, I get the following error while running kubeadm join
:
[check-etcd] Checking that the etcd cluster is healthy
I0302 22:43:41.968068 9158 local.go:66] [etcd] Checking etcd cluster health
I0302 22:43:41.968089 9158 local.go:69] creating etcd client that connects to etcd pods
I0302 22:43:41.986715 9158 etcd.go:106] etcd endpoints read from pods: https://10.0.2.49:2379,https://10.0.225.90:2379,https://10.0.247.138:2379
error execution phase check-etcd: error syncing endpoints with etc: dial tcp 10.0.2.49:2379: connect: no route to host
The 10.0.2.49 node is the one that died. These nodes are all running in an AWS AutoScaling group, so I don't have control over the addresses.
I have drained and deleted the dead master node using kubectl drain
and kubectl delete
; and I have used etcdctl
to make sure the dead node was not in the member list.
Why is it still trying to connect to that node's etcd?
The problem turned out to be that the failed node was still listed in the ConfigMap. Further investigation led me to the following thread, which discusses the same problem: https://github.com/kubernetes/kubeadm/issues/1300
The solution that worked for me was to edit the ConfigMap manually.
kubectl -n kube-system get cm kubeadm-config -o yaml > tmp-kubeadm-config.yaml
manually edit tmp-kubeadm-config.yaml to remove the old server
kubectl -n kube-system apply -f tmp-kubeadm-config.yaml
I believe updating the etcd member list is still necessary to ensure cluster stability, but it wasn't the full solution.
It is still trying to connect to the member because etcd maintains a list of members in its store -- that's how it knows to vote on quorum decisions. I don't believe etcd is unique in that way -- most distributed key-value stores know their member list
The fine manual shows how to remove a dead member, but it also warns to add a new member before removing unhealthy ones.
There is also a project etcdadm that is designed to smooth over some of the rough edges about etcd cluster management, but I haven't used it to say what it is good at versus not