Re-installed node cannot join Kubernetes cluster

6/10/2021

I had a working 3 node k8s cluster (v1.21.0 on Ubuntu 20.04 bare metal) installed using kubeadm. I removed one of the nodes and re-installed it from scratch (wipe disks, new OS but IP address is the same). Now it is unable to join the cluster:

# kubeadm join k8s.example.com:6443 --token who21h.jolq7z79twv7bf4m \
--discovery-token-ca-cert-hash sha256:f63c5786cea2be46c999f4b5c595abd0aa24896c3b37616c347df318d7406c00 \
--control-plane
...
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://65.21.128.36:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher

I ran the same command (after kubeadm reset) with --v=5 and it gets stuck, repeatedly logging:

Failed to get etcd status for https://123.123.123.123:2379: failed to dial endpoint https://123.123.123.123:2379 with maintenance client: context deadline exceeded

123.123.123.123 is the IP address for the node I am trying to return to the cluster.

Running kubectl get nodes on one of the other masters lists just the 2 remaining masters. I had removed the node in question properly before re-installing it:

kubectl get nodes
kubectl drain <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
kubectl delete node <node-name>

Any ideas? Tx.

-- David Tinker
kubeadm
kubernetes

1 Answer

6/11/2021

Take a closer look at the error message you get:

Failed to get etcd status for https://123.123.123.123:2379: failed to dial endpoint https://123.123.123.123:2379 with maintenance client: context deadline exceeded

This is a quite common, well-documented issue with the etcd cluster; several existing threads cover it.

Specifically, it is caused by the loss of etcd quorum: deleting the node with kubectl delete node does not remove its member from the etcd cluster, so the remaining members still try to reach it at the old address. You can check this as described here.
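A minimal sketch of such a check, assuming the default kubeadm certificate paths on a surviving control-plane node (run it where a healthy etcd member still lives):

```shell
# Sketch: query etcd health from a surviving control-plane node.
# Certificate paths below are the kubeadm defaults; adjust if yours differ.
etcd_health() {
  ETCDCTL_API=3 etcdctl \
    --endpoints https://127.0.0.1:2379 \
    --cacert /etc/kubernetes/pki/etcd/ca.crt \
    --cert /etc/kubernetes/pki/etcd/server.crt \
    --key /etc/kubernetes/pki/etcd/server.key \
    endpoint health
}
# On a healthy member this reports "is healthy"; a cluster that has lost
# quorum times out instead, matching the "context deadline exceeded" above.
```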

The solution is described step by step in this comment:

For the record, here are the commands to run on one of the remaining etcd pods:

Find the ID of the member to remove:

ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list
5a4945140f0b39d9, started, sbg2-k8s001, https://192.168.208.12:2380, https://192.168.208.12:2379
740381e3c57ef823, started, gra3-k8s001, https://192.168.208.13:2380, https://192.168.208.13:2379
77a8fbb530b10f4a, started, rbx4-k8s001, https://192.168.208.14:2380, https://192.168.208.14:2379

I want to remove 740381e3c57ef823:

ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member remove 740381e3c57ef823
Member 740381e3c57ef823 removed from cluster a2c90ef66bb95cc9

Checking:

ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list
5a4945140f0b39d9, started, sbg2-k8s001, https://192.168.208.12:2380, https://192.168.208.12:2379
77a8fbb530b10f4a, started, rbx4-k8s001, https://192.168.208.14:2380, https://192.168.208.14:2379

Now I can join my new master.
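Aside: when the member list is long, the stale member's ID can be pulled out by node name instead of copying it by hand. A small sketch using the listing above (the node name gra3-k8s001 comes from that sample output; in practice pipe the real etcdctl output into awk):

```shell
# Extract an etcd member ID by node name from `member list` output.
# The sample listing is inlined here for illustration only.
member_list='5a4945140f0b39d9, started, sbg2-k8s001, https://192.168.208.12:2380, https://192.168.208.12:2379
740381e3c57ef823, started, gra3-k8s001, https://192.168.208.13:2380, https://192.168.208.13:2379
77a8fbb530b10f4a, started, rbx4-k8s001, https://192.168.208.14:2380, https://192.168.208.14:2379'

# Fields are comma-separated: ID, status, name, peer URL, client URL.
member_id=$(printf '%s\n' "$member_list" | awk -F', ' '$3 == "gra3-k8s001" {print $1}')
echo "$member_id"   # prints 740381e3c57ef823
```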

-- mario
Source: StackOverflow