I followed Kelsey Hightower's kubernetes-the-hard-way repo and successfully created a cluster with 3 master nodes and 3 worker nodes. Below are the problems I ran into when removing one of the etcd members and then adding it back, along with all the steps I used:
3 master nodes:
10.240.0.10 controller-0
10.240.0.11 controller-1
10.240.0.12 controller-2
Step 1:
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Result:
b28b52253c9d447e, started, controller-2, https://10.240.0.12:2380, https://10.240.0.12:2379
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379
Step 2 (Remove etcd member of controller-2):
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member remove b28b52253c9d447e --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Step 3 (Add the member back):
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member add controller-2 --peer-urls=https://10.240.0.12:2380 --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Result:
Member 66d450d03498eb5c added to cluster 3e7cc799faffb625
ETCD_NAME="controller-2"
ETCD_INITIAL_CLUSTER="controller-2=https://10.240.0.12:2380,controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.240.0.12:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
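For reference, etcd reads every flag from an environment variable with the ETCD_ prefix as well, so the ETCD_* values printed by member add can be exported on the new member before etcd is started. The following is only a minimal sketch: the temporary env file is an assumption for illustration, the values are copied from the result above, and the TLS and listen flags would still have to be supplied separately.

```shell
#!/bin/sh
# Sketch: export the ETCD_* values printed by `etcdctl member add`.
# The temporary file is hypothetical; the values come from the Step 3 result.

env_file=$(mktemp)
cat > "$env_file" <<'EOF'
ETCD_NAME="controller-2"
ETCD_INITIAL_CLUSTER="controller-2=https://10.240.0.12:2380,controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.240.0.12:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
EOF

# `set -a` marks every variable assigned while sourcing the file for export,
# so a subsequently started etcd process would inherit them.
set -a
. "$env_file"
set +a
rm -f "$env_file"

echo "name=${ETCD_NAME} state=${ETCD_INITIAL_CLUSTER_STATE}"
```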
Step 4 (run member list command):
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Result:
66d450d03498eb5c, unstarted, , https://10.240.0.12:2380,
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379
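A member that has been added but has not yet joined shows up as unstarted with an empty name and no client URL. As a rough sketch, the IDs of such members can be picked out of the member list output with awk; the function name unstarted_members is made up for this example, and the sample input is the result shown above.

```shell
#!/bin/sh
# Sketch: print the IDs of members that etcd reports as "unstarted".
# Fields in `etcdctl member list` output are separated by a comma and a space.

unstarted_members() {
  awk -F', *' '$2 == "unstarted" { print $1 }'
}

# Sample input copied from the Step 4 result.
unstarted_members <<'EOF'
66d450d03498eb5c, unstarted, , https://10.240.0.12:2380,
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379
EOF
```

Against a live cluster, the same function could be fed by piping the member list command from Step 4 into it.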
Step 5 (Run the command to start etcd in controller-2):
isaac@controller-2:~$ sudo etcd --name controller-2 --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 --advertise-client-urls https://10.240.0.12:2379 --listen-peer-urls https://10.240.0.12:2380 --initial-advertise-peer-urls https://10.240.0.12:2380 --initial-cluster-state existing --initial-cluster controller-0=http://10.240.0.10:2380,controller-1=http://10.240.0.11:2380,controller-2=http://10.240.0.12:2380 --ca-file /etc/etcd/ca.pem --cert-file /etc/etcd/kubernetes.pem --key-file /etc/etcd/kubernetes-key.pem
Result:
2019-06-09 13:10:14.958799 I | etcdmain: etcd Version: 3.3.9
2019-06-09 13:10:14.959022 I | etcdmain: Git SHA: fca8add78
2019-06-09 13:10:14.959106 I | etcdmain: Go Version: go1.10.3
2019-06-09 13:10:14.959177 I | etcdmain: Go OS/Arch: linux/amd64
2019-06-09 13:10:14.959237 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2019-06-09 13:10:14.959312 W | etcdmain: no data-dir provided, using default data-dir ./controller-2.etcd
2019-06-09 13:10:14.959435 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2019-06-09 13:10:14.959575 C | etcdmain: cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented
Clearly, the etcd service did not start as expected, so I did some troubleshooting:
isaac@controller-2:~$ sudo systemctl status etcd
Result:
● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Sun 2019-06-09 13:06:55 UTC; 29min ago
     Docs: https://github.com/coreos
  Process: 1876 ExecStart=/usr/local/bin/etcd --name controller-2 --cert-file=/etc/etcd/kubernetes.pem --key-file=/etc/etcd/kubernetes-key.pem --peer-cert-file=/etc/etcd/kubernetes.pem --peer-key-file=/etc/etcd/kube
 Main PID: 1876 (code=exited, status=0/SUCCESS)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped peer f98dc20bce6225a0
Jun 09 13:06:55 controller-2 etcd[1876]: stopping peer ffed16798470cab5...
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (writer)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (writer)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped HTTP pipelining with peer ffed16798470cab5
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (stream MsgApp v2 reader)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (stream Message reader)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped peer ffed16798470cab5
Jun 09 13:06:55 controller-2 etcd[1876]: failed to find member f98dc20bce6225a0 in cluster 3e7cc799faffb625
Jun 09 13:06:55 controller-2 etcd[1876]: forgot to set Type=notify in systemd service file?
I tried starting the etcd member with several different commands, but the etcd member on controller-2 still seems to be stuck in the unstarted state. What is the reason for this? Any pointers would be highly appreciated! Thanks.
It turned out I solved the problem as follows (credit to Matthew):
First, delete the stale etcd data directory on controller-2:
rm -rf /var/lib/etcd/*
Then, to fix the error
cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented
I revised the command to start etcd as follows:
sudo etcd --name controller-2 --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 --advertise-client-urls https://10.240.0.12:2379 --listen-peer-urls https://10.240.0.12:2380 --initial-advertise-peer-urls https://10.240.0.12:2380 --initial-cluster-state existing --initial-cluster controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380,controller-2=https://10.240.0.12:2380 --peer-trusted-ca-file /etc/etcd/ca.pem --cert-file /etc/etcd/kubernetes.pem --key-file /etc/etcd/kubernetes-key.pem --peer-cert-file /etc/etcd/kubernetes.pem --peer-key-file /etc/etcd/kubernetes-key.pem --data-dir /var/lib/etcd
A few points to note here:
--cert-file and --key-file present the certificate and key of controller-2 for client connections, while --peer-cert-file and --peer-key-file present them for the peer listener on port 2380, which is the listener the "KeyFile and CertFile are not presented" error refers to.
--peer-trusted-ca-file is also supplied so that etcd can check whether the x509 certificates presented by controller-0 and controller-1 are signed by a known CA. Without it, the error etcdserver: could not get cluster response from https://10.240.0.11:2380: Get https://10.240.0.11:2380/members: x509: certificate signed by unknown authority may be encountered.
--initial-cluster needs to be in line with the value shown in the systemd unit file.
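One way to keep the --initial-cluster value consistent is to generate it from the controller list rather than typing it by hand, which also avoids mixing http:// and https:// peer URLs. This is only a sketch: the helper name initial_cluster is invented for this example, and it assumes every peer listens on https at port 2380, as in this cluster.

```shell
#!/bin/sh
# Sketch: build the --initial-cluster value from name=ip pairs.
# Assumes all peers use https and port 2380, as in the cluster above.

initial_cluster() (
  # The function body runs in a subshell, so these variables stay local.
  out=""
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first "="
    ip=${pair#*=}      # text after the first "="
    out="${out:+$out,}${name}=https://${ip}:2380"
  done
  printf '%s\n' "$out"
)

initial_cluster controller-0=10.240.0.10 controller-1=10.240.0.11 controller-2=10.240.0.12
```

The printed string can then be compared by eye with the --initial-cluster value in /etc/systemd/system/etcd.service before starting etcd.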