I set up a Kubernetes cluster with a single master node and two worker nodes using kubeadm, and I am trying to figure out how to recover from node failure.
When a worker node fails, recovery is straightforward: I create a new worker node from scratch, run kubeadm join, and everything's fine.
However, I cannot figure out how to recover from master node failure (without interrupting the deployments running on the worker nodes). Do I need to back up and restore the original certificates, or can I just run kubeadm init to create a new master from scratch? How do I join the existing worker nodes?
I ended up writing a Kubernetes CronJob that backs up the etcd data. If you are interested, I wrote a blog post about it: https://labs.consol.de/kubernetes/2018/05/25/kubeadm-backup.html
In addition to that, you may want to back up all of /etc/kubernetes/pki to avoid issues with secrets (tokens) having to be renewed.
For example, kube-proxy uses a secret to store a token and this token becomes invalid if only the etcd certificate is backed up.
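A minimal sketch of such a certificate backup, assuming it runs as root on the master. The function name backup_dir and the destination /var/backups/kubernetes are illustrative choices, not something kubeadm provides:

```shell
#!/usr/bin/env bash
# Sketch: archive a directory (by default /etc/kubernetes/pki) into a
# timestamped tarball. backup_dir and the default destination are
# hypothetical names chosen for this example.
backup_dir() {
  local src="${1:-/etc/kubernetes/pki}"
  local dest="${2:-/var/backups/kubernetes}"
  local archive
  archive="${dest}/$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"

  mkdir -p "$dest" || return 1
  # -C makes the paths inside the archive relative to the parent of $src,
  # so the tarball contains "pki/..." rather than absolute paths.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1

  # Print the archive path so callers can copy it off the node.
  echo "$archive"
}
```

After sourcing the file, calling backup_dir with no arguments archives /etc/kubernetes/pki. The archive contains private keys, so it should be stored somewhere off the master and with restricted access.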
Regarding backing up the master: as far as I know, traditional backup procedures (legacy backup tools and techniques) are not covered directly in the official documentation, but you can take precautions with the following options and workarounds:
- Set up HA masters (only for GCE): Set up High-Availability Kubernetes Masters
- Set up an HA etcd cluster / master load balancer: Setting-up-an-ha-etcd-cluster; Set up master Load Balancer; Operating etcd clusters for Kubernetes
- OS file system snapshot/backup
kubeadm init will definitely not work out of the box, as that will create a new cluster altogether, with new credentials, IP space, etc.
At a minimum, restoring the master node will require a backup of your etcd data. This typically lives in the /var/lib/etcd directory.
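One way to get a consistent copy of that data is an etcd v3 snapshot rather than a raw copy of the directory. The sketch below makes several assumptions: etcdctl is installed on the master, etcd listens on https://127.0.0.1:2379, and the client certificates live under /etc/kubernetes/pki/etcd. Older setups may run etcd without TLS, in which case the certificate flags do not apply. The function name snapshot_etcd and the DRY_RUN and ETCD_PKI_DIR variables are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: take an etcd v3 snapshot on a kubeadm master.
# Endpoint and certificate paths are assumptions; adjust them to your cluster.
snapshot_etcd() {
  local out="${1:-/var/backups/etcd-snapshot.db}"
  local pki="${ETCD_PKI_DIR:-/etc/kubernetes/pki/etcd}"

  # Build the etcdctl command line in the positional parameters.
  set -- etcdctl snapshot save "$out" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="$pki/ca.crt" \
    --cert="$pki/server.crt" \
    --key="$pki/server.key"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command, so it can be reviewed before running it.
    echo "ETCDCTL_API=3 $*"
  else
    ETCDCTL_API=3 "$@"
  fi
}
```

Setting DRY_RUN=1 before calling snapshot_etcd prints the command instead of executing it, which is a cheap way to check the paths first. A snapshot like this is restored with etcdctl snapshot restore, which is a separate procedure not shown here.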
You will also need the kubeadm config from the cluster; kubeadm config view should output this (v1.8 and later).
The step-by-step procedure for restoring a master node really isn't so clear-cut, which is why HA (High Availability) was introduced. This is a much safer way of maintaining redundancy and uptime, particularly because restoring anything from etcd can be a real pain (in my humble opinion and experience).
If I may go a bit off topic from your question: if you are still getting started with Kubernetes and not deeply invested in kubeadm, I would suggest you consider creating your cluster with kops instead. It already supports HA, and I found kops to be more robust and easier to use than either kubeadm or kube-aws (the CoreOS cluster builder). https://kubernetes.io/docs/getting-started-guides/kops/