how to recover from master failure with kubeadm

3/25/2018

I set up a Kubernetes cluster with a single master node and two worker nodes using kubeadm, and I am trying to figure out how to recover from node failure.

When a worker node fails, recovery is straightforward: I create a new worker node from scratch, run kubeadm join, and everything's fine.
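For reference, the join looks roughly like this (the endpoint, token, and hash below are placeholders; recent kubeadm versions can print the real command via kubeadm token create --print-join-command on the master):

```
# Placeholder endpoint, token, and CA cert hash -- substitute your own,
# e.g. from `kubeadm token create --print-join-command` on the master.
kubeadm join 10.0.0.10:6443 --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:<hash-of-the-cluster-ca-cert>
```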

However, I cannot figure out how to recover from master node failure (without interrupting the deployments running on the worker nodes). Do I need to back up and restore the original certificates, or can I just run kubeadm init to create a new master from scratch? How do I join the existing worker nodes?

-- fabstab
kubeadm
kubernetes

3 Answers

5/25/2018

I ended up writing a Kubernetes CronJob that backs up the etcd data. If you are interested, I wrote a blog post about it: https://labs.consol.de/kubernetes/2018/05/25/kubeadm-backup.html
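The blog post has the details, but a simplified sketch of such a CronJob looks roughly like this (the schedule, image tag, and backup destination are placeholders; it assumes kubeadm set up etcd with TLS certificates under /etc/kubernetes/pki/etcd, so drop the cert flags if your etcd listens on plain HTTP):

```
# Simplified etcd backup CronJob sketch; schedule, image tag, and the
# hostPath backup destination are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/master: ""
          tolerations:
          - key: node-role.kubernetes.io/master
            operator: Exists
            effect: NoSchedule
          containers:
          - name: backup
            # Use the etcd image/version your cluster actually runs.
            image: k8s.gcr.io/etcd-amd64:3.1.12
            command:
            - /bin/sh
            - -c
            - >
              ETCDCTL_API=3 etcdctl snapshot save
              /backup/etcd-snapshot-$(date +%Y-%m-%d).db
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/server.crt
              --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/backups/etcd
EOF
```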

In addition, you may want to back up all of /etc/kubernetes/pki to avoid issues with secrets (tokens) having to be renewed.

For example, kube-proxy uses a secret to store a token, and this token becomes invalid if only the etcd data is backed up, because the restored secrets no longer match freshly generated certificates and keys.
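Concretely, the whole PKI directory can be archived alongside the etcd snapshot (the destination path below is just a placeholder):

```
# Archive the kubeadm PKI directory (cluster CA, API server certs,
# service-account signing keys, ...); destination path is a placeholder.
sudo tar czf /var/backups/kubernetes-pki-$(date +%Y-%m-%d).tar.gz \
    -C /etc/kubernetes pki
```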

-- fabstab
Source: StackOverflow

3/27/2018

Regarding the master's backup: if you mean traditional/legacy backup procedures (tools and techniques), these are not mentioned directly in the official documentation (as far as I know), but you can take precautions through some options/workarounds:

-- EngSabry
Source: StackOverflow

3/25/2018

kubeadm init will definitely not work out of the box, as it creates a new cluster altogether: new credentials, new IP space, etc.

At a minimum, restoring the master node will require a backup of your etcd data, which typically lives in the /var/lib/etcd directory.
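On a replacement master, such a backup could then be restored roughly like this (a sketch that assumes a single-member etcd, a snapshot taken with etcdctl snapshot save, and that /var/lib/etcd does not exist yet):

```
# Restore an etcd v3 snapshot into the data dir kubeadm expects.
# Assumes a single-member etcd; /var/lib/etcd must be empty or absent,
# and the snapshot path is a placeholder.
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
    --data-dir /var/lib/etcd
```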

You will also need the kubeadm config from the cluster; kubeadm config view should output this (v1.8 and later).
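For example, you could save it to a file so it can later be fed back to kubeadm init --config when rebuilding the master (a sketch; the filename is arbitrary):

```
# Dump the live cluster configuration for later re-use
# (`kubeadm config view` is available from v1.8 onward).
kubeadm config view > kubeadm-config.yaml
```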

The step-by-step process for restoring a master node really isn't clean-cut, which is why HA (High Availability) was introduced. HA is a much safer way of maintaining redundancy and uptime, particularly because restoring anything from etcd can be a real pain (in my humble opinion and experience).

If I may go a bit off topic from your question: if you are still getting started with Kubernetes and not deeply invested in kubeadm, I would suggest you consider creating your cluster with kops instead. It already supports HA, and I found kops to be more robust and easier to use than either kubeadm or kube-aws (the CoreOS cluster builder). https://kubernetes.io/docs/getting-started-guides/kops/

-- nelsonenzo
Source: StackOverflow