k8s high availability configuration edge cases for prod

11/21/2020

We have an app in production which needs to be highly available (100%), so we did the following:

  1. We configured 3 instances for HA, but then the node died.
  2. We configured anti-affinity (to run on different nodes), but an update was rolled out to the nodes and our pods were evicted, so we were unavailable for some minutes.
  3. Now we are considering adding a Pod Disruption Budget: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/

My questions are:

  1. How does affinity work together with a Pod Disruption Budget? Could there be any collision, or are these redundant configs?
  2. Is there any other configuration which I need to add to make sure that my pods always run (as much as possible)?
-- Rayn D
amazon-web-services
azure
kubernetes

1 Answer

11/21/2020

How does affinity work together with a Pod Disruption Budget? Could there be any collision, or are these redundant configs?

Affinity and anti-affinity are about where your Pods are scheduled, e.g. so that two replicas of the same app are not scheduled to the same node. Pod Disruption Budgets are about keeping your app available during voluntary disruptions, e.g. node maintenance, by limiting how many Pods may be evicted at the same time. Both improve the availability of your app, but they are not related to each other and they are not redundant.
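As a rough illustration of how the two settings sit side by side, here is a minimal sketch of a Deployment with pod anti-affinity plus a separate PodDisruptionBudget. The names (my-app, my-app-pdb), the image and the value minAvailable: 2 are placeholders assumed for the example and do not come from the question. The sketch also assumes a cluster recent enough to serve policy/v1; older clusters expose the same resource as policy/v1beta1.

```yaml
# Hypothetical Deployment: 3 replicas, each forced onto a different node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # Scheduling rule: never place two Pods labelled app=my-app
          # on the same node (topology domain = hostname).
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: my-app
          image: example.com/my-app:1.0   # placeholder image
---
# Hypothetical PodDisruptionBudget: during voluntary disruptions such as
# a node drain, evictions are refused if they would leave fewer than
# 2 matching Pods available.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # placeholder name
spec:
  minAvailable: 2             # assumed value for 3 replicas
  selector:
    matchLabels:
      app: my-app
```

The anti-affinity rule only influences placement at scheduling time, while the budget only limits evictions during voluntary disruptions, which is why the two do not overlap. Note that a budget does not protect against involuntary failures such as a node crash.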

Is there any other configuration which I need to add to make sure that my pods run always (as much as possible)

Things will fail. What you need to do is embrace distributed systems and make every workload a distributed system, e.g. with multiple instances to remove any single point of failure. This is done differently for stateless workloads (e.g. Deployment) and stateful workloads (e.g. StatefulSet). What's important for you is that your app is available as much as possible, while individual instances (e.g. Pods) can fail with almost no user noticing it.

We configured 3 instances for HA, but then the node died.

Things will always fail. E.g. a physical node may crash. You need to design your app so that it can tolerate some failures.

If you use a cloud provider, you should use regional clusters that span three independent Availability Zones, and you need to spread your workload so that it runs in more than one Availability Zone. That way, your app can tolerate a whole Availability Zone going down without affecting your users.
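One possible way to express the zone spreading described above is a topology spread constraint on the Pod template. This is only a sketch: the name and image are placeholders, and it assumes the nodes carry the standard topology.kubernetes.io/zone label (managed clusters on the large cloud providers normally set it) and that the cluster version supports topologySpreadConstraints.

```yaml
# Hypothetical Deployment whose replicas are spread evenly across zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        # Allow the Pod count per zone to differ by at most 1.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          # Keep a Pod pending rather than violate the spread;
          # ScheduleAnyway would make this a soft preference instead.
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: example.com/my-app:1.0   # placeholder image
```

With three replicas and three zones this places one Pod per zone, so losing a single zone leaves two replicas running.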

-- Jonas
Source: StackOverflow