Kubernetes pod distribution amongst nodes with preferred mode

11/5/2019

I am working on migrating my applications to Kubernetes. I am using EKS.

I want to distribute my pods across different nodes, to avoid having a single point of failure. I have read about pod affinity and anti-affinity, and about the required and preferred modes.

This answer gives a very nice way to accomplish this.

But here is my doubt: let's say I have 3 nodes, of which 2 are already full (resource-wise). If I use requiredDuringSchedulingIgnoredDuringExecution, k8s will spin up new nodes and distribute the pods across them. But if I use preferredDuringSchedulingIgnoredDuringExecution, it will check for preferred nodes and, not finding any other node with room, will deploy all the pods on the third node only. In that case, it again becomes a single point of failure.
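For reference, this is roughly the preferred-mode anti-affinity I am talking about (a minimal sketch; the app: my-app label, names, and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: prefer not to co-locate replicas on one node,
          # but schedule anyway if no other node fits
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: my-app
              topologyKey: kubernetes.io/hostname
      containers:
      - name: my-app
        image: my-app:latest   # placeholder image
```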

How do I handle this situation?

One way I can think of is to have an over-provisioned cluster, so that there are always some extra nodes.

The second way: I am not sure how to do this, but I think there should be a way to use both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution together.
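If I understand the API correctly, something like this seems syntactically possible: a hard rule on one topology key combined with a soft rule on another (again assuming the hypothetical app: my-app label). I am not sure if this is the right approach:

```yaml
affinity:
  podAntiAffinity:
    # Hard rule: never put two replicas in the same availability zone
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-app
      topologyKey: failure-domain.beta.kubernetes.io/zone
    # Soft rule: additionally prefer spreading across individual nodes
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname
```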

Can anyone help me with this? Am I missing something? How do people work with this condition?

I am new to Kubernetes, so feel free to correct me if I am wrong or missing something.

Thanks in advance

Note:

I don't have a problem with a few similar pods running on the same node; I just don't want all the pods to end up on the same node simply because only one node was available to deploy to.

-- kadamb
amazon-eks
aws-eks
kubernetes
kubernetes-pod

1 Answer

11/8/2019

I see you are trying to make sure that k8s will never schedule all pod replicas on the same node.

It's not possible to create a hard requirement like this for the Kubernetes scheduler.

The scheduler will try its best to schedule your application as evenly as possible, but in a situation where you have 2 nodes without spare resources and 1 node with room for all the pod replicas, k8s can take one of the following actions (depending on configuration):

  1. schedule all your pods on the one node that has room (best effort / default)
  2. run one pod and leave the rest of the pods unscheduled (anti-affinity + requiredDuringSchedulingIgnoredDuringExecution)
  3. create new nodes for the pods if needed (anti-affinity + requiredDuringSchedulingIgnoredDuringExecution + cluster autoscaler)
  4. start evicting pods from nodes to free up resources for high-priority pods (priority-based preemption) and reschedule the preempted pods if possible; see the PriorityClass sketch below
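For option 4, a minimal sketch of a PriorityClass (the name and value here are just examples, not anything your cluster ships with):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority        # hypothetical name
value: 1000000               # higher value = higher scheduling priority
globalDefault: false
description: "Critical pods that may preempt lower-priority workloads."
```

A pod then opts in by setting priorityClassName: high-priority in its spec.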

Also read this article to get a better understanding of how the scheduler makes its decisions.

You can also use a PodDisruptionBudget to tell Kubernetes that a specified number (or percentage) of replicas should stay available. Remember, though, that although:

A disruption budget does not truly guarantee that the specified number/percentage of pods will always be up.

Kubernetes will take it into consideration when deciding whether pods can be evicted (voluntary disruptions).
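A minimal sketch of such a budget (the name, the minAvailable value, and the selector are examples; adjust them to your app's labels):

```yaml
apiVersion: policy/v1beta1   # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb           # hypothetical name
spec:
  minAvailable: 2            # keep at least 2 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: my-app
```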

-- HelloWorld
Source: StackOverflow