Using podAntiAffinity rules to ensure pods run on different preemptible nodes

11/5/2019

I have a 3-node cluster running on GKE. All the nodes are preemptible, meaning they can be killed at any time and generally do not live longer than 24 hours. When a node is killed, the autoscaler spins up a new node to replace it, which usually takes a minute or so.

In my cluster I have a deployment with replicas set to 3. My intention is that the pods will be spread across the nodes so that my application keeps running as long as at least one node in the cluster is alive.

I've used the following affinity configuration so that pods prefer to run on hosts different from those already running pods for the deployment:

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - my-app
          topologyKey: kubernetes.io/hostname
        weight: 100

When I scale my application up from 0 this seems to work as intended, but in practice the following happens:

  1. Let's say pods A, B and C belonging to the my-app ReplicaSet are running on nodes 1, 2 and 3 respectively, so the state is:
  1 -> A
  2 -> B
  3 -> C
  2. Node 3 is killed, taking pod C with it and leaving 2 running pods in the ReplicaSet.
  3. The scheduler immediately starts to schedule a new pod to bring the ReplicaSet back up to 3.
  4. It looks for a node without any my-app pods. As the autoscaler is still in the process of starting a replacement node (4), only nodes 1 and 2 are available.
  5. It schedules the new pod D on node 1.
  6. Node 4 eventually comes online, but as my-app already has all its pods scheduled, none of them run on it. The resulting state is:
  1 -> A, D
  2 -> B
  4 -> -

This is not the ideal arrangement. The problem arises because there is a delay in creating the new node, and the scheduler is not aware that one will be available very soon.

Is there a better configuration that can ensure the pods are always distributed across the nodes? I was thinking a directive like preferredDuringSchedulingPreferredDuringExecution might do it, but that doesn't exist.

-- harryg
google-kubernetes-engine
kubernetes
preemption
replicaset

1 Answer

11/5/2019

preferredDuringSchedulingIgnoredDuringExecution means it is a preference, not a hard requirement, which explains the 1 -> A, D state.

I believe you are searching for requiredDuringSchedulingIgnoredDuringExecution in conjunction with anti-affinity so that your workloads are distributed as a hard rule.
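
For reference, a minimal sketch of what the hard rule could look like for this deployment, using the same app: my-app label as in the question (note that the required form takes a list of terms directly, with no weight):

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # Do not schedule this pod onto a node (topologyKey: hostname)
      # that already runs a pod labelled app=my-app.
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - my-app
        topologyKey: kubernetes.io/hostname

With the hard requirement, the replacement pod should stay Pending until the new node joins instead of doubling up on node 1, at the cost of running with only 2 replicas during that window.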

Please have a look at this GitHub page for more details and examples.

-- dany L
Source: StackOverflow