I have a 3-node cluster running on GKE. All of the nodes are pre-emptible, meaning they can be killed at any time and generally do not live longer than 24 hours. When a node is killed, the autoscaler spins up a replacement node, which usually takes a minute or so.
In my cluster I have a deployment with its replicas set to 3. My intention is that the pods will be spread across the nodes so that my application keeps running as long as at least one node in the cluster is alive.
I've used the following affinity configuration so that pods prefer running on hosts different from the ones already running pods for this deployment:
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - my-app
          topologyKey: kubernetes.io/hostname
        weight: 100
When I scale my application up from 0 this seems to work as intended. But in practice the following happens:

my-app replicaset pods A, B and C are running on nodes 1, 2 and 3 respectively, so the state is:

1 -> A
2 -> B
3 -> C

Node 3 is killed, taking my-app pod C with it, and the replicaset creates a replacement pod, D. As the autoscaler is still in the process of starting a replacement node (4), only nodes 1 and 2 are available, so the scheduler places D on node 1.

Node 4 eventually comes online, but because my-app already has all of its pods scheduled, no pods run on it. The resultant state is:

1 -> A, D
2 -> B
4 -> -
This is not the ideal configuration. The problem arises because there is a delay while the new node is created, and the scheduler is not aware that it will become available very soon.
Is there a better configuration that can ensure the pods are always distributed across the nodes? I was thinking a directive like preferredDuringSchedulingPreferredDuringExecution might do it, but that doesn't exist.
preferredDuringSchedulingIgnoredDuringExecution means it is a preference, not a hard requirement, which explains why D ends up alongside A (the 1 -> A, D state).
I believe you are looking for requiredDuringSchedulingIgnoredDuringExecution in conjunction with the anti-affinity rule, so that the pods are required to be spread across distinct nodes rather than merely preferring it.
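As a rough sketch, reusing the app: my-app label and topologyKey from your snippet, the hard-requirement version would look something like this (note that the required form takes a plain list of pod affinity terms, with no weight):

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # Do not schedule this pod onto a node that already runs a pod
      # with the label app=my-app.
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - my-app
        topologyKey: kubernetes.io/hostname

The trade-off is that in your scenario the replacement pod D would stay Pending until node 4 joins the cluster, since no existing node satisfies the hard requirement, instead of doubling up on node 1.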
Please have a look at this GitHub for more details and examples.