I have a multizone (3 zones) GKE cluster (1.10.7-gke.1) with 6 nodes, and I want each zone to have at least one replica of my application.
So I've tried a preferred podAntiAffinity:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values:
            - app
        topologyKey: failure-domain.beta.kubernetes.io/zone
Everything looks good the first time I install my application (scaling from 1 to 3 replicas). After the next rolling update everything gets mixed up and I can end up with 3 copies of my application in one zone, since the additional replicas are created before the old ones are terminated.
When I try the same term with requiredDuringSchedulingIgnoredDuringExecution, everything looks good, but rolling updates don't work because the new replicas can't be scheduled (pods with "component" = "app" already exist in each zone).
How can I configure my deployment to be sure there is a replica in each availability zone?
UPDATED:
My workaround for now is to use hard anti-affinity and deny additional pods (more than 3) during the rolling update:
replicaCount: 3

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: component
          operator: In
          values:
          - app
      topologyKey: failure-domain.beta.kubernetes.io/zone

deploymentStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1
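For reference, these are Helm-style values; assuming the chart passes replicaCount, affinity, and deploymentStrategy straight through, the rendered Deployment would look roughly like the sketch below (the name and image are placeholders, the component: app labels are the ones used above):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                        # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0                  # no extra pods are created during the update
      maxUnavailable: 1            # one old pod is removed first, freeing its zone
  selector:
    matchLabels:
      component: app
  template:
    metadata:
      labels:
        component: app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: component
                operator: In
                values:
                - app
            topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - name: app
        image: my-app:1.0          # placeholder image

The trade-off is that one replica is unavailable at a time during the rollout, since an old pod has to free its zone before its replacement can be scheduled.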
If you have two nodes in each zone, you can use the affinity rules below to make sure rolling updates work and you still have a pod in each zone: the hard per-node rule still leaves a free node for the surged pod during the update, while the soft zone rule spreads the replicas across zones.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: component
          operator: In
          values:
          - app
      topologyKey: "kubernetes.io/hostname"
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: component
            operator: In
            values:
            - app
        topologyKey: failure-domain.beta.kubernetes.io/zone
The key issue here is the rolling update: during a rolling update, the old replica is kept until the new one is launched, but the new one can't be scheduled/launched because it conflicts with its old replica.
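For context, the default RollingUpdate settings surge extra pods while the old ones are still running, which is what produces this conflict; spelled out explicitly (these are the documented defaults, shown only for illustration):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%          # extra pods allowed above the desired replica count
    maxUnavailable: 25%    # pods that may be unavailable at the same time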
So if rolling updates aren't a concern, a workaround here is to change the strategy type to Recreate:
apiVersion: apps/v1
kind: Deployment
...
spec:
  ...
  strategy:
    type: Recreate
  ...
Then applying podAntiAffinity/requiredDuringSchedulingIgnoredDuringExecution rules would work.
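A compact sketch combining the two pieces, using the component: app labels from the question and eliding everything else in the same way as the snippet above:

apiVersion: apps/v1
kind: Deployment
...
spec:
  replicas: 3
  strategy:
    type: Recreate                 # all old pods are removed before new ones start
  ...
  template:
    metadata:
      labels:
        component: app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: component
                operator: In
                values:
                - app
            topologyKey: failure-domain.beta.kubernetes.io/zone
      ...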
I don't think the Kubernetes scheduler provides a way to guarantee pods in all availability zones. I believe it's a best-effort approach when it comes to that, and there may be some limitations.
I've opened an issue to check whether this can be supported either through NodeAffinity or PodAffinity/PodAntiAffinity.