So I have 4 nodes: one is System, one is Dev, one is QA, and one is UAT.
My affinity configuration is as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth
  namespace: dev
  labels:
    app: auth
    environment: dev
    app-role: api
    tier: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: auth
  template:
    metadata:
      labels:
        app: auth
        environment: dev
        app-role: api
        tier: backend
      annotations:
        build: _{Tag}_
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - auth
              topologyKey: kubernetes.io/hostname
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: environment
                    operator: In
                    values:
                      - dev
      containers:
        - name: companyauth
          image: company.azurecr.io/auth:_{Tag}_
          imagePullPolicy: Always
          env:
            - name: ConnectionStrings__DevAuth
              value: dev
          ports:
            - containerPort: 80
      imagePullSecrets:
        - name: ips

My intention is to make sure that on my production cluster, which has 3 nodes in 3 different availability zones, all the pods are scheduled on different nodes/availability zones. However, it appears that if I already have pods scheduled on a node, then when I do a deployment it will not overwrite the pods that already exist.
0/4 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't match node selector.
However, if I remove the podAntiAffinity, it works fine and will replace the pod on the current node with the new pod from the deployment. What is the correct way to ensure that my deployment on my production cluster always has a pod scheduled on a different node in a different availability zone, while still being able to update the existing pods?
Your node affinity rule mandates that only the Dev node will be considered for scheduling. In combination with your podAntiAffinity rule, this means only one Pod can be scheduled (the one on the Dev node).
To get even scheduling across nodes, you will have to add additional Dev nodes or remove the nodeAffinity rule.
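If you would rather keep a bias toward the Dev node than drop the rule entirely, a minimal sketch of the nodeAffinity rewritten as a soft preference could look like the following (the weight of 1 is an arbitrary illustrative value):

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: environment
              operator: In
              values:
                - dev

With a preferred rule the scheduler can still fall back to other nodes when the Dev node cannot accept the pod.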
Your goal can be achieved using only PodAntiAffinity.
I have tested this with my GKE test cluster, but it should work similarly on Azure.
In your current setup, you have combined podAntiAffinity with nodeAffinity.
Pod anti-affinity can prevent the scheduler from placing a new pod on the same node as existing pods when the label selector on the new pod matches the labels on those pods.
In your Deployment setup, new pods will have labels like:
app: auth
environment: dev
app-role: api
tier: backend

PodAntiAffinity was configured to not allow deploying a new pod if there is already a pod with the label app: auth.
NodeAffinity was configured to deploy only on the node with label environment: dev.
To sum up, your error:

0/4 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't match node selector.

1 node(s) didn't match pod affinity/anti-affinity

Your setup allows deploying only on the node with label environment: dev, and only one pod with label app: auth. As you mention:

if I already have pods scheduled on a node, then when I do a deployment it will not overwrite the pods that already exist.

PodAntiAffinity behavior worked and didn't allow a new pod with label app: auth to be deployed, as there was already one.

3 node(s) didn't match node selector.

NodeAffinity allows pods to be deployed only on the node with label environment: dev. The other nodes probably have labels like environment: system, environment: uat, environment: qa, which don't match the environment: dev label and thus didn't match the node selector.
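You can verify which value of the environment label each node actually carries (the label values above are assumptions based on your description); the -L flag prints the label as an extra column:

$ kubectl get nodes -L environment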
The easiest way is to remove NodeAffinity. As long as topologyKey is set to kubernetes.io/hostname in PodAntiAffinity, it's enough.
The topologyKey uses the default label attached to a node to dynamically filter on the name of the node.
For more information, please check this article.
If you describe your nodes and grep for kubernetes.io/hostname, you will get a unique value for each node:
$ kubectl describe node | grep kubernetes.io/hostname
kubernetes.io/hostname=gke-affinity-default-pool-27d6eabd-vhss
kubernetes.io/hostname=gke-affinity-default-pool-5014ecf7-5tkh
kubernetes.io/hostname=gke-affinity-default-pool-c2afcc97-clg9

Below is an example Deployment with the NodeAffinity removed:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth
  labels:
    app: auth
    environment: dev
    app-role: api
    tier: backend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: auth
  template:
    metadata:
      labels:
        app: auth
        environment: dev
        app-role: api
        tier: backend
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - auth
              topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx
          imagePullPolicy: Always
          ports:
            - containerPort: 80

After deploying this YAML:
$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
auth-7fccf5f7b8-4dkc4 1/1 Running 0 9s 10.0.1.9 gke-affinity-default-pool-c2afcc97-clg9 <none> <none>
auth-7fccf5f7b8-5qgt4 1/1 Running 0 8s 10.0.2.6 gke-affinity-default-pool-5014ecf7-5tkh <none> <none>
auth-7fccf5f7b8-bdmtw 1/1 Running 0 8s 10.0.0.9 gke-affinity-default-pool-27d6eabd-vhss <none> <none>

If you were to increase replicas to 7, no more pods would be deployed. All new pods would be stuck in the Pending state, as podAntiAffinity worked (each node already has a pod with label app: auth).
$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
auth-7fccf5f7b8-4299k 0/1 Pending 0 79s <none> <none> <none> <none>
auth-7fccf5f7b8-4dkc4 1/1 Running 0 2m1s 10.0.1.9 gke-affinity-default-pool-c2afcc97-clg9 <none> <none>
auth-7fccf5f7b8-556h5 0/1 Pending 0 78s <none> <none> <none> <none>
auth-7fccf5f7b8-5qgt4 1/1 Running 0 2m 10.0.2.6 gke-affinity-default-pool-5014ecf7-5tkh <none> <none>
auth-7fccf5f7b8-bdmtw 1/1 Running 0 2m 10.0.0.9 gke-affinity-default-pool-27d6eabd-vhss <none> <none>
auth-7fccf5f7b8-q4s2c 0/1 Pending 0 79s <none> <none> <none> <none>
auth-7fccf5f7b8-twb9j 0/1 Pending 0 79s <none> <none> <none> <none>

A similar solution is described in the High-Availability Deployment of Pods on Multi-Zone Worker Nodes blog post.
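To cover the availability-zone part of the original question: on the production cluster the same podAntiAffinity rule can use the zone label as the topologyKey instead of the hostname, so replicas are forced into different zones rather than just different nodes. A minimal sketch, assuming the nodes carry the standard topology.kubernetes.io/zone label (older clusters may use failure-domain.beta.kubernetes.io/zone instead):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - auth
        topologyKey: topology.kubernetes.io/zone

With 3 replicas and 3 zones this places one pod per zone; as shown above, any replicas beyond the number of zones will stay Pending unless the rule is switched to preferredDuringSchedulingIgnoredDuringExecution.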