We have a situation where 2 pods of the same type can run on the same node. Sometimes, during restarts and rescheduling, 2 pods end up on the same node, and when that node itself is rescheduled all of our pods are gone for a while, resulting in connection troubles (we have just 2 pods that are load balanced).
I think the best way to fix this is to not allow 2 pods of the same type to run on the same node, and to use inter-pod anti-affinity for that.
Is this the correct solution to the problem? I tried to understand it but got a bit bogged down with topologyKey and the syntax. Can someone explain/give an example of how to achieve this?
There is one more way out, in case you don't want to use pod anti-affinity as described above (or any such scheduling policy, to keep things simple): you can still handle this scenario at run time when a node goes down.
Cordon the node so that it is no longer considered by Kubernetes for scheduling; new pods will then be scheduled onto the other nodes that are still working.
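A minimal sketch of that approach; the node name worker-node-1 is illustrative, so substitute your own:
# Mark the node unschedulable so no new pods land on it
kubectl cordon worker-node-1
# Optionally evict the pods already running on it as well
kubectl drain worker-node-1 --ignore-daemonsets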
Yes, you are right: affinity is your friend here and is the correct solution.
Node affinity will make your app or microservice stick to a particular kind of node (in a multi-node architecture); in the example below, my app nginx-ms always sticks to the nodes which have the label role=admin.
The pod anti-affinity rule works against node labels (the topologyKey) and applies within the group of nodes marked with that topologyKey: if a node in that group already has a pod with the label component=nginx-ms, Kubernetes won't allow another such pod to be scheduled onto it.
Here is the explanation:
affinity:
  # Schedule only onto nodes labelled role=admin
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: role
          operator: In
          values:
          - admin
  # Never co-locate two pods labelled component=nginx-ms within
  # the same topology domain (kubernetes.io/1-hostname)
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: component
          operator: In
          values:
          - nginx-ms
      topologyKey: "kubernetes.io/1-hostname"
and here are my cluster's nodes and their labels:
kubectl get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
xx-admin-1 Ready master 19d v1.13.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/1-hostname=xx-admin-1,node-role.kubernetes.io/master=,role=admin
xx-admin-2 Ready master 19d v1.13.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/1-hostname=xx-admin-2,node-role.kubernetes.io/master=,role=admin
xx-plat-1-1 Ready <none> 19d v1.13.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/2-hostname=xx-plat-1-1,role=admin
xx-plat-2-1 Ready <none> 19d v1.13.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/2-hostname=xx-plat-2-1,role=admin
Explanation of topologyKey: think of it as a node label key; with it you can have two different topologies in the same cluster, for example kubernetes.io/1-hostname and kubernetes.io/2-hostname.
Now, when you define podAntiAffinity you set topologyKey: kubernetes.io/1-hostname. Your rule then applies across all the nodes that carry that topologyKey, but it does not apply to nodes labelled with topologyKey kubernetes.io/2-hostname.
Hence, in my example, pods are scheduled onto the nodes labelled kubernetes.io/1-hostname with the podAntiAffinity rule enforced, while the nodes labelled kubernetes.io/2-hostname do not have the podAntiAffinity rule applied.
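For the original two-pod question, a minimal sketch of the anti-affinity block using the standard kubernetes.io/hostname node label; the app: myapp pod label is an assumption, so adjust it to whatever labels your Deployment's pods carry:
# Pod spec fragment (place under spec.template.spec in the Deployment)
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - myapp
      # Every node is its own topology domain, so no two
      # app=myapp pods can share a node
      topologyKey: "kubernetes.io/hostname"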
I have decided in the end not to use pod anti-affinity, but to use a rather simpler Kubernetes mechanism called a Pod Disruption Budget. It basically says that at least X pods have to be running at any given time.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-disruption-budget
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: myapp
This will not allow a pod to be evicted before another pod is up and running. It fixes the problem for controlled node downtime (drains, upgrades, etc.), but if a node goes down in an uncontrolled way (hardware failure, etc.) there is no guarantee; hopefully that does not happen too often.
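To check that the budget is in place and that it is honoured during a controlled drain, something like the following should work; the node name is illustrative:
# Show how many voluntary disruptions the budget currently allows
kubectl get pdb myapp-disruption-budget
# A drain will not evict the last available pod until a
# replacement is up and running on another node
kubectl drain worker-node-1 --ignore-daemonsets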
I think you need to use node affinity and taints for this.
For node affinity:
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: role
        operator: In
        values:
        - app
For taints, use kubectl taint nodes <node-name> example-key=example-value:NoSchedule (the value does not matter here, because the toleration below uses operator: Exists).
Then add this toleration to your pod spec in the YAML file:
tolerations:
- key: "example-key"
  operator: "Exists"
  effect: "NoSchedule"