Do not allow two pods of the same application on the same node in Kubernetes

10/7/2019

We have a situation where 2 pods of the same type can end up running on the same node. Sometimes during restarts and rescheduling, 2 pods land on the same node, and when that node itself is rescheduled all of our pods are gone for a while, resulting in connection troubles (we have just 2 pods that are load balanced).

I think the best way to fix it is to not allow 2 pods to run on the same node and to use inter-pod anti-affinity for that.

Is this the correct solution to the problem? I tried to understand it but got a bit bogged down with topologyKey and the syntax. Can someone explain/give an example of how to achieve this?

-- Ilya Chernomordik
kubernetes

4 Answers

10/7/2019

There is one more way out, in case you don't want to use pod anti-affinity as described above (or any such policy, to keep things simple): you can still handle this scenario manually when your node goes down.

Cordon the node, so that Kubernetes no longer considers it for scheduling, and it will then schedule onto the other nodes that are still working.

Once the node is ready again, you can uncordon it.
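A minimal sketch of the commands (the node name node-1 is just a placeholder):

kubectl cordon node-1                      # mark the node unschedulable
kubectl drain node-1 --ignore-daemonsets   # optionally evict the pods already running on it
kubectl uncordon node-1                    # make it schedulable again once it is healthy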

-- Tushar Mahajan
Source: StackOverflow

10/7/2019

Yes, you are right: affinity is your friend here and is the correct solution.

Node affinity will help your app or microservice stick to a particular kind of node (in a multi-node architecture); for example, below my app nginx-ms always sticks to the nodes which have the label role=admin.

The podAntiAffinity rule keys off a node label (the topologyKey) and applies within the group of nodes marked with that label.

If a node already has a pod with the label component=nginx-ms, Kubernetes won't allow another such pod to be scheduled onto it.

Here is the explanation:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: role             # schedule only onto nodes labelled role=admin
              operator: In
              values:
                - admin
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: component        # never co-locate two pods labelled component=nginx-ms
              operator: In
              values:
                - nginx-ms
        topologyKey: "kubernetes.io/1-hostname"

and

kubectl get node --show-labels

NAME                  STATUS   ROLES    AGE   VERSION   LABELS
xx-admin-1      Ready    master   19d   v1.13.4   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/1-hostname=xx-admin-1,node-role.kubernetes.io/master=,role=admin
xx-admin-2      Ready    master   19d   v1.13.4   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/1-hostname=xx-admin-2,node-role.kubernetes.io/master=,role=admin
xx-plat-1-1     Ready    <none>   19d   v1.13.4   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/2-hostname=xx-plat-1-1,role=admin
xx-plat-2-1     Ready    <none>   19d   v1.13.4   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/2-hostname=xx-plat-2-1,role=admin

Explanation of topologyKey: think of it as a node label; with it you can have two different topologies in the same cluster.

example: kubernetes.io/1-hostname and kubernetes.io/2-hostname
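Such a label can be applied manually, for example (the node name is a placeholder):

kubectl label node <node-name> kubernetes.io/1-hostname=<node-name>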

Now when you define podAntiAffinity you set topologyKey: kubernetes.io/1-hostname.

Then your rule applies on all the nodes carrying that topologyKey label, but it does not apply on the nodes labelled with topologyKey kubernetes.io/2-hostname.

Hence, in my example pods are scheduled onto nodes with the label kubernetes.io/1-hostname and the podAntiAffinity rule applies there, but the nodes labelled kubernetes.io/2-hostname don't have the podAntiAffinity rule!
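To tie this back to the question, here is a minimal sketch of a Deployment whose two replicas can never land on the same node; the name nginx-ms is illustrative, and it uses the standard kubernetes.io/hostname label (which every node carries by default) as the topologyKey:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ms
spec:
  replicas: 2
  selector:
    matchLabels:
      component: nginx-ms
  template:
    metadata:
      labels:
        component: nginx-ms          # the anti-affinity rule below matches on this label
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: component
                    operator: In
                    values:
                      - nginx-ms
              topologyKey: "kubernetes.io/hostname"   # at most one such pod per node
      containers:
        - name: nginx
          image: nginx

Note that with requiredDuringSchedulingIgnoredDuringExecution the second replica stays Pending if no other node is available; preferredDuringSchedulingIgnoredDuringExecution is the softer alternative.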

-- garlicFrancium
Source: StackOverflow

10/9/2019

In the end I have decided not to use pod anti-affinity, but to use a simpler Kubernetes mechanism called a Pod Disruption Budget. It basically says that at least X pods have to be running at any given time.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-disruption-budget
spec:
  minAvailable: 1        # at least one pod matching the selector must stay available
  selector:
    matchLabels:
      app: myapp

This will not allow a pod to be evicted before another pod is up and running. It fixes the problem for controlled node downtime, but if a node goes down in an uncontrolled way (hardware failure, etc.) there is no guarantee; hopefully that does not happen too often.
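With this budget in place, a voluntary disruption such as a node drain (which goes through the eviction API) waits until enough myapp pods are available elsewhere, for example:

kubectl drain <nodename> --ignore-daemonsets   # evictions that would violate the budget are blocked and retried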

-- Ilya Chernomordik
Source: StackOverflow

10/8/2019

I think you need to use node affinity and taints for this.

For Node affinity

    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - app

For taints, use kubectl taint nodes <nodename> example-key=example-value:NoSchedule (the taint key must match the toleration key below).

Then add this toleration to the pod spec in your yaml file:

  tolerations:
  - key: "example-key"
    operator: "Exists"
    effect: "NoSchedule"
-- Sachin Arote
Source: StackOverflow