Spread specific number of deployment pods per node

6/13/2021

I have an EKS node group with 2 nodes for compute workloads. These nodes carry a taint, and the deployment has matching tolerations. The deployment has 2 replicas, and I want the two pods spread across the two nodes, one pod per node.
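The taint/toleration pairing looks something like this (the key and value are illustrative, not my exact names):

kubectl taint nodes <node-name> type=compute:NoSchedule

and in the deployment's pod template:

tolerations:
  - key: "type"
    operator: "Equal"
    value: "compute"
    effect: "NoSchedule"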

I tried using:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - appname
          topologyKey: kubernetes.io/hostname  # required field in podAffinityTerm

Each pod lands on its own node, but when I update the deployment, for example by changing its image name, the new pod fails to schedule.

I also tried:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        type: compute

but the pods aren't spread evenly; for example, both end up on the same node.

-- Icarus
amazon-eks
kubernetes

3 Answers

9/1/2021

I was having the same problem, with pods failing to schedule and getting stuck in the Pending state during rollouts of new versions. My goal was to run exactly 3 pods at all times, one on each of the 3 available nodes.

That meant I could not use maxUnavailable: 1, because it would temporarily result in fewer than 3 pods during the rollout.

Instead of matching the anti-affinity on the app name label, I ended up using a label ("deploymentVersion") with a value that changes on every deployment. This means new deployments will happily schedule pods onto nodes where a previous version is still running, but the pods of the new version will always be spread evenly.

Something like this:

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        deploymentVersion: v1
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: deploymentVersion
                    operator: In
                    values:
                      - v1
              topologyKey: "kubernetes.io/hostname"

v1 can be anything that is a valid label value, as long as it changes on every deployment attempt.

I'm using envsubst to substitute dynamic values into the YAML files:

DEPLOYMENT_VERSION=$(date +%s) envsubst < deploy.yaml | kubectl apply -f -

And then the config looks like this:

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        deploymentVersion: v${DEPLOYMENT_VERSION}
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: deploymentVersion
                    operator: In
                    values:
                      - v${DEPLOYMENT_VERSION}
              topologyKey: "kubernetes.io/hostname"

I wish Kubernetes offered a more straightforward way to achieve this.
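Since Kubernetes 1.27, topologySpreadConstraints supports matchLabelKeys in beta, which can key the spreading on the Deployment-managed pod-template-hash label; that label changes with every template revision, so it gives the same per-version spreading without any templating. A sketch, assuming a new enough cluster and an app pod label:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: appname          # assumed pod label
    matchLabelKeys:
      - pod-template-hash     # restricts spreading to pods of the same revision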

-- salomvary
Source: StackOverflow

6/13/2021

You can use a DaemonSet instead of a Deployment. A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.

See the documentation for DaemonSet.
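A minimal sketch for this case, assuming the compute nodes carry a type=compute label and a matching taint (both assumptions, the question does not spell them out):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: appname
spec:
  selector:
    matchLabels:
      app: appname
  template:
    metadata:
      labels:
        app: appname
    spec:
      nodeSelector:
        type: compute            # run only on the compute nodes
      tolerations:
        - key: "type"            # illustrative taint key/value
          operator: "Equal"
          value: "compute"
          effect: "NoSchedule"
      containers:
        - name: appname
          image: appname:latest

This gives exactly one pod per matching node, but note that the replica count is then tied to the node count rather than set explicitly.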

-- pcsutar
Source: StackOverflow

6/14/2021

Try adding:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1

By default, Kubernetes scales the new ReplicaSet up before it starts scaling down the old replicas. Since the new replicas cannot be scheduled (because of the anti-affinity rule), they get stuck in the Pending state.

Once you set the deployment's maxSurge=0, you tell Kubernetes that you don't want the deployment to scale up first during an update; it can only scale down, which frees up room for the new replicas to be scheduled.

Setting maxUnavailable=1 tells Kubernetes to replace only one pod at a time.
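Putting it together, a minimal sketch of the full deployment (the name and labels are illustrative, borrowed from the question):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: appname
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # never create an extra pod during the rollout
      maxUnavailable: 1  # replace at most one pod at a time
  selector:
    matchLabels:
      app: appname
  template:
    metadata:
      labels:
        app: appname
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - appname
              topologyKey: kubernetes.io/hostname
      containers:
        - name: appname
          image: appname:latest

With maxSurge=0, each old pod is terminated first, freeing its node so the replacement pod can satisfy the anti-affinity rule and schedule there.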

-- Matt
Source: StackOverflow