Kubernetes - Completely avoid node with PreferNoSchedule taint

1/22/2021

The problem:

We have nodes in Kubernetes which will occasionally become tainted with an effect of PreferNoSchedule. When this happens, we would like our pods to completely avoid scheduling on these nodes (in other words, to act as if the taint's effect were actually NoSchedule). The taints aren't applied by us - we're using GKE and it's an automated thing on their end, so we're stuck with the PreferNoSchedule behaviour.

What we can control is the spec of our pods. I'm hoping this might be possible using a nodeAffinity on them, however the documentation on this is fairly sparse: see e.g. here. All the examples I can find refer to an affinity by labels, so I'm not sure whether a taint is even visible/accessible to this logic.

Effectively, in my pod spec I want to write something like:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/taints # I've invented this key, it doesn't work
            operator: NotIn
            values:
            - DeletionCandidateOfClusterAutoscaler

where DeletionCandidateOfClusterAutoscaler is the taint that we see applied. Is there a way to make this work?

The other approach we've thought about is a cronjob which looks for the PreferNoSchedule taint and adds our own NoSchedule taint on top... but that feels a little gross!
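The core decision that cron would make can be sketched in a few lines. This is only a sketch: the taint dicts mirror the shape of `node.spec.taints` in the Kubernetes API, and `needs_hard_taint` is a name I've invented for illustration.

```python
# Sketch: decide whether a node needs our own NoSchedule taint layered
# on top of the autoscaler's soft PreferNoSchedule taint.
# Taint dicts mirror the shape of node.spec.taints in the Kubernetes API.

SOFT_KEY = "DeletionCandidateOfClusterAutoscaler"

def needs_hard_taint(taints):
    """Return True if the node carries the autoscaler's soft taint
    but does not yet carry our hard NoSchedule copy of it."""
    taints = taints or []
    has_soft = any(
        t.get("key") == SOFT_KEY and t.get("effect") == "PreferNoSchedule"
        for t in taints
    )
    has_hard = any(
        t.get("key") == SOFT_KEY and t.get("effect") == "NoSchedule"
        for t in taints
    )
    return has_soft and not has_hard

# Example: a node the autoscaler has just marked as a deletion candidate
node_taints = [
    {"key": SOFT_KEY, "value": "1611300000", "effect": "PreferNoSchedule"},
]
print(needs_hard_taint(node_taints))  # True
```

A cron or controller would run this check against each node's taint list and, where it returns True, apply the hard taint (e.g. via `kubectl taint` or the API).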

Any neat suggestions / workarounds would be appreciated!

The long version (if you're interested):

The taint gets applied by the autoscaler to say the node is going to go away in 30 minutes or so. This GitHub issue, from people having the same trouble, describes it in some more detail: https://github.com/kubernetes/autoscaler/issues/2967

-- Cookie Monster
google-kubernetes-engine
kubernetes

1 Answer

1/28/2021

I have tested the same scenario, and indeed GKE reconciles the current node configuration the moment the autoscaler starts to spin up. This is to ensure there is no downtime if the node onto which the pods/workloads can be scheduled lacks resources. I believe there is no clean way to set a hard NoSchedule taint.

So, the critical information to keep in mind when using the autoscaler is:

  • Pods will not be scheduled onto the soft-tainted node pool if there are resources available in the regular one.

  • If there are not enough resources in the regular node pool, pods will be scheduled onto the soft-tainted one.

  • If there aren't enough resources in either node pool, the node pool with the smallest node type will be autoscaled, regardless of the taints.

As you mentioned, a dirty workaround would be to:

A.- Create a cron or daemon that sets the NoSchedule taint, overwriting the soft one set by the autoscaler.

B.- Ensure resource availability, perhaps by setting accurate resource requests and limits.
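For option B, a minimal sketch of explicit requests and limits on a container, in the same style as the pod spec above (the name, image, and values are placeholders, not a recommendation):

```yaml
spec:
  containers:
  - name: app            # placeholder name
    image: example/app   # placeholder image
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
```

With accurate requests, the scheduler's view of free capacity in the regular node pool matches reality, making it less likely that pods spill onto the soft-tainted pool.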

-- Jujosiga
Source: StackOverflow