nodeAffinity with preferredDuringSchedulingIgnoredDuringExecution set always schedules a pod on an incorrect node

7/17/2018

I have two worker nodes in my environment. I have added a label to one of them like so:

    kubectl label nodes "${node}" type=infrastructure --overwrite
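
To confirm the label actually landed on the node, I check with:

    # Show which nodes carry the label set above
    kubectl get nodes -l type=infrastructure --show-labels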

In my service yaml file, I have set the following up:

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: "type"
              operator: In
              values: ["infrastructure"]

My understanding of the preferredDuringSchedulingIgnoredDuringExecution rule is that the Kubernetes scheduler will try its best to place the pods on the node with the "infrastructure" label, but if it cannot (e.g. not enough resources) it will fall back to a different node in the cluster.

What I am seeing is that every time I deploy the service (3 pods), 1 pod always ends up on the node without the label.

Is there any way to find out why the Kubernetes scheduler chose the unlabeled node? If it were a resource issue, I would expect to see it logged in the events, but instead the scheduler picks the unlabeled node straight away:

    Normal  Scheduled              23m   default-scheduler  Successfully assigned es-master-5f55dd9dd-2n48b to pink02
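
For reference, I have been checking where each replica ends up with something like this (the app=es-master label selector is only illustrative, based on the pod name above):

    # List the pods of the deployment together with the node each was scheduled to
    kubectl get pods -l app=es-master -o wide

    # Show the resource requests that feed into the scheduling decision
    kubectl describe pod es-master-5f55dd9dd-2n48b | grep -A 6 "Requests"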

I understand that I can use the rule requiredDuringSchedulingIgnoredDuringExecution to force the pods onto the labeled node, but I don't want to do this because some environments may not have the label.
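
For completeness, the required variant I want to avoid would look roughly like this (a sketch; with it, the pods would stay Pending in environments where no node has the label):

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: "type"
              operator: In
              values: ["infrastructure"]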

-- Conall Ó Cofaigh
kubernetes

1 Answer

7/18/2018

I assume that you actually do not have enough resources, since preferred nodeAffinity is a pretty simple, score-based mechanism rather than a guarantee. From the comment in the Kubernetes source on GitHub:

    // CalculateNodeAffinityPriorityMap prioritizes nodes according to node affinity scheduling preferences
    // indicated in PreferredDuringSchedulingIgnoredDuringExecution. Each time a node matches a
    // preferredSchedulingTerm, it will get an add of preferredSchedulingTerm.Weight. Thus, the more
    // preferredSchedulingTerms the node satisfies and the more the preferredSchedulingTerm that is
    // satisfied weights, the higher score the node gets.
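
Since the score a node gets from this preference is proportional to the weight, it may be worth trying a much higher weight (the allowed range is 1 to 100) so the labeled node scores noticeably better; roughly:

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100          # maximum allowed weight for a preference
          preference:
            matchExpressions:
            - key: "type"
              operator: In
              values: ["infrastructure"]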

Another quote, this one describing the required variant:

  • a field called RequiredDuringSchedulingIgnoredDuringExecution which is identical to RequiredDuringSchedulingRequiredDuringExecution except that the system may or may not try to eventually evict the pod from its node.

Please check whether you really have enough free resources: given the quoted documentation, the behavior you see most likely comes down to the second node scoring higher once the first pods have been placed on the labeled node. As for your last question: you will not see this in the events, because scoring decisions are not logged as events; theoretically you should see them if you run the scheduler with a high log verbosity, but I am not sure about that.
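
A quick way to check the resource side is to compare what is already allocated on each node with its capacity, roughly like this (pink02 is taken from your event; "${node}" is the labeled node from your label command):

    # Requests already allocated on the labeled node vs. its allocatable capacity
    kubectl describe node "${node}" | grep -A 8 "Allocated resources"

    # The same for the node the pod actually landed on
    kubectl describe node pink02 | grep -A 8 "Allocated resources"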

-- aurelius
Source: StackOverflow