Kubernetes: kube-scheduler is not correctly scoring nodes for pod assignment

11/15/2019

I am running Kubernetes with Rancher, and I am seeing odd behavior from the kube-scheduler. After adding a third node, I expected pods to start getting scheduled and assigned to it. However, the kube-scheduler gives this new node, node3, the lowest score, even though it has almost no pods running on it; I would expect it to receive the highest score.

Here are the logs from the kube-scheduler:

scheduling_queue.go:815] About to try and schedule pod namespace1/pod1
scheduler.go:456] Attempting to schedule pod: namespace1/pod1
predicates.go:824] Schedule Pod namespace1/pod1 on Node node1 is allowed, Node is running only 94 out of 110 Pods.
predicates.go:1370] Schedule Pod namespace1/pod1 on Node node1 is allowed, existing pods anti-affinity terms satisfied.
predicates.go:824] Schedule Pod namespace1/pod1 on Node node3 is allowed, Node is running only 4 out of 110 Pods.
predicates.go:1370] Schedule Pod namespace1/pod1 on Node node3 is allowed, existing pods anti-affinity terms satisfied.
predicates.go:824] Schedule Pod namespace1/pod1 on Node node2 is allowed, Node is running only 95 out of 110 Pods.
predicates.go:1370] Schedule Pod namespace1/pod1 on Node node2 is allowed, existing pods anti-affinity terms satisfied.
resource_allocation.go:78] pod1 -> node1: BalancedResourceAllocation, capacity 56000 millicores 270255251456 memory bytes, total request 40230 millicores 122473676800 memory bytes, score 7
resource_allocation.go:78] pod1 -> node1: LeastResourceAllocation, capacity 56000 millicores 270255251456 memory bytes, total request 40230 millicores 122473676800 memory bytes, score 3
resource_allocation.go:78] pod1 -> node3: BalancedResourceAllocation, capacity 56000 millicores 270255251456 memory bytes, total request 800 millicores 807403520 memory bytes, score 9
resource_allocation.go:78] pod1 -> node3: LeastResourceAllocation, capacity 56000 millicores 270255251456 memory bytes, total request 800 millicores 807403520 memory bytes, score 9
resource_allocation.go:78] pod1 -> node2: BalancedResourceAllocation, capacity 56000 millicores 270255247360 memory bytes, total request 43450 millicores 133693440000 memory bytes, score 7
resource_allocation.go:78] pod1 -> node2: LeastResourceAllocation, capacity 56000 millicores 270255247360 memory bytes, total request 43450 millicores 133693440000 memory bytes, score 3
generic_scheduler.go:748] pod1_namespace1 -> node1: TaintTolerationPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node3: TaintTolerationPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node2: TaintTolerationPriority, Score: (10)
selector_spreading.go:146] pod1 -> node1: SelectorSpreadPriority, Score: (10)
selector_spreading.go:146] pod1 -> node3: SelectorSpreadPriority, Score: (10)
selector_spreading.go:146] pod1 -> node2: SelectorSpreadPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node1: SelectorSpreadPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node3: SelectorSpreadPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node2: SelectorSpreadPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node1: NodeAffinityPriority, Score: (0)
generic_scheduler.go:748] pod1_namespace1 -> node3: NodeAffinityPriority, Score: (0)
generic_scheduler.go:748] pod1_namespace1 -> node2: NodeAffinityPriority, Score: (0)
interpod_affinity.go:232] pod1 -> node1: InterPodAffinityPriority, Score: (0)
interpod_affinity.go:232] pod1 -> node3: InterPodAffinityPriority, Score: (0)
interpod_affinity.go:232] pod1 -> node2: InterPodAffinityPriority, Score: (10)
generic_scheduler.go:803] Host node1 => Score 100040
generic_scheduler.go:803] Host node3 => Score 100038
generic_scheduler.go:803] Host node2 => Score 100050
scheduler_binder.go:256] AssumePodVolumes for pod "namespace1/pod1", node "node2"
scheduler_binder.go:266] AssumePodVolumes for pod "namespace1/pod1", node "node2": all PVCs bound and nothing to do
factory.go:727] Attempting to bind pod1 to node2
-- KZcoding
kube-scheduler
kubernetes
rancher
rancher-rke

1 Answer

11/16/2019

I can tell from the logs that your pod will always be scheduled on node2, because it seems like you have some sort of PodAffinity that scores an additional 10 there, bringing its total to 50.

What's kind of odd is that my arithmetic gives 48 for node3, but a -10 seems to be getting stuck in there somewhere (the log totals 38). It could be because of the affinity, some entry that isn't shown in the logs, or simply a bug in the way the scheduler is doing the calculation. You'll probably have to dig into the kube-scheduler code if you'd like to find out more.

This is what I have:

node1 7 + 3 + 10 + 10 + 10 = 40
node2 7 + 3 + 10 + 10 + 10 + 10 = 50
node3 9 + 9 + 10 + 10 + 10 = 48
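
For completeness, here is that same arithmetic as a small Python sketch (not scheduler code), comparing the hand sums against the `Host nodeX => Score` lines from your log. Subtracting a constant 100000 from each logged total is an assumption on my part, based on node1 and node2 lining up exactly once it is removed:

    # Sanity check of the hand arithmetic above.
    # hand_scores: per-priority terms as I summed them in the answer.
    # logged_totals: the "Host nodeX => Score" lines from the question.
    hand_scores = {
        "node1": [7, 3, 10, 10, 10],
        "node2": [7, 3, 10, 10, 10, 10],
        "node3": [9, 9, 10, 10, 10],
    }
    logged_totals = {"node1": 100040, "node2": 100050, "node3": 100038}

    # Assumption: every logged total carries the same fixed 100000 component,
    # so subtract it before comparing against the hand sums.
    OFFSET = 100000

    for node, terms in hand_scores.items():
        expected = sum(terms)
        observed = logged_totals[node] - OFFSET
        print(f"{node}: expected {expected}, logged {observed}, delta {observed - expected:+d}")

    # Output:
    # node1: expected 40, logged 40, delta +0
    # node2: expected 50, logged 50, delta +0
    # node3: expected 48, logged 38, delta -10   <- the missing 10
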
-- Rico
Source: StackOverflow