Even after adding an additional Kubernetes node, I see the new node unused while getting the error "No nodes are available that match all of the predicates"

7/23/2018

We tried to add one more deployment with 2 pods to the existing mix of pods scheduled across a cluster of 4 nodes and 1 master. We are getting the following error: No nodes are available that match all of the predicates: Insufficient cpu (4), Insufficient memory (1), PodToleratesNodeTaints (2).

Looking at other threads and the documentation, this would be the case when the existing nodes are exceeding CPU capacity (on 4 nodes) and memory capacity (on 1 node).

To solve the resource issue, we added another node and redeployed. But we still see the same errors and an almost entirely unused node (see node-5 below, which is not used while node-2 and node-4 are over-allocated; node-1 and node-3 would be over-allocated after the addition of the new pods that are failing).

| nodename | CPU requests (cores) | CPU limits (cores) | Memory requests | Memory limits | Age |
|----------|----------------------|----------------------|------------------|------------------|----------|
| node-5 | 0.11 (5.50%) | 0 (0.00%) | 50 Mi (1.26%) | 50 Mi (1.26%) | 3 hours |
| node-4 | 1.61 (80.50%) | 2.8 (140.00%) | 2.674 Gi (69.24%) | 4.299 Gi (111.32%) | 7 days |
| node-3 | 1.47 (73.50%) | 1.7 (85.00%) | 2.031 Gi (52.60%) | 2.965 Gi (76.78%) | 7 months |
| node-2 | 1.33 (66.50%) | 2.1 (105.00%) | 2.684 Gi (69.49%) | 3.799 Gi (98.37%) | 7 months |
| node-1 | 1.48 (74.00%) | 1.4 (70.00%) | 1.705 Gi (44.15%) | 2.514 Gi (65.09%) | 7 months |
| master | 0.9 (45.00%) | 0.1 (5.00%) | 350 Mi (8.85%) | 300 Mi (7.59%) | 7 months |
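For anyone wanting to reproduce these numbers: the per-node request/limit totals in the table can be read straight from kubectl. A minimal sketch, using one of the nodes from the table above:

    # The "Allocated resources" section near the end of the output is the source
    # of the requests/limits percentages shown in the table above.
    kubectl describe node node-4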

Note that we have autoscaling enabled (with a limit of 8 nodes). Our client version is v1.9.0, while our Kubernetes server version is v1.8.4. We are using helm to deploy and kops to add the new node.

Why are the pods not scheduled so that each node stays below capacity? Why are we seeing these errors and one completely unused node?
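For context, this is roughly how the failed predicates can be inspected per pod; the pod name below is a placeholder:

    # find the pods that are stuck in Pending
    kubectl get pods --all-namespaces | grep Pending

    # the Events section at the bottom repeats the scheduler's predicate failures
    # (Insufficient cpu, Insufficient memory, PodToleratesNodeTaints, ...)
    kubectl describe pod <pending-pod-name>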

-- mi10
amazon-web-services
kops
kubernetes
kubernetes-helm

1 Answer

7/23/2018

Figured out what was going on. Here is what we think happened:

  1. We added a new node (the 5th one) using kops.
  2. At that time, the cluster autoscaler we had running was configured with min 4 and max 8 nodes. It presumably decided this node was not needed and added the following taint to it:

Taints: ToBeDeletedByClusterAutoscaler=1532321512:NoSchedule

  3. So even when we tried to deploy and redeploy services, none of the pods were scheduled to this node because of this taint (a quick way to check for it is shown below).
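A quick way to confirm whether a node carries such a taint, with node-5 standing in for the newly added node:

    # shows any taints on the node, e.g. ToBeDeletedByClusterAutoscaler=...:NoSchedule
    kubectl describe node node-5 | grep -i taints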

We then redeployed the autoscaler with new values of min = 5 and max = 8.
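For reference, on AWS the min/max bounds are usually passed to the cluster autoscaler through its --nodes flag. This is only a sketch: the ASG name below is a placeholder, and our actual deployment may wire these values differently:

    # cluster-autoscaler node-group bounds: --nodes=<min>:<max>:<ASG name>
    ./cluster-autoscaler --cloud-provider=aws --nodes=5:8:<your-node-asg-name>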

Then we removed this taint and redeployed, and the issue of the 5th node not being used went away. With that node's resources available, we no longer got the scheduling error.
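Removing the taint itself is a one-liner; the trailing "-" tells kubectl to remove the taint with that key and effect:

    # remove the autoscaler's NoSchedule taint from the new node
    kubectl taint nodes node-5 ToBeDeletedByClusterAutoscaler:NoSchedule-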

We are not sure why the autoscaler marked the new node with this taint; that is a question for another day, or maybe a bug in the Kubernetes autoscaler. But the issue was fixed by removing that taint from the new node.

-- mi10
Source: StackOverflow