We tried to add one more deployment with 2 pods to the existing mix of pods running on a cluster of 4 worker nodes and 1 master node. We are getting the following error:
No nodes are available that match all of the predicates: Insufficient cpu (4), Insufficient memory (1), PodToleratesNodeTaints (2).
Based on other threads and the documentation, this error occurs when the existing nodes are exceeding CPU capacity (on 4 nodes) and memory capacity (on 1 node).
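If it helps to verify, the pending pods' requested resources and the scheduler's reasoning can be inspected with kubectl (the pod name below is a placeholder):

```
# Show the scheduler events (including the Insufficient cpu/memory message) for a pending pod.
kubectl describe pod <pending-pod-name>

# Show just the CPU/memory requests declared by that pod's containers.
kubectl get pod <pending-pod-name> -o jsonpath='{.spec.containers[*].resources.requests}'
```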
To solve the resource issue, we added another node and redeployed. But we still see the same error, and one node is almost unused (node-5 below is barely used while node-2 and node-4 are over-allocated; node-1 and node-3 would become over-allocated once the failing pods are added).
Node | CPU requests (cores) | CPU limits (cores) | Memory requests | Memory limits | Age
--- | --- | --- | --- | --- | ---
node-5 | 0.11 (5.50%) | 0 (0.00%) | 50 Mi (1.26%) | 50 Mi (1.26%) | 3 hours
node-4 | 1.61 (80.50%) | 2.8 (140.00%) | 2.674 Gi (69.24%) | 4.299 Gi (111.32%) | 7 days
node-3 | 1.47 (73.50%) | 1.7 (85.00%) | 2.031 Gi (52.60%) | 2.965 Gi (76.78%) | 7 months
node-2 | 1.33 (66.50%) | 2.1 (105.00%) | 2.684 Gi (69.49%) | 3.799 Gi (98.37%) | 7 months
node-1 | 1.48 (74.00%) | 1.4 (70.00%) | 1.705 Gi (44.15%) | 2.514 Gi (65.09%) | 7 months
master | 0.9 (45.00%) | 0.1 (5.00%) | 350 Mi (8.85%) | 300 Mi (7.59%) | 7 months
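Per-node figures like the ones above can be pulled from kubectl, roughly like this:

```
# Print requests/limits (absolute values and percentage of allocatable) for every node.
kubectl describe nodes | grep -A 8 "Allocated resources"
```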
Note that we have autoscaling enabled (with a limit of 8 nodes). Our client version is v1.9.0 while our Kubernetes server version is v1.8.4. We are using Helm to deploy and kops to add the new node.
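For context, adding the extra node with kops was done roughly as sketched below (the instance group name "nodes" and the cluster-name placeholder are assumptions):

```
# Raise minSize/maxSize of the worker instance group, then apply the change.
kops edit ig nodes --name <cluster-name>
kops update cluster <cluster-name> --yes
```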
Why are the pods not being scheduled so that each node stays below capacity? Why are we seeing these errors and one completely unused node?
Figured out what was going on. Here is what we think happened: the new node (node-5) came up with the following taint on it:
Taints: ToBeDeletedByClusterAutoscaler=1532321512:NoSchedule
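This NoSchedule taint is presumably why the scheduler reported PodToleratesNodeTaints (2): the master and this new node both reject pods without a matching toleration. A quick way to check taints (node names as in the table above):

```
# Taints on the newly added node.
kubectl describe node node-5 | grep -i taints

# Or list taints for all nodes at once.
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```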
We then redeployed the autoscaler with new values of min = 5 and max = 8.
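A sketch of that change, assuming the cluster autoscaler runs as a Deployment in kube-system and takes its bounds via the --nodes flag (the deployment and node-group names are placeholders):

```
# The cluster autoscaler is bounded with --nodes=<min>:<max>:<node-group-name>,
# e.g. --nodes=5:8:nodes.<cluster-name>
kubectl -n kube-system edit deployment cluster-autoscaler
```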
Then we removed this taint and redeployed, and the issue of the 5th node not being used went away. With node-5 schedulable, there were enough node resources, so we no longer got the error.
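Removing the taint itself is a one-liner (the trailing "-" removes the named taint):

```
# Remove the NoSchedule taint the autoscaler left on the new node.
kubectl taint nodes node-5 ToBeDeletedByClusterAutoscaler:NoSchedule-
```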
We are not sure why the autoscaler marked the new node with this taint; that is a question for another day, or maybe a bug in the Kubernetes cluster autoscaler. But the issue was fixed by removing that taint from the new node.