It is clear from the documentation that whenever pods are Pending because no node has enough free resources to satisfy their resource requests, the cluster autoscaler will create another node within roughly 30 seconds of pod creation (for reasonably sized clusters).
However, consider the case where a node is quite packed. Say the node has 2 CPU cores and runs 4 pods, each with a 0.5 CPU request and a 1.0 CPU limit. Load arrives, and all 4 pods now want an additional 0.5 CPU each, which the node cannot provide since all of its CPU is already claimed by the 4 running pods.
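For concreteness, a minimal Pod spec matching that shape might look like this (the name and image are hypothetical placeholders; only the `resources` block matters here):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod            # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:1.0   # placeholder image
      resources:
        requests:
          cpu: "500m"          # 0.5 CPU - what the scheduler and cluster autoscaler account for
        limits:
          cpu: "1"             # 1.0 CPU - ceiling the container may burst to if spare CPU exists
```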
In this situation, I'd expect Kubernetes to 'understand' that running pods have resource demand that cannot be served and 'move' (destroy and recreate) those pods on another node that can satisfy their request (plus the resources they are currently using). If no such node exists, I'd expect Kubernetes to create an additional node and move the pods there.
However, I don't see this happening. The pods keep running on the same node (I guess that node could be called overcommitted) regardless of the demand that cannot be met, and performance suffers as a result.
My question is whether this behaviour can be avoided by any means other than setting the ratio between pod resource requests and limits to 1:1 (so a pod can never use more resources than initially allocated). Obviously I would like to avoid setting requests and limits to the same value, since that means over-provisioning and paying for more than I need.
It's important to recognise the distinction here between the CPU request in a PodSpec and the amount of CPU a process is actually trying to use. Kubernetes scheduling and cluster autoscaling are based purely on the request in the PodSpec; actual usage is irrelevant for those decisions.
In the case you're describing, the Pod still only requests 0.5 CPU (that field is immutable). The process may now be trying to use a full 1 CPU, but that usage isn't considered.
Setting CPU limits higher than requests allows best-effort use of that spare capacity, but it isn't a guarantee, as you're seeing.
In this scenario, it sounds like you want to use both the Horizontal Pod Autoscaler and the cluster autoscaler. Under increased load (where the Pods start to use more than, say, 80% of their CPU request), the HPA will increase the number of Pods for the service to handle demand. If those Pods then have nowhere they can fit, the cluster autoscaler will provision more Nodes. This way, your Pods can still use up to the request value, and only when they start getting close to it are more Nodes provisioned, so you won't over-provision resources up front.
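As a rough sketch (the names, replica counts and the 80% threshold are illustrative, not prescriptive), an `autoscaling/v2` HPA targeting CPU utilisation relative to the request could look like:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa             # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app               # hypothetical Deployment running the Pods above
  minReplicas: 4
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # percentage of the CPU *request*, not the limit
```

With the earlier 0.5 CPU request, average usage climbing past roughly 0.4 CPU per Pod would trigger the HPA to add replicas; if those replicas can't be scheduled, the cluster autoscaler adds Nodes.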