How to avoid the last pod being killed on automatic node scale down in AKS

10/7/2020

We are using Azure AKS v1.17.9 with auto-scaling both for pods (using HorizontalPodAutoscaler) and for nodes. Overall it works well, but we have seen outages in some cases. We have some deployments where minReplicas=1 and maxReplicas=4. Most of the time there will only be one pod running for such a deployment. In some cases where the auto-scaler has decided to scale down a node, the last remaining pod has been killed. Later a new pod is started on another node, but this means an outage.
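For context, the HPA for one of these deployments looks roughly like this (the deployment name and CPU target below are placeholders, not our exact values):

```yaml
# Rough sketch of the autoscaler described above (autoscaling/v2beta2 is
# the API version available on Kubernetes 1.17). Names and the CPU
# target are placeholders.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```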

I would have expected the auto-scaler to first create a new pod on another node (bringing the number of replicas up to the allowed value of 2) and then scale down the old pod. That would have worked without downtime. As it is, it kills first and asks questions later.

Is there a way around this except the obvious alternative of setting minReplicas=2 (which increases the cost as all these pods are doubled, needing additional VMs)? And is this expected, or is it a bug?

-- ewramner
azure-aks
kubernetes

1 Answer

10/7/2020

In some cases where the auto-scaler has decided to scale down a node, the last remaining pod has been killed. Later a new pod is started on another node, but this means an outage.

For this reason, you should always have at least 2 replicas for a Deployment in a production environment. You should also use Pod Anti-Affinity so that those two pods are not scheduled to the same Availability Zone. That way, if there are network problems in one Availability Zone, your app is still available.
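A minimal sketch of such a Deployment, assuming the app is labelled app: my-service and the nodes carry the standard topology.kubernetes.io/zone label (all names and the image are placeholders):

```yaml
# Sketch: two replicas that must land in different Availability Zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: no two pods with this label in the same zone.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-service
              # Older clusters may only have the legacy label
              # failure-domain.beta.kubernetes.io/zone instead.
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: my-service
          image: myregistry/my-service:1.0  # placeholder image
```

Note that a required anti-affinity rule allows at most one pod per zone, so with maxReplicas=4 and only 3 zones the fourth pod would stay Pending; preferredDuringSchedulingIgnoredDuringExecution is the softer alternative if you want the HPA to scale beyond the number of zones.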

It is common to have at least 3 replicas, one in each Availability Zone, since cloud providers typically have 3 Availability Zones per Region. That way most traffic can stay within a zone, and intra-zone traffic is typically cheaper than cross-zone traffic.

You can always use fewer replicas to save cost, but it is a trade-off and you get worse availability.

-- Jonas
Source: StackOverflow