Will k8s scale a pod within HPA range to evict it and meet disruption budget?

4/22/2021

Excuse me for asking something that overlaps heavily with many specific questions in the same area. I am curious to know whether Kubernetes will scale a pod up in order to evict it.

Given are the following facts at the time of eviction:

  1. The pod is running one instance.

  2. The pod has an HPA controlling it, with the following params:

    • minCount: 1
    • maxCount: 2
  3. It has a PDB with params:

    • minAvailable: 1

I would expect the k8s controller to have enough information to safely scale up to 2 instances to meet the PDB, and until recently I was assuming it would indeed do so.
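
For reference, here is a minimal sketch of what I mean in manifest form (the names, labels, Deployment target, and scaling metric are placeholders; note the real HPA fields are spelled minReplicas/maxReplicas rather than minCount/maxCount):

```yaml
# Sketch of the setup described above; names and the CPU metric are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app        # placeholder workload name
  minReplicas: 1        # "minCount" above
  maxReplicas: 2        # "maxCount" above
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app       # must match the Deployment's pod labels
```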

Why am I asking this? (The question behind the question ;)

Well, we ran into auto-upgrade problems on AKS because it won't evict pods in the situation described above, and the Azure team told me to change the params. But if no scaling happens, that means we have to raise the minimum count to 2, effectively running an extra pod at all times just to accommodate future evictions. I want to get to the bottom of this before I file a feature request with k8s or a bug with AKS.

-- Morriz
azure-aks
horizontal-scaling
kubernetes
kubernetes-pod

1 Answer

4/22/2021

I believe these two parts are independent; the pod disruption budget doesn't look at the autoscaling capability, or otherwise realize that a pod is running as part of a deployment that could be temporarily upscaled.

If you have a deployment with replicas: 1, and a corresponding PDB with minAvailable: 1, this will prevent the node the pod is running on from being taken out of service. (I see this behavior in the system I work on professionally, using a different Kubernetes environment.)
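
Concretely, with one replica and minAvailable: 1 the budget never has headroom: the PDB status reports zero allowed disruptions, so the eviction API refuses to remove the pod and the drain stalls. A rough sketch of the status stanza the controller would compute in that case (values illustrative):

```yaml
# PodDisruptionBudget status for replicas: 1 guarded by minAvailable: 1 (sketch).
status:
  expectedPods: 1        # pods matched by the PDB's selector
  currentHealthy: 1      # healthy pods right now
  desiredHealthy: 1      # pods that must remain available (minAvailable)
  disruptionsAllowed: 0  # currentHealthy - desiredHealthy: no eviction may proceed
```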

The way this works normally (see also the PodDisruptionBudget example in the Kubernetes documentation):

  1. Some command like kubectl drain or the cluster autoscaler marks a node as going out of service.
  2. The pods on that node are terminated.
  3. The ReplicaSet controller sees that some ReplicaSets now have too few pods, and creates replacements.
  4. The new pods get scheduled on in-service nodes.

The pod disruption budget only affects the first part of this sequence; it would keep kubectl drain from actually draining a node until the disruption budget could be satisfied, or cause the cluster autoscaler to pick a different node. HPA isn't considered at all, nor is it considered that it's "normal" to run extra copies of a deployment-managed pod during upgrades. (That is, this is a very reasonable question, it just doesn't work that way right now.)

My default setup for most deployments is 3 replicas with a pod disruption budget requiring at least 1 of them to be available. That definitely adds some cost to operating the service, but it makes you tolerant of an involuntary node failure and it does allow you to consciously rotate nodes out. For things that read from message queues (Kafka- or RabbitMQ-based workers) it can make sense to run only 1 replica with no PDB, since the queue will absorb a brief outage of the worker.
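
A minimal sketch of that default, with placeholder names and image:

```yaml
# Sketch of a 3-replica Deployment with a PDB keeping at least 1 pod available.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: app
          image: registry.example.com/my-service:1.0.0   # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-service
```

With three healthy pods and minAvailable: 1, the budget leaves two voluntary disruptions available at any time, so node rotation can always make progress.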

-- David Maze
Source: StackOverflow