How to configure pod disruption budget to drain kubernetes node?

12/7/2018

I'd like to configure cluster autoscaler on AKS. When scaling down it fails due to PDB:

I1207 14:24:09.523313       1 cluster.go:95] Fast evaluation: node aks-nodepool1-32797235-0 cannot be removed: no enough pod disruption budget to move kube-system/metrics-server-5cbc77f79f-44f9w
I1207 14:24:09.523413       1 cluster.go:95] Fast evaluation: node aks-nodepool1-32797235-3 cannot be removed: non-daemonset, non-mirrored, non-pdb-assignedkube-system pod present: cluster-autoscaler-84984799fd-22j42
I1207 14:24:09.523438       1 scale_down.go:490] 2 nodes found to be unremovable in simulation, will re-check them at 2018-12-07 14:29:09.231201368 +0000 UTC m=+8976.856144807

All system pods have a minAvailable: 1 PDB assigned manually (a sketch of such a PDB follows the pod listing below). I can imagine that this does not work for pods with only a single replica, like the metrics-server:

k get nodes -o wide
NAME                       STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-nodepool1-32797235-0   Ready    agent   4h    v1.11.4   10.240.0.4    <none>        Ubuntu 16.04.5 LTS   4.15.0-1030-azure   docker://3.0.1
aks-nodepool1-32797235-3   Ready    agent   4h    v1.11.4   10.240.0.6    <none>        Ubuntu 16.04.5 LTS   4.15.0-1030-azure   docker://3.0.1

ks get pods -o wide
NAME                                    READY   STATUS    RESTARTS   AGE   IP            NODE                       NOMINATED NODE
cluster-autoscaler-84984799fd-22j42     1/1     Running   0          2h    10.244.1.5    aks-nodepool1-32797235-3   <none>
heapster-5d6f9b846c-g7qb8               2/2     Running   0          1h    10.244.0.16   aks-nodepool1-32797235-0   <none>
kube-dns-v20-598f8b78ff-8pshc           4/4     Running   0          3h    10.244.1.4    aks-nodepool1-32797235-3   <none>
kube-dns-v20-598f8b78ff-plfv8           4/4     Running   0          1h    10.244.0.15   aks-nodepool1-32797235-0   <none>
kube-proxy-fjvjv                        1/1     Running   0          1h    10.240.0.6    aks-nodepool1-32797235-3   <none>
kube-proxy-szr8z                        1/1     Running   0          1h    10.240.0.4    aks-nodepool1-32797235-0   <none>
kube-svc-redirect-2rhvg                 2/2     Running   0          4h    10.240.0.4    aks-nodepool1-32797235-0   <none>
kube-svc-redirect-r2m4r                 2/2     Running   0          4h    10.240.0.6    aks-nodepool1-32797235-3   <none>
kubernetes-dashboard-68f468887f-c8p78   1/1     Running   0          4h    10.244.0.7    aks-nodepool1-32797235-0   <none>
metrics-server-5cbc77f79f-44f9w         1/1     Running   0          4h    10.244.0.3    aks-nodepool1-32797235-0   <none>
tiller-deploy-57f988f854-z9qln          1/1     Running   0          4h    10.244.0.8    aks-nodepool1-32797235-0   <none>
tunnelfront-7cf9d447f9-56g7k            1/1     Running   0          4h    10.244.0.2    aks-nodepool1-32797235-0   <none>
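
For reference, the PDBs I assigned look roughly like this (a sketch; the name and the label selector are illustrative and have to match the actual pod labels):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: metrics-server-pdb        # illustrative name
  namespace: kube-system
spec:
  minAvailable: 1                 # with a single replica this leaves zero allowed disruptions
  selector:
    matchLabels:
      k8s-app: metrics-server     # assumed label, check the actual pod labels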

What needs to be changed (number of replicas? PDB configuration?) for down-scaling to work?

-- andig
autoscaling
azure
azure-aks
kubernetes

1 Answer

12/10/2018

Basically, this is an administration issue that comes up when draining nodes that host pods covered by a PDB (Pod Disruption Budget).

This is because the evictions are forced to respect the PDB you specify.
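
To see whether a budget currently allows any disruptions at all, you can check its status; with a single replica and minAvailable: 1 the allowed disruptions stay at 0 (the output below is approximate and reuses the illustrative PDB name from the question):

kubectl get pdb -n kube-system
NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
metrics-server-pdb   1               N/A               0                     1h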

You have two options:

Either force the drain:

kubectl drain foo --force --grace-period=0
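
Applied to one of the nodes from the question, it would look something like this (a sketch; --ignore-daemonsets is added because kube-proxy and kube-svc-redirect appear to be DaemonSet pods, and --delete-local-data may also be needed if pods use emptyDir volumes):

kubectl drain aks-nodepool1-32797235-0 --force --grace-period=0 --ignore-daemonsets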

You can check the other options in the docs: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain

Or use the Eviction API:

{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}
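
One way to submit that object (a sketch, assuming it is saved as eviction.json and kubectl proxy is running on its default port 8001) is to POST it to the pod's eviction subresource:

kubectl proxy &
curl -v -H 'Content-Type: application/json' \
  http://127.0.0.1:8001/api/v1/namespaces/default/pods/quux/eviction \
  -d @eviction.json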

Either way, both drain and the Eviction API attempt to delete the pods so that they can be rescheduled elsewhere before the node is completely drained.

As mentioned in the docs:

the API can respond in one of three ways:

  1. If the eviction is granted, then the pod is deleted just as if you had sent a DELETE request to the pod’s URL and you get back 200 OK.
  2. If the current state of affairs wouldn’t allow an eviction by the rules set forth in the budget, you get back 429 Too Many Requests. This is typically used for generic rate limiting of any requests.
  3. If there is some kind of misconfiguration, like multiple budgets pointing at the same pod, you will get 500 Internal Server Error.

For a given eviction request, there are two cases:

  1. There is no budget that matches this pod. In this case, the server always returns 200 OK.

  2. There is at least one budget. In this case, any of the three above responses may apply.

If it gets stuck, you might need to do it manually.
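
Doing it manually means deleting the stuck pod directly; a plain delete goes around the Eviction API, so the PDB does not block it (a sketch, using the metrics-server pod from the question):

kubectl delete pod metrics-server-5cbc77f79f-44f9w -n kube-system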

You can read more here or here.

-- hkhelil
Source: StackOverflow