Imagine a cluster running lots of different deployments. Some pods use PersistentVolumes (Azure Disks). Azure limits how many disks can be attached to a VM, and this leads to scheduling errors like:
```
Status=409 Code="OperationNotAllowed" Message="The maximum number of data disks allowed to be attached to a VM of this size is 8
```
Pods stay in the `Waiting: ContainerCreating` state forever, even though some nodes had far fewer pods with attached disks at the moment of scheduling. It would be great to limit the number of pods with attached disks per node so this error can never happen. I believe `podAntiAffinity` is what I need: I know I can keep pods with the same label from scheduling onto the same node, but I don't know how to allow that only until a node reaches its maximum number of pods with disks.
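For reference, here is a minimal sketch of the kind of `podAntiAffinity` rule I mean (the `app: with-disk` label and the image are just placeholders). It keeps such pods apart entirely, one per node, which is stricter than what I want: there is no way to express a count threshold here.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-disk
  labels:
    app: with-disk        # label shared by all pods that mount a disk
spec:
  affinity:
    podAntiAffinity:
      # Hard rule: never co-schedule two pods carrying this label on one node
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - with-disk
        topologyKey: kubernetes.io/hostname
  containers:
  - name: app
    image: nginx
```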
My installation is AKS, created with:
```
az acs create \
  --orchestrator-type=kubernetes \
  --orchestrator-version 1.7.9 \
  --resource-group <resource_group_here> \
  --name=<name_here> \
  ...
```
`KUBE_MAX_PD_VOLS` is what you are looking for. By default its value is 16 for Azure Disks, so you can either use instance sizes that share that attached-disk limit (16) or set it to your preferred value. You can see where it is declared on GitHub.
You should set this environment variable in your scheduler declaration. I found mine in `/etc/kubernetes/manifests/kube-scheduler.yaml`. This is what it looks like now:

```yaml
apiVersion: "v1"
kind: "Pod"
metadata:
  name: "kube-scheduler"
  ...
spec:
  containers:
  - name: "kube-scheduler"
    ...
    env:
    - name: KUBE_MAX_PD_VOLS
      value: "8"
    ...
```
Note the `KUBE_MAX_PD_VOLS` setting under `spec.containers.env`: it prevents the scheduler from placing more than 8 disks on any one node.
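To verify the scheduler actually picked the variable up, something like this should work (the `component=kube-scheduler` label is an assumption; your scheduler's static pod may be labelled differently):

```sh
# Print the scheduler pod's spec and check for the variable
kubectl -n kube-system get pod -l component=kube-scheduler -o yaml \
  | grep -A1 KUBE_MAX_PD_VOLS
```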
This way pods spread among the nodes without any issues, and pods that cannot fit stay in the `Pending` state until enough node capacity becomes available.
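If you want to confirm this behavior, a quick check is to list the pods stuck in `Pending` and read their scheduling events (the pod name below is a placeholder):

```sh
# List pods stuck in Pending across all namespaces
kubectl get pods --all-namespaces | grep Pending

# Inspect the scheduling events of one of them
kubectl describe pod <pending-pod-name>
```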