Azure AKS Scale out

2/4/2021

I have an AKS cluster with 3 nodes and tried to manually scale out from 3 to 4 nodes. The scale-up itself was fine, but after about 20 minutes all 4 nodes were in NotReady state and the kube-system pods were no longer Ready.
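
For reference, this kind of manual scale-out can be done with az aks scale; the resource group, cluster and node pool names below are placeholders:

# scale the default node pool from 3 to 4 nodes (placeholder names)
az aks scale \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --nodepool-name agentpool \
  --node-count 4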

Right after the scale-out, all four nodes were Ready:

NAME                                STATUS   ROLES   AGE   VERSION
aks-agentpool-40760006-vmss000000   Ready    agent   16m   v1.18.14
aks-agentpool-40760006-vmss000001   Ready    agent   17m   v1.18.14
aks-agentpool-40760006-vmss000002   Ready    agent   16m   v1.18.14
aks-agentpool-40760006-vmss000003   Ready    agent   11m   v1.18.14

Shortly afterwards the nodes went NotReady:

NAME                                STATUS     ROLES   AGE   VERSION
aks-agentpool-40760006-vmss000000   NotReady   agent   23m   v1.18.14
aks-agentpool-40760006-vmss000002   NotReady   agent   24m   v1.18.14
aks-agentpool-40760006-vmss000003   NotReady   agent   19m   v1.18.14

Several kube-system pods were Pending or Terminating:

kubectl get po -n kube-system
NAME                                  READY   STATUS        RESTARTS   AGE
coredns-748cdb7bf4-7frq2              0/1     Pending       0          10m
coredns-748cdb7bf4-vg5nn              0/1     Pending       0          10m
coredns-748cdb7bf4-wrhxs              1/1     Terminating   0          28m
coredns-autoscaler-868b684fd4-2gb8f   0/1     Pending       0          10m
kube-proxy-p6wmv                      1/1     Running       0          28m
kube-proxy-sksz6                      1/1     Running       0          23m
kube-proxy-vpb2g                      1/1     Running       0          28m
metrics-server-58fdc875d5-sbckj       0/1     Pending       0          10m
tunnelfront-5d74798f6b-w6rvn          0/1     Pending       0          10m

The events on node aks-agentpool-40760006-vmss000000 show:

Events:
  Type     Reason                   Age                   From              Message
  ----     ------                   ----                  ----              -------
  Normal   Starting                 25m                   kubelet           Starting kubelet.
  Normal   NodeHasSufficientMemory  25m (x2 over 25m)     kubelet           Node aks-agentpool-40760006-vmss000000 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    25m (x2 over 25m)     kubelet           Node aks-agentpool-40760006-vmss000000 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     25m (x2 over 25m)     kubelet           Node aks-agentpool-40760006-vmss000000 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  25m                   kubelet           Updated Node Allocatable limit across pods
  Normal   Starting                 25m                   kube-proxy        Starting kube-proxy.
  Normal   NodeReady                24m                   kubelet           Node aks-agentpool-40760006-vmss000000 status is now: NodeReady
  Warning  FailedToCreateRoute      5m5s                  route_controller  Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 50.264754ms: timed out waiting for the condition
  Warning  FailedToCreateRoute      4m55s                 route_controller  Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 45.945658ms: timed out waiting for the condition
  Warning  FailedToCreateRoute      4m45s                 route_controller  Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 46.180158ms: timed out waiting for the condition
  Warning  FailedToCreateRoute      4m35s                 route_controller  Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 46.550858ms: timed out waiting for the condition
  Warning  FailedToCreateRoute      4m25s                 route_controller  Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 44.74355ms: timed out waiting for the condition
  Warning  FailedToCreateRoute      4m15s                 route_controller  Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 42.428456ms: timed out waiting for the condition
  Warning  FailedToCreateRoute      4m5s                  route_controller  Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 41.664858ms: timed out waiting for the condition
  Warning  FailedToCreateRoute      3m55s                 route_controller  Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 48.456954ms: timed out waiting for the condition
  Warning  FailedToCreateRoute      3m45s                 route_controller  Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 38.611964ms: timed out waiting for the condition
  Warning  FailedToCreateRoute      65s (x16 over 3m35s)  route_controller  (combined from similar events): Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 13.972487ms: timed out waiting for the condition
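
The FailedToCreateRoute warnings indicate that the route controller cannot write the per-node pod CIDR routes (10.244.x.0/24) into the route table of the AKS node resource group. For reference, that route table can be inspected like this; the MC_ resource group name and route table name are placeholders:

# find the route table in the AKS node resource group (placeholder names)
az network route-table list \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --output table

# list the per-node routes the route controller should have created
az network route-table route list \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --route-table-name <routeTableName> \
  --output table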
-- Phan Tung Chau
azure
azure-aks
azure-devops
azure-vm-scale-set
kubernetes

2 Answers

2/4/2021

You can use the cluster autoscaler to avoid such situations in the future.

To keep up with application demands in Azure Kubernetes Service (AKS), you may need to adjust the number of nodes that run your workloads. The cluster autoscaler component can watch for pods in your cluster that can't be scheduled because of resource constraints. When issues are detected, the number of nodes in a node pool is increased to meet the application demand. Nodes are also regularly checked for a lack of running pods, with the number of nodes then decreased as needed. This ability to automatically scale up or down the number of nodes in your AKS cluster lets you run an efficient, cost-effective cluster.

You can update an existing AKS cluster to enable the cluster autoscaler; this works with your current resource group and cluster:

az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3
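
If you need to change the bounds later, the same az aks update command can adjust or disable the autoscaler; the counts below are just examples:

# adjust the autoscaler bounds on an existing cluster
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --update-cluster-autoscaler \
  --min-count 1 \
  --max-count 5

# or disable the autoscaler again
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --disable-cluster-autoscaler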
-- Vit
Source: StackOverflow

2/4/2021

It seems to be OK now. The problem was that I lacked the rights to scale up the nodes.
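
For anyone hitting the same thing, the assigned roles can be checked with az role assignment list; the assignee and scope below are placeholders:

# list role assignments for a user or service principal on the cluster's resource group (placeholders)
az role assignment list \
  --assignee <userOrServicePrincipalId> \
  --scope /subscriptions/<subscriptionId>/resourceGroups/myResourceGroup \
  --output table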

-- Phan Tung Chau
Source: StackOverflow