I have an AKS cluster with 3 nodes and tried to manually scale out from 3 to 4 nodes. The scale-out itself went fine, but after ~20 minutes all 4 nodes went into a NotReady state and none of the kube-system pods were Ready anymore.
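For reference, a manual scale-out of the default node pool can be done with az aks scale (the resource group, cluster, and pool names below are placeholders, not my real ones):

az aks scale \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --nodepool-name agentpool \
  --node-count 4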
Right after scaling out, all four nodes reported Ready:

NAME                                STATUS   ROLES   AGE   VERSION
aks-agentpool-40760006-vmss000000   Ready    agent   16m   v1.18.14
aks-agentpool-40760006-vmss000001   Ready    agent   17m   v1.18.14
aks-agentpool-40760006-vmss000002   Ready    agent   16m   v1.18.14
aks-agentpool-40760006-vmss000003   Ready    agent   11m   v1.18.14
About 20 minutes later they were NotReady:

NAME                                STATUS     ROLES   AGE   VERSION
aks-agentpool-40760006-vmss000000   NotReady   agent   23m   v1.18.14
aks-agentpool-40760006-vmss000002   NotReady   agent   24m   v1.18.14
aks-agentpool-40760006-vmss000003   NotReady   agent   19m   v1.18.14
kubectl get pods -n kube-system
NAME                                  READY   STATUS        RESTARTS   AGE
coredns-748cdb7bf4-7frq2              0/1     Pending       0          10m
coredns-748cdb7bf4-vg5nn              0/1     Pending       0          10m
coredns-748cdb7bf4-wrhxs              1/1     Terminating   0          28m
coredns-autoscaler-868b684fd4-2gb8f   0/1     Pending       0          10m
kube-proxy-p6wmv                      1/1     Running       0          28m
kube-proxy-sksz6                      1/1     Running       0          23m
kube-proxy-vpb2g                      1/1     Running       0          28m
metrics-server-58fdc875d5-sbckj       0/1     Pending       0          10m
tunnelfront-5d74798f6b-w6rvn          0/1     Pending       0          10m
Describing one of the NotReady nodes (kubectl describe node aks-agentpool-40760006-vmss000000) shows these events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 25m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 25m (x2 over 25m) kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 25m (x2 over 25m) kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 25m (x2 over 25m) kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 25m kubelet Updated Node Allocatable limit across pods
Normal Starting 25m kube-proxy Starting kube-proxy.
Normal NodeReady 24m kubelet Node aks-agentpool-40760006-vmss000000 status is now: NodeReady
Warning FailedToCreateRoute 5m5s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 50.264754ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m55s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 45.945658ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m45s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 46.180158ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m35s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 46.550858ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m25s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 44.74355ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m15s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 42.428456ms: timed out waiting for the condition
Warning FailedToCreateRoute 4m5s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 41.664858ms: timed out waiting for the condition
Warning FailedToCreateRoute 3m55s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 48.456954ms: timed out waiting for the condition
Warning FailedToCreateRoute 3m45s route_controller Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 38.611964ms: timed out waiting for the condition
Warning FailedToCreateRoute 65s (x16 over 3m35s) route_controller (combined from similar events): Could not create route e496c1aa-be11-412b-b820-178d83b42f29 10.244.2.0/24 for node aks-agentpool-40760006-vmss000000 after 13.972487ms: timed out waiting for the condition
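The repeated FailedToCreateRoute warnings mean the route controller cannot write the pod CIDR routes into the route table that kubenet networking depends on, which usually points to a permissions or networking configuration issue on the cluster identity rather than a problem on the nodes themselves. As a rough sketch of how to check (the node resource group and route table names below are assumptions; use the ones from your own MC_* resource group):

# List the routes the route controller has managed to create so far
az network route-table route list \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --route-table-name aks-agentpool-40760006-routetable \
  --output table

# Show which identity the cluster runs with, so you can verify it has
# Network Contributor rights on the route table / subnet
az aks show \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --query identity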
You can use the cluster autoscaler option to avoid such situations in the future.
To keep up with application demands in Azure Kubernetes Service (AKS), you may need to adjust the number of nodes that run your workloads. The cluster autoscaler component can watch for pods in your cluster that can't be scheduled because of resource constraints. When issues are detected, the number of nodes in a node pool is increased to meet the application demand. Nodes are also regularly checked for a lack of running pods, with the number of nodes then decreased as needed. This ability to automatically scale up or down the number of nodes in your AKS cluster lets you run an efficient, cost-effective cluster.
You can update an existing AKS cluster to enable the cluster autoscaler; the following uses your current resource group:
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3
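To confirm the autoscaler is enabled with the bounds you expect, you can query the node pool profiles afterwards (group and cluster names are the same placeholders as above):

az aks show \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --query "agentPoolProfiles[].{name:name, autoscaling:enableAutoScaling, min:minCount, max:maxCount}" \
  --output table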
Seems it is OK now. It turned out I was lacking the permission needed to scale up the nodes.