I'm using the Kubernetes cluster-autoscaler on AWS. I've deployed it with the following container command:
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --nodes=1:10:nodes.k8s-1-17.dev.platform
However, the autoscaler never seems to initiate a scale-down. The logs show it finds an unused node, but it doesn't scale that node down and doesn't report any error (the nodes that show "no node group config" are the master nodes).
I0610 22:09:37.164102 1 static_autoscaler.go:147] Starting main loop
I0610 22:09:37.164462 1 utils.go:471] Removing autoscaler soft taint when creating template from node ip-10-141-10-176.ec2.internal
I0610 22:09:37.164805 1 utils.go:626] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0610 22:09:37.164823 1 static_autoscaler.go:303] Filtering out schedulables
I0610 22:09:37.165083 1 static_autoscaler.go:320] No schedulable pods
I0610 22:09:37.165106 1 static_autoscaler.go:328] No unschedulable pods
I0610 22:09:37.165123 1 static_autoscaler.go:375] Calculating unneeded nodes
I0610 22:09:37.165141 1 utils.go:574] Skipping ip-10-141-12-194.ec2.internal - no node group config
I0610 22:09:37.165155 1 utils.go:574] Skipping ip-10-141-15-159.ec2.internal - no node group config
I0610 22:09:37.165167 1 utils.go:574] Skipping ip-10-141-11-28.ec2.internal - no node group config
I0610 22:09:37.165181 1 utils.go:574] Skipping ip-10-141-13-239.ec2.internal - no node group config
I0610 22:09:37.165197 1 utils.go:574] Skipping ip-10-141-10-69.ec2.internal - no node group config
I0610 22:09:37.165378 1 scale_down.go:379] Scale-down calculation: ignoring 4 nodes unremovable in the last 5m0s
I0610 22:09:37.165397 1 scale_down.go:410] Node ip-10-141-10-176.ec2.internal - utilization 0.023750
I0610 22:09:37.165692 1 cluster.go:90] Fast evaluation: ip-10-141-10-176.ec2.internal for removal
I0610 22:09:37.166115 1 cluster.go:225] Pod metrics-storage/querier-6bdfd7c6cf-wm7r8 can be moved to ip-10-141-13-253.ec2.internal
I0610 22:09:37.166227 1 cluster.go:225] Pod metrics-storage/querier-75588cb7dc-cwqpv can be moved to ip-10-141-12-116.ec2.internal
I0610 22:09:37.166398 1 cluster.go:121] Fast evaluation: node ip-10-141-10-176.ec2.internal may be removed
I0610 22:09:37.166553 1 static_autoscaler.go:391] ip-10-141-10-176.ec2.internal is unneeded since 2020-06-10 22:06:55.528567955 +0000 UTC m=+1306.007780301 duration 2m41.635504026s
I0610 22:09:37.166608 1 static_autoscaler.go:402] Scale down status: unneededOnly=true lastScaleUpTime=2020-06-10 21:45:31.739421421 +0000 UTC m=+22.218633767 lastScaleDownDeleteTime=2020-06-10 21:45:31.739421531 +0000 UTC m=+22.218633877 lastScaleDownFailTime=2020-06-10 22:06:44.128044684 +0000 UTC m=+1294.607257070 scaleDownForbidden=false isDeleteInProgress=false
Why is the autoscaler not scaling down nodes?
We recently faced a similar issue with cluster-autoscaler. After upgrading our EKS cluster to 1.18, we observed the same kind of log entry from the autoscaler:
Skipping ip-xx-xx-xx-xx.ec2.internal - no node group config
The issue was with autoDiscovery. Instead of kubernetes.io/cluster/YOUR_CLUSTER_NAME, the ASG needs to carry the following tags:
k8s.io/cluster-autoscaler/YOUR_CLUSTER_NAME
k8s.io/cluster-autoscaler/enabled
Please refer to this for more detail: https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.4.0
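For example, when installing via the Helm chart linked above, auto-discovery is configured with values along these lines (a sketch; my-cluster and us-east-1 are placeholders, not values from the question):
# values.yaml sketch for the cluster-autoscaler Helm chart
autoDiscovery:
  clusterName: my-cluster  # the chart discovers ASGs tagged k8s.io/cluster-autoscaler/my-cluster
awsRegion: us-east-1       # region where the tagged ASGs live
The ASGs themselves still need both the k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/my-cluster tags for the autoscaler to pick them up.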
In our case, this was happening because the autoscaler was launched without the correct region specified; it was defaulting to eu-west-1. After setting the right region and re-launching the autoscaler, our nodes started to be discovered correctly.
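If you deploy with a raw Deployment manifest, this usually comes down to making the region explicit on the container; a minimal sketch (the region value is a placeholder):
# Excerpt from the cluster-autoscaler container spec (sketch)
env:
  - name: AWS_REGION
    value: us-east-1  # set this to the region your node groups actually run in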
It looks to me like cluster-autoscaler is behaving correctly so far. It has decided one of the nodes can be scaled down:
I0610 22:09:37.166398 1 cluster.go:121] Fast evaluation: node ip-10-141-10-176.ec2.internal may be removed
I0610 22:09:37.166553 1 static_autoscaler.go:391] ip-10-141-10-176.ec2.internal is unneeded since 2020-06-10 22:06:55.528567955 +0000 UTC m=+1306.007780301 duration 2m41.635504026s
However, by default cluster-autoscaler will wait 10 minutes before it actually terminates the node. See "How does scale-down work?":
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-down-work
From your logs above, the node has only been unneeded for duration 2m41s; once it reaches 10 minutes, the scale-down will occur.
After 10 minutes, you should see something like:
I0611 14:58:02.384101 1 static_autoscaler.go:382] <node_name> is unneeded since 2020-06-11 14:47:59.621770178 +0000 UTC m=+1299856.757452427 duration 10m2.760318899s
<...snip...>
I0611 14:58:02.385035 1 scale_down.go:754] Scale-down: removing node <node_name>, utilization: {0.8316326530612245 0.34302838802551344 0.8316326530612245}, pods to reschedule: <...snip...>
I0611 14:58:02.386146 1 event.go:209] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"cluster-autoscaler", Name:"cluster-autoscaler-status", UID:"31a72ce9-9c4e-11ea-a0a8-0201be076001", APIVersion:"v1", ResourceVersion:"13431409", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' Scale-down: removing node <node_name>, utilization: {0.8316326530612245 0.34302838802551344 0.8316326530612245}, pods to reschedule: <...snip...>
I believe this delay exists to prevent thrashing.
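If 10 minutes is too long for your workload, that window is tunable through flags on the container command; a sketch with illustrative values (not recommendations):
command:
  - ./cluster-autoscaler
  - --scale-down-unneeded-time=5m     # how long a node must be unneeded before it is eligible for removal (default 10m)
  - --scale-down-delay-after-add=10m  # how long after a scale-up before scale-down is evaluated again (default 10m)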