Kubernetes autoscaler will not scale down nodes

6/10/2020

I'm using the Kubernetes cluster-autoscaler for AWS. I've deployed it with the following container command:

          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes=1:10:nodes.k8s-1-17.dev.platform

However, the autoscaler never seems to initiate a scale-down. The logs show it finding an unused node, but it doesn't scale the node down and doesn't give me an error (the nodes that show "no node group config" are the master nodes).

I0610 22:09:37.164102       1 static_autoscaler.go:147] Starting main loop
I0610 22:09:37.164462       1 utils.go:471] Removing autoscaler soft taint when creating template from node ip-10-141-10-176.ec2.internal
I0610 22:09:37.164805       1 utils.go:626] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0610 22:09:37.164823       1 static_autoscaler.go:303] Filtering out schedulables
I0610 22:09:37.165083       1 static_autoscaler.go:320] No schedulable pods
I0610 22:09:37.165106       1 static_autoscaler.go:328] No unschedulable pods
I0610 22:09:37.165123       1 static_autoscaler.go:375] Calculating unneeded nodes
I0610 22:09:37.165141       1 utils.go:574] Skipping ip-10-141-12-194.ec2.internal - no node group config
I0610 22:09:37.165155       1 utils.go:574] Skipping ip-10-141-15-159.ec2.internal - no node group config
I0610 22:09:37.165167       1 utils.go:574] Skipping ip-10-141-11-28.ec2.internal - no node group config
I0610 22:09:37.165181       1 utils.go:574] Skipping ip-10-141-13-239.ec2.internal - no node group config
I0610 22:09:37.165197       1 utils.go:574] Skipping ip-10-141-10-69.ec2.internal - no node group config
I0610 22:09:37.165378       1 scale_down.go:379] Scale-down calculation: ignoring 4 nodes unremovable in the last 5m0s
I0610 22:09:37.165397       1 scale_down.go:410] Node ip-10-141-10-176.ec2.internal - utilization 0.023750
I0610 22:09:37.165692       1 cluster.go:90] Fast evaluation: ip-10-141-10-176.ec2.internal for removal
I0610 22:09:37.166115       1 cluster.go:225] Pod metrics-storage/querier-6bdfd7c6cf-wm7r8 can be moved to ip-10-141-13-253.ec2.internal
I0610 22:09:37.166227       1 cluster.go:225] Pod metrics-storage/querier-75588cb7dc-cwqpv can be moved to ip-10-141-12-116.ec2.internal
I0610 22:09:37.166398       1 cluster.go:121] Fast evaluation: node ip-10-141-10-176.ec2.internal may be removed
I0610 22:09:37.166553       1 static_autoscaler.go:391] ip-10-141-10-176.ec2.internal is unneeded since 2020-06-10 22:06:55.528567955 +0000 UTC m=+1306.007780301 duration 2m41.635504026s
I0610 22:09:37.166608       1 static_autoscaler.go:402] Scale down status: unneededOnly=true lastScaleUpTime=2020-06-10 21:45:31.739421421 +0000 UTC m=+22.218633767 lastScaleDownDeleteTime=2020-06-10 21:45:31.739421531 +0000 UTC m=+22.218633877 lastScaleDownFailTime=2020-06-10 22:06:44.128044684 +0000 UTC m=+1294.607257070 scaleDownForbidden=false isDeleteInProgress=false

Why is the autoscaler not scaling down nodes?

-- djsumdog
amazon-web-services
autoscaling
kubernetes

3 Answers

3/6/2021

We recently faced a similar issue with cluster-autoscaler. After upgrading our EKS cluster to 1.18, we saw the same log line from the autoscaler:

Skipping ip-xx-xx-xx-xx.ec2.internal - no node group config

The issue was with auto-discovery. Instead of kubernetes.io/cluster/YOUR_CLUSTER_NAME, the following tags should be present on the ASG:

k8s.io/cluster-autoscaler/YOUR_CLUSTER_NAME

k8s.io/cluster-autoscaler/enabled

Please refer to this for more detail: https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.4.0
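
For anyone hitting the same thing, a minimal sketch of adding those tags with the AWS CLI (the ASG name my-nodes-asg and cluster name my-cluster below are placeholders, not values from this question):

          # Placeholder ASG/cluster names; substitute your own.
          aws autoscaling create-or-update-tags --tags \
            "ResourceId=my-nodes-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
            "ResourceId=my-nodes-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=true"

With those tags in place, the autoscaler can be started with --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster rather than an explicit --nodes flag.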

-- Prosenjit Sen
Source: StackOverflow

11/15/2021

We recently found that this was happening because the autoscaler was launched without specifying the correct region; it had defaulted to eu-west-1. After setting this value to the right region and re-launching the autoscaler, our nodes started to be discovered correctly.
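
As a side note, one common way to pin the region is an environment variable on the cluster-autoscaler container; a minimal sketch (the region value is only an example, set it to wherever your ASGs live):

          env:
            - name: AWS_REGION      # example value; use the region of your node group ASGs
              value: us-east-1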

-- Viji
Source: StackOverflow

6/11/2020

It looks to me like cluster-autoscaler is behaving correctly so far. It has decided that one of the nodes can be scaled down:

I0610 22:09:37.166398       1 cluster.go:121] Fast evaluation: node ip-10-141-10-176.ec2.internal may be removed
I0610 22:09:37.166553       1 static_autoscaler.go:391] ip-10-141-10-176.ec2.internal is unneeded since 2020-06-10 22:06:55.528567955 +0000 UTC m=+1306.007780301 duration 2m41.635504026s

However, by default cluster-autoscaler waits 10 minutes before it actually terminates a node. See "How does scale-down work": https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-down-work

From the logs above, your node has only been unneeded for a duration of 2m41s; once that reaches 10 minutes, the scale-down will occur.

After 10 minutes, you should see something like:

I0611 14:58:02.384101       1 static_autoscaler.go:382] <node_name> is unneeded since 2020-06-11 14:47:59.621770178 +0000 UTC m=+1299856.757452427 duration 10m2.760318899s
<...snip...>
I0611 14:58:02.385035       1 scale_down.go:754] Scale-down: removing node <node_name>, utilization: {0.8316326530612245 0.34302838802551344 0.8316326530612245}, pods to reschedule: <...snip...>
I0611 14:58:02.386146       1 event.go:209] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"cluster-autoscaler", Name:"cluster-autoscaler-status", UID:"31a72ce9-9c4e-11ea-a0a8-0201be076001", APIVersion:"v1", ResourceVersion:"13431409", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' Scale-down: removing node <node_name>, utilization: {0.8316326530612245 0.34302838802551344 0.8316326530612245}, pods to reschedule: <...snip...>

I believe this setup is there to prevent thrashing.
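
If the default delay doesn't suit you, those timings are tunable via flags on the autoscaler command. A sketch based on the deployment from the question (the added values are illustrative, not recommendations):

          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes=1:10:nodes.k8s-1-17.dev.platform
            # Illustrative tuning flags (example values):
            - --scale-down-unneeded-time=5m            # how long a node must be unneeded before removal (default 10m)
            - --scale-down-delay-after-add=10m         # wait this long after a scale-up before considering scale-down
            - --scale-down-utilization-threshold=0.5   # nodes below this utilization become removal candidates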

-- weichung.shaw
Source: StackOverflow