Kubernetes Cluster autoscaler not scaling down instances on EKS - just logs that the node is unneeded

9/22/2019

Here are the logs from the autoscaler:

I0922 17:08:33.857348       1 auto_scaling_groups.go:102] Updating ASG terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008
I0922 17:08:33.857380       1 aws_manager.go:152] Refreshed ASG list, next refresh after 2019-09-22 17:08:43.857375311 +0000 UTC m=+259.289807511
I0922 17:08:33.857465       1 utils.go:526] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0922 17:08:33.857482       1 static_autoscaler.go:261] Filtering out schedulables
I0922 17:08:33.857532       1 static_autoscaler.go:271] No schedulable pods
I0922 17:08:33.857545       1 static_autoscaler.go:279] No unschedulable pods
I0922 17:08:33.857557       1 static_autoscaler.go:333] Calculating unneeded nodes
I0922 17:08:33.857601       1 scale_down.go:376] Scale-down calculation: ignoring 2 nodes unremovable in the last 5m0s
I0922 17:08:33.857621       1 scale_down.go:407] Node ip-10-0-1-135.us-west-2.compute.internal - utilization 0.055000
I0922 17:08:33.857688       1 static_autoscaler.go:349] ip-10-0-1-135.us-west-2.compute.internal is unneeded since 2019-09-22 17:05:07.299351571 +0000 UTC m=+42.731783882 duration 3m26.405144434s
I0922 17:08:33.857703       1 static_autoscaler.go:360] Scale down status: unneededOnly=true lastScaleUpTime=2019-09-22 17:04:42.29864432 +0000 UTC m=+17.731076395 lastScaleDownDeleteTime=2019-09-22 17:04:42.298645611 +0000 UTC m=+17.731077680 lastScaleDownFailTime=2019-09-22 17:04:42.298646962 +0000 UTC m=+17.731079033 scaleDownForbidden=false isDeleteInProgress=false

If it's unneeded, then what is the next step? What is it waiting for?

I've drained one node:

kubectl get nodes -o=wide
NAME                                       STATUS                     ROLES    AGE   VERSION               INTERNAL-IP   EXTERNAL-IP      OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-0-0-118.us-west-2.compute.internal   Ready                      <none>   46m   v1.13.10-eks-d6460e   10.0.0.118    52.40.115.132    Amazon Linux 2   4.14.138-114.102.amzn2.x86_64   docker://18.6.1
ip-10-0-0-211.us-west-2.compute.internal   Ready                      <none>   44m   v1.13.10-eks-d6460e   10.0.0.211    35.166.57.203    Amazon Linux 2   4.14.138-114.102.amzn2.x86_64   docker://18.6.1
ip-10-0-1-135.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   46m   v1.13.10-eks-d6460e   10.0.1.135    18.237.253.134   Amazon Linux 2   4.14.138-114.102.amzn2.x86_64   docker://18.6.1

Why is it not terminating the instance?

These are the parameters I'm using:

        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=default
        - --scan-interval=25s
        - --scale-down-unneeded-time=30s
        - --nodes=1:20:terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/example-job-runner
        - --logtostderr=true
        - --stderrthreshold=info
        - --v=4
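
As I understand it, the --node-group-auto-discovery filter only matches ASGs that carry both of those tag keys. A quick way to confirm they are present on the ASG (just a sketch, assuming the AWS CLI is configured for the same account and region; the ASG name is copied from the --nodes flag above):

# Output should include both tag keys named in --node-group-auto-discovery
aws autoscaling describe-tags \
    --filters "Name=auto-scaling-group,Values=terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008"
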
-- Chris Stryczynski
amazon-eks
autoscaling
kubernetes

1 Answer

9/22/2019

Have you got any of the following?

  • Pods running on that node without a controller object (e.g. a Deployment / ReplicaSet)
  • Any kube-system pods that don't have a pod disruption budget
  • Pods with local storage or any custom affinity/anti-affinity/nodeSelectors
  • An annotation set on that node that prevents cluster-autoscaler from scaling it down

Your config/start-up options for CA look good to me though.

I can only imagine it might be something to do with a specific pod running on that node. Maybe run through the pods (particularly the kube-system ones) running on the nodes that are not scaling down and check them against the list above.
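
For example, something along these lines (a sketch, substitute your own node and pod names) should surface most of the blockers above:

# Pods still scheduled on the node that refuses to scale down
kubectl get pods --all-namespaces -o wide \
    --field-selector spec.nodeName=ip-10-0-1-135.us-west-2.compute.internal

# PodDisruptionBudgets covering kube-system pods
kubectl get pdb -n kube-system

# A pod with an empty ownerReferences list has no controller (first item in the list above)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.ownerReferences}'

# Node annotations: look for cluster-autoscaler.kubernetes.io/scale-down-disabled set to "true"
kubectl get node ip-10-0-1-135.us-west-2.compute.internal -o jsonpath='{.metadata.annotations}'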

These two sections of the Cluster Autoscaler FAQ have some good items to check that might be preventing CA from scaling down nodes:

  • Low utilization nodes but not scaling down, why?
  • What types of pods can prevent CA from removing a node?

-- Shogan
Source: StackOverflow