Here are the logs from the autoscaler:
I0922 17:08:33.857348 1 auto_scaling_groups.go:102] Updating ASG terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008
I0922 17:08:33.857380 1 aws_manager.go:152] Refreshed ASG list, next refresh after 2019-09-22 17:08:43.857375311 +0000 UTC m=+259.289807511
I0922 17:08:33.857465 1 utils.go:526] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0922 17:08:33.857482 1 static_autoscaler.go:261] Filtering out schedulables
I0922 17:08:33.857532 1 static_autoscaler.go:271] No schedulable pods
I0922 17:08:33.857545 1 static_autoscaler.go:279] No unschedulable pods
I0922 17:08:33.857557 1 static_autoscaler.go:333] Calculating unneeded nodes
I0922 17:08:33.857601 1 scale_down.go:376] Scale-down calculation: ignoring 2 nodes unremovable in the last 5m0s
I0922 17:08:33.857621 1 scale_down.go:407] Node ip-10-0-1-135.us-west-2.compute.internal - utilization 0.055000
I0922 17:08:33.857688 1 static_autoscaler.go:349] ip-10-0-1-135.us-west-2.compute.internal is unneeded since 2019-09-22 17:05:07.299351571 +0000 UTC m=+42.731783882 duration 3m26.405144434s
I0922 17:08:33.857703 1 static_autoscaler.go:360] Scale down status: unneededOnly=true lastScaleUpTime=2019-09-22 17:04:42.29864432 +0000 UTC m=+17.731076395 lastScaleDownDeleteTime=2019-09-22 17:04:42.298645611 +0000 UTC m=+17.731077680 lastScaleDownFailTime=2019-09-22 17:04:42.298646962 +0000 UTC m=+17.731079033 scaleDownForbidden=false isDeleteInProgress=false
If it's unneeded, then what is the next step? What is it waiting for?
I've drained one node:
kubectl get nodes -o=wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-0-118.us-west-2.compute.internal Ready <none> 46m v1.13.10-eks-d6460e 10.0.0.118 52.40.115.132 Amazon Linux 2 4.14.138-114.102.amzn2.x86_64 docker://18.6.1
ip-10-0-0-211.us-west-2.compute.internal Ready <none> 44m v1.13.10-eks-d6460e 10.0.0.211 35.166.57.203 Amazon Linux 2 4.14.138-114.102.amzn2.x86_64 docker://18.6.1
ip-10-0-1-135.us-west-2.compute.internal Ready,SchedulingDisabled <none> 46m v1.13.10-eks-d6460e 10.0.1.135 18.237.253.134 Amazon Linux 2 4.14.138-114.102.amzn2.x86_64 docker://18.6.1
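(The drain itself was just the standard command, roughly: kubectl drain ip-10-0-1-135.us-west-2.compute.internal --ignore-daemonsets --delete-local-data)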
Why is it not terminating the instance?
These are the parameters I'm using:
- ./cluster-autoscaler
- --cloud-provider=aws
- --namespace=default
- --scan-interval=25s
- --scale-down-unneeded-time=30s
- --nodes=1:20:terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/example-job-runner
- --logtostderr=true
- --stderrthreshold=info
- --v=4
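I haven't set any of the other scale-down flags, so as far as I can tell they're at their documented defaults, e.g.:
- --scale-down-delay-after-add=10m
- --scale-down-utilization-threshold=0.5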
Have you got any of the following on that node? These are the usual blockers listed in the cluster-autoscaler FAQ:
- Pods with a restrictive PodDisruptionBudget.
- kube-system pods that don't have a PDB, or whose PDB is too restrictive.
- Pods that aren't backed by a controller object (i.e. not created by a Deployment, ReplicaSet, Job, StatefulSet, etc.).
- Pods with local storage.
- Pods that can't be moved elsewhere due to constraints (insufficient resources, non-matching node selectors/affinity, matching anti-affinity, etc.).
- Pods with the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation.
Your config/start-up options for CA look good to me though.
I can only imagine it might be something to do with a specific pod running on that node. Maybe run through the kube-system pods on the nodes that aren't scaling down and check them against the list above.
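If it helps, something like this should show everything still running on the cordoned node, plus any PDBs that could block eviction (node name taken from your output above):
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-1-135.us-west-2.compute.internal
kubectl get pdb --all-namespaces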
These two FAQ sections have some good items to check that might be causing CA not to scale down nodes: "I have a couple of nodes with low utilization, but they are not scaled down. Why?" and "What types of pods can prevent CA from removing a node?"
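If the blocker turns out to be a pod that is actually safe to move (say, a kube-system add-on without a PDB), the usual fixes are to give it a PodDisruptionBudget or mark it as evictable. A rough sketch, where the pod name and selector are placeholders rather than anything from your cluster:
kubectl annotate pod <blocking-pod> -n kube-system cluster-autoscaler.kubernetes.io/safe-to-evict=true
kubectl create poddisruptionbudget <addon>-pdb -n kube-system --selector=k8s-app=<addon> --max-unavailable=1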