I am trying to deploy the Cluster Autoscaler on EKS. I followed the EKS Workshop guide from AWS: https://www.eksworkshop.com/beginner/080_scaling/deploy_ca/
After deploying the autoscaler and scaling out an nginx app, the Cluster Autoscaler logs show that it marks the pods that cannot fit on the currently running nodes as unschedulable, yet it never triggers a scale-up, even though the ASGs still have headroom (their max sizes are well above the current desired capacities). The pods stay stuck in the Pending state. Below I show the nginx deployment I scaled, my ASG configuration, the Cluster Autoscaler pod configuration, the Cluster Autoscaler logs, and one of the pending nginx pods. Please help me figure out why my cluster will not scale up.
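For reference, this is roughly the nginx deployment I scaled out. It is essentially the sample deployment from the workshop, and the requests/limits match the pending pod described at the bottom of this post (the file name and the exact replica count I scaled to are from memory, so treat those as approximate):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx-to-scaleout
        resources:
          # 500m CPU / 512Mi memory per replica, same values shown in the pod describe below
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 512Mi

kubectl apply -f nginx-to-scaleout.yaml
kubectl scale --replicas=10 deployment/nginx-to-scaleout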
aws autoscaling describe-auto-scaling-groups --query "AutoScalingGroups[? Tags[? (Key=='eks:cluster-name') && Value=='esdeeplearning']].[AutoScalingGroupName, MinSize, MaxSize,DesiredCapacity]" --output table
----------------------------------------------------------------------------------------
|                              DescribeAutoScalingGroups                               |
+----------------------------------------------------------------------+----+-----+----+
|  eks-autoscaler-76bf19b9-b7ca-cdc2-46f5-29d621e9c4bf                 |  1 |  2  |  1 |
|  eks-kubecontrol-02bf192a-568b-2af9-457d-549e36369ecc                |  0 |  10 |  1 |
|  eks-training-deployment-f6bed5ca-eaea-8ee4-404a-f6a050d3ba93        |  0 |  10 |  0 |
|  eks-training-deployment-large-7ebf19c2-59fe-c7e5-13d7-240be78eeb08  |  0 |  2  |  0 |
+----------------------------------------------------------------------+----+-----+----+
kubectl describe pod cluster-autoscaler-5cb5b99c7b-5phqn -n kube-system
Name:           cluster-autoscaler-5cb5b99c7b-5phqn
Namespace:      kube-system
Priority:       0
Node:           ip-192-168-1-46.us-east-2.compute.internal/192.168.1.46
Start Time:     Fri, 07 Jan 2022 01:47:27 -0600
Labels:         app=cluster-autoscaler
                pod-template-hash=5cb5b99c7b
Annotations:    kubernetes.io/psp: eks.privileged
                prometheus.io/port: 8085
                prometheus.io/scrape: true
Status:         Running
IP:             192.168.1.58
IPs:
  IP:  192.168.1.58
Controlled By:  ReplicaSet/cluster-autoscaler-5cb5b99c7b
Containers:
  cluster-autoscaler:
    Container ID:  docker://34e283e8127218682c61a71ea9aea14395d3c3b36a94f9b38584411f66a7410e
    Image:         us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.21.2
    Image ID:      docker-pullable://us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler@sha256:69e980d32052fa6c38e8744f1db9b176f11a2a2eb0d5a1db8990139dd29ded4b
    Port:          <none>
    Host Port:     <none>
    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --expander=least-waste
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/eksworkshop-eksctl
      --balance-similar-node-groups
      --skip-nodes-with-system-pods=false
    State:          Running
      Started:      Fri, 07 Jan 2022 01:47:32 -0600
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  500Mi
    Requests:
      cpu:     100m
      memory:  500Mi
    Environment:
      AWS_DEFAULT_REGION:           us-east-2
      AWS_REGION:                   us-east-2
      AWS_ROLE_ARN:                 arn:aws:iam::385352568821:role/eksctl-esdeeplearning-addon-iamserviceaccoun-Role1-1VLHNE7BSJ2NZ
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dg88f (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs/ca-bundle.crt
    HostPathType:
  kube-api-access-dg88f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
kubectl -n kube-system logs -f deployment/cluster-autoscaler
I0107 20:03:19.927177 1 static_autoscaler.go:228] Starting main loop
I0107 20:03:20.013621 1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: []
I0107 20:03:20.013645 1 auto_scaling.go:199] 0 launch configurations already in cache
I0107 20:03:20.013655 1 aws_manager.go:269] Refreshed ASG list, next refresh after 2022-01-07 20:04:20.013650977 +0000 UTC m=+44207.147803630
I0107 20:03:20.014174 1 filter_out_schedulable.go:65] Filtering out schedulables
I0107 20:03:20.014193 1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0107 20:03:20.014419 1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0107 20:03:20.014434 1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0107 20:03:20.014448 1 filter_out_schedulable.go:82] No schedulable pods
I0107 20:03:20.014468 1 klogx.go:86] Pod default/esdeeplearning-test-model-09 is unschedulable
I0107 20:03:20.014477 1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-bzj7r is unschedulable
I0107 20:03:20.014485 1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-j5gtn is unschedulable
I0107 20:03:20.014490 1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-xs2xm is unschedulable
I0107 20:03:20.014495 1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-mszph is unschedulable
I0107 20:03:20.014500 1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-f8tt6 is unschedulable
I0107 20:03:20.014506 1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-hqwwt is unschedulable
I0107 20:03:20.014512 1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-56zvp is unschedulable
I0107 20:03:20.014546 1 scale_up.go:376] Upcoming 0 nodes
I0107 20:03:20.014663 1 scale_up.go:453] No expansion options
I0107 20:03:20.014797 1 static_autoscaler.go:448] Calculating unneeded nodes
I0107 20:03:20.014815 1 pre_filtering_processor.go:57] Skipping ip-192-168-1-46.us-east-2.compute.internal - no node group config
I0107 20:03:20.014823 1 pre_filtering_processor.go:57] Skipping ip-192-168-2-33.us-east-2.compute.internal - no node group config
I0107 20:03:20.014851 1 static_autoscaler.go:502] Scale down status: unneededOnly=false lastScaleUpTime=2022-01-07 07:48:01.208866522 +0000 UTC m=+28.343019164 lastScaleDownDeleteTime=2022-01-07 07:48:01.208866584 +0000 UTC m=+28.343019228 lastScaleDownFailTime=2022-01-07 07:48:01.208866648 +0000 UTC m=+28.343019291 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0107 20:03:20.014880 1 static_autoscaler.go:515] Starting scale down
I0107 20:03:20.014967 1 scale_down.go:917] No candidates for scale down
I0107 20:03:23.575204 1 reflector.go:255] Listing and watching *v1beta1.CSIStorageCapacity from k8s.io/client-go/informers/factory.go:134
E0107 20:03:23.587685 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.CSIStorageCapacity: failed to list *v1beta1.CSIStorageCapacity: csistoragecapacities.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "csistoragecapacities" in API group "storage.k8s.io" at the cluster scope
kubectl describe pod nginx-to-scaleout-6fcd49fb84-56zvp
Name:           nginx-to-scaleout-6fcd49fb84-56zvp
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=nginx
                pod-template-hash=6fcd49fb84
                service=nginx
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/nginx-to-scaleout-6fcd49fb84
Containers:
  nginx-to-scaleout:
    Image:      nginx
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     500m
      memory:  512Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2rqm4 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-2rqm4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Normal   NotTriggerScaleUp  4m3s (x4501 over 12h)  cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling   24s (x664 over 12h)    default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 Too many pods.