EKS Cluster Autoscaler Pods Unschedulable

1/7/2022

The Problem

I am trying to deploy a Cluster Autoscaler on EKS. I followed the EKS Workshop guide by AWS: https://www.eksworkshop.com/beginner/080_scaling/deploy_ca/

After deploying the autoscaler and scaling up an nginx app, I checked the autoscaler logs. They show that the pods which do not fit on any running node are marked 'unschedulable', yet no scale-up is triggered even though the ASG limits leave room for more nodes. The pods stay stuck in the 'Pending' state. Below are my ASG configuration, CA configuration, CA logs, and nginx pod configuration. Please help me figure out why my cluster will not autoscale.

ASG Configuration

aws autoscaling describe-auto-scaling-groups --query "AutoScalingGroups[? Tags[? (Key=='eks:cluster-name') && Value=='esdeeplearning']].[AutoScalingGroupName, MinSize, MaxSize,DesiredCapacity]" --output table

----------------------------------------------------------------------------------------
|                               DescribeAutoScalingGroups                              |
+----------------------------------------------------------------------+----+-----+----+
|  eks-autoscaler-76bf19b9-b7ca-cdc2-46f5-29d621e9c4bf                 |  1 |  2  |  1 |
|  eks-kubecontrol-02bf192a-568b-2af9-457d-549e36369ecc                |  0 |  10 |  1 |
|  eks-training-deployment-f6bed5ca-eaea-8ee4-404a-f6a050d3ba93        |  0 |  10 |  0 |
|  eks-training-deployment-large-7ebf19c2-59fe-c7e5-13d7-240be78eeb08  |  0 |  2  |  0 |
+----------------------------------------------------------------------+----+-----+----+
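
To make the "sufficient ASG settings" claim concrete, here is a minimal Python sketch that computes how many more instances each group could launch. It assumes the columns follow the order of the --query above (AutoScalingGroupName, MinSize, MaxSize, DesiredCapacity). The names parse_rows and headroom are hypothetical helpers, not part of any AWS tool, and the result says nothing about whether the autoscaler can actually see these groups.

```python
# Table rows copied from the aws CLI output above.
TABLE = """\
|  eks-autoscaler-76bf19b9-b7ca-cdc2-46f5-29d621e9c4bf                 |  1 |  2  |  1 |
|  eks-kubecontrol-02bf192a-568b-2af9-457d-549e36369ecc                |  0 |  10 |  1 |
|  eks-training-deployment-f6bed5ca-eaea-8ee4-404a-f6a050d3ba93        |  0 |  10 |  0 |
|  eks-training-deployment-large-7ebf19c2-59fe-c7e5-13d7-240be78eeb08  |  0 |  2  |  0 |
"""


def parse_rows(table):
    """Return (name, min_size, max_size, desired) tuples from the table text."""
    rows = []
    for line in table.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Keep only data rows: a name followed by three integer columns.
        if len(cells) == 4 and all(c.isdigit() for c in cells[1:]):
            rows.append((cells[0], int(cells[1]), int(cells[2]), int(cells[3])))
    return rows


def headroom(rows):
    """Map each group name to MaxSize - DesiredCapacity."""
    return {name: max_size - desired for name, _min, max_size, desired in rows}


rows = parse_rows(TABLE)
room = headroom(rows)
for name, extra in room.items():
    print(f"{name}: up to {extra} more instance(s)")
total_headroom = sum(room.values())
print("total headroom:", total_headroom)
```

Under these assumptions every group is below its MaxSize, so the ASG limits themselves do not appear to be what blocks a scale-up.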

CA Configuration

kubectl describe pod cluster-autoscaler-5cb5b99c7b-5phqn -n kube-system

Name:         cluster-autoscaler-5cb5b99c7b-5phqn
Namespace:    kube-system
Priority:     0
Node:         ip-192-168-1-46.us-east-2.compute.internal/192.168.1.46
Start Time:   Fri, 07 Jan 2022 01:47:27 -0600
Labels:       app=cluster-autoscaler
              pod-template-hash=5cb5b99c7b
Annotations:  kubernetes.io/psp: eks.privileged
              prometheus.io/port: 8085
              prometheus.io/scrape: true
Status:       Running
IP:           192.168.1.58
IPs:
  IP:           192.168.1.58
Controlled By:  ReplicaSet/cluster-autoscaler-5cb5b99c7b
Containers:
  cluster-autoscaler:
    Container ID:  docker://34e283e8127218682c61a71ea9aea14395d3c3b36a94f9b38584411f66a7410e
    Image:         us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.21.2
    Image ID:      docker-pullable://us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler@sha256:69e980d32052fa6c38e8744f1db9b176f11a2a2eb0d5a1db8990139dd29ded4b
    Port:          <none>
    Host Port:     <none>
    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --expander=least-waste
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/eksworkshop-eksctl
      --balance-similar-node-groups
      --skip-nodes-with-system-pods=false
    State:          Running
      Started:      Fri, 07 Jan 2022 01:47:32 -0600
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  500Mi
    Requests:
      cpu:     100m
      memory:  500Mi
    Environment:
      AWS_DEFAULT_REGION:           us-east-2
      AWS_REGION:                   us-east-2
      AWS_ROLE_ARN:                 arn:aws:iam::385352568821:role/eksctl-esdeeplearning-addon-iamserviceaccoun-Role1-1VLHNE7BSJ2NZ
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dg88f (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs/ca-bundle.crt
    HostPathType:  
  kube-api-access-dg88f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
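
One detail that may be worth checking in the output above is the --node-group-auto-discovery flag. As far as I understand the AWS provider, it only discovers ASGs that carry every tag key listed after asg:tag=. The Python sketch below extracts those keys and checks whether any of them ends with the cluster name. The cluster name esdeeplearning is taken from the eks:cluster-name filter in the ASG query; the k8s.io/cluster-autoscaler/<cluster-name> key convention is the one the workshop guide uses, and the helper names are hypothetical. Whether the ASGs actually carry these tags is not shown in this post.

```python
# Flag value copied from the Command section of the pod description above.
FLAG = (
    "--node-group-auto-discovery=asg:tag="
    "k8s.io/cluster-autoscaler/enabled,"
    "k8s.io/cluster-autoscaler/eksworkshop-eksctl"
)


def discovery_tag_keys(flag):
    """Return the ASG tag keys listed in a --node-group-auto-discovery flag."""
    # Expected form: --node-group-auto-discovery=asg:tag=<key1>,<key2>,...
    _name, _sep, value = flag.partition("=")
    prefix = "asg:tag="
    if not value.startswith(prefix):
        raise ValueError(f"unexpected discovery spec: {value!r}")
    return [key for key in value[len(prefix):].split(",") if key]


def tags_mention_cluster(flag, cluster_name):
    """True if any discovery tag key ends with '/<cluster_name>'."""
    suffix = "/" + cluster_name
    return any(key.endswith(suffix) for key in discovery_tag_keys(flag))


print(discovery_tag_keys(FLAG))
print("mentions esdeeplearning:", tags_mention_cluster(FLAG, "esdeeplearning"))
```

Here the second tag key ends in eksworkshop-eksctl (the workshop's cluster name) rather than esdeeplearning, so the check reports no match for this cluster.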

CA Logs

kubectl -n kube-system logs -f deployment/cluster-autoscaler

I0107 20:03:19.927177       1 static_autoscaler.go:228] Starting main loop
I0107 20:03:20.013621       1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: []
I0107 20:03:20.013645       1 auto_scaling.go:199] 0 launch configurations already in cache
I0107 20:03:20.013655       1 aws_manager.go:269] Refreshed ASG list, next refresh after 2022-01-07 20:04:20.013650977 +0000 UTC m=+44207.147803630
I0107 20:03:20.014174       1 filter_out_schedulable.go:65] Filtering out schedulables
I0107 20:03:20.014193       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0107 20:03:20.014419       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0107 20:03:20.014434       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0107 20:03:20.014448       1 filter_out_schedulable.go:82] No schedulable pods
I0107 20:03:20.014468       1 klogx.go:86] Pod default/esdeeplearning-test-model-09 is unschedulable
I0107 20:03:20.014477       1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-bzj7r is unschedulable
I0107 20:03:20.014485       1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-j5gtn is unschedulable
I0107 20:03:20.014490       1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-xs2xm is unschedulable
I0107 20:03:20.014495       1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-mszph is unschedulable
I0107 20:03:20.014500       1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-f8tt6 is unschedulable
I0107 20:03:20.014506       1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-hqwwt is unschedulable
I0107 20:03:20.014512       1 klogx.go:86] Pod default/nginx-to-scaleout-6fcd49fb84-56zvp is unschedulable
I0107 20:03:20.014546       1 scale_up.go:376] Upcoming 0 nodes
I0107 20:03:20.014663       1 scale_up.go:453] No expansion options
I0107 20:03:20.014797       1 static_autoscaler.go:448] Calculating unneeded nodes
I0107 20:03:20.014815       1 pre_filtering_processor.go:57] Skipping ip-192-168-1-46.us-east-2.compute.internal - no node group config
I0107 20:03:20.014823       1 pre_filtering_processor.go:57] Skipping ip-192-168-2-33.us-east-2.compute.internal - no node group config
I0107 20:03:20.014851       1 static_autoscaler.go:502] Scale down status: unneededOnly=false lastScaleUpTime=2022-01-07 07:48:01.208866522 +0000 UTC m=+28.343019164 lastScaleDownDeleteTime=2022-01-07 07:48:01.208866584 +0000 UTC m=+28.343019228 lastScaleDownFailTime=2022-01-07 07:48:01.208866648 +0000 UTC m=+28.343019291 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0107 20:03:20.014880       1 static_autoscaler.go:515] Starting scale down
I0107 20:03:20.014967       1 scale_down.go:917] No candidates for scale down
I0107 20:03:23.575204       1 reflector.go:255] Listing and watching *v1beta1.CSIStorageCapacity from k8s.io/client-go/informers/factory.go:134
E0107 20:03:23.587685       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.CSIStorageCapacity: failed to list *v1beta1.CSIStorageCapacity: csistoragecapacities.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "csistoragecapacities" in API group "storage.k8s.io" at the cluster scope
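
The lines in this log that seem most relevant are the ASG list, the "No expansion options" message, and the "no node group config" skips. The Python sketch below pulls just those out of a log excerpt. The regular expressions are written against the exact lines shown above, and the assumption that a non-empty ASG list is printed as space-separated names inside the brackets is mine; summarize is a hypothetical helper, not part of the autoscaler.

```python
import re

# Excerpt copied from the log above.
LOG = """\
I0107 20:03:20.013621       1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: []
I0107 20:03:20.014663       1 scale_up.go:453] No expansion options
I0107 20:03:20.014815       1 pre_filtering_processor.go:57] Skipping ip-192-168-1-46.us-east-2.compute.internal - no node group config
I0107 20:03:20.014823       1 pre_filtering_processor.go:57] Skipping ip-192-168-2-33.us-east-2.compute.internal - no node group config
"""

ASG_LINE = re.compile(r"Regenerating instance to ASG map for ASGs: \[(.*?)\]")
SKIP_LINE = re.compile(r"Skipping (\S+) - no node group config")


def summarize(log_text):
    """Collect the discovered ASGs, skipped nodes, and expansion status."""
    summary = {"asgs": [], "skipped_nodes": [], "no_expansion": False}
    for line in log_text.splitlines():
        asg_match = ASG_LINE.search(line)
        if asg_match:
            # Keep the most recent list; an empty bracket pair yields [].
            summary["asgs"] = asg_match.group(1).split()
        skip_match = SKIP_LINE.search(line)
        if skip_match:
            summary["skipped_nodes"].append(skip_match.group(1))
        if "No expansion options" in line:
            summary["no_expansion"] = True
    return summary


summary = summarize(LOG)
print(summary)
```

For the excerpt above, the ASG list comes back empty and both running nodes are reported without a node group, which is consistent with the autoscaler not having discovered any of the four ASGs.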

Nginx Pod Configuration

kubectl describe pod nginx-to-scaleout-6fcd49fb84-56zvp

Name:           nginx-to-scaleout-6fcd49fb84-56zvp
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=nginx
                pod-template-hash=6fcd49fb84
                service=nginx
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/nginx-to-scaleout-6fcd49fb84
Containers:
  nginx-to-scaleout:
    Image:      nginx
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:        500m
      memory:     512Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2rqm4 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-2rqm4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Normal   NotTriggerScaleUp  4m3s (x4501 over 12h)  cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling   24s (x664 over 12h)    default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 Too many pods.
-- iamPres
amazon-web-services
autoscaling
kubernetes
