Understanding Kubernetes cluster scaling

6/8/2020

I am using AWS EKS with t3.medium instances, so each node has 2 vCPU (2000m) and 4 GiB of RAM.

I am running 6 different apps on the cluster with these CPU request definitions:

name    request   replicas   total-cpu
app#1   300m      x2         600m
app#2   100m      x4         400m
app#3   150m      x1         150m
app#4   300m      x1         300m
app#5   100m      x1         100m
app#6   150m      x1         150m

Adding these up, all the apps together request 1700m of CPU. I also have an HPA with a 60% CPU target for app#1 and app#2. So I expect to need just one node, or maybe two (because of the kube-system pods), but the cluster always runs with 3 nodes. It looks like I have misunderstood how autoscaling works.
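The sum above can be checked with a short Python sketch. The dictionary simply restates the table; the variable names are illustrative and not part of any Kubernetes API. It covers CPU requests only and says nothing about memory requests.

```python
# CPU request per replica (in millicores) and replica count, copied from the table above.
cpu_requests = {
    "app#1": (300, 2),
    "app#2": (100, 4),
    "app#3": (150, 1),
    "app#4": (300, 1),
    "app#5": (100, 1),
    "app#6": (150, 1),
}

# Total requested CPU across all replicas of all apps.
total_millicores = sum(request * replicas for request, replicas in cpu_requests.values())

# Nominal CPU of one t3.medium node (2 vCPU); the allocatable amount is lower in practice.
node_millicores = 2000

print(f"total CPU requests: {total_millicores}m of {node_millicores}m")
```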

$ kubectl top nodes
NAME                                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-*.eu-central-1.compute.internal    221m         11%    631Mi           18%
ip-*.eu-central-1.compute.internal    197m         10%    718Mi           21%
ip-*.eu-central-1.compute.internal   307m         15%    801Mi           23%

As you can see, the nodes are only at 10-15% CPU usage. How can I optimize node scaling? Why are there 3 nodes?

$ kubectl get hpa
NAME                       REFERENCE                             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
app#1   Deployment/easyinventory-deployment   37%/60%   1         5         3          5d16h
app#2   Deployment/poolinventory-deployment   64%/60%   1         5         4          4d10h
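For reference, the HPA's documented scaling rule is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), and no change is made while the ratio stays within a tolerance (0.1 by default). The Python sketch below applies that rule to the two rows above. It is a simplification: it ignores the min/max replica clamp, the stabilization window, and pods that are not ready, and the function name is illustrative.

```python
import math

def desired_replicas(current_replicas, current_pct, target_pct, tolerance=0.1):
    """Simplified HPA rule: scale by the usage ratio unless it is within tolerance."""
    ratio = current_pct / target_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target, no scaling
    return math.ceil(current_replicas * ratio)

# Values taken from the `kubectl get hpa` output above.
app1 = desired_replicas(current_replicas=3, current_pct=37, target_pct=60)
app2 = desired_replicas(current_replicas=4, current_pct=64, target_pct=60)
print("app#1 desired replicas:", app1)
print("app#2 desired replicas:", app2)
```

Under these assumptions app#2 sits just inside the tolerance band, so a small rise in its CPU usage would be enough to add a replica.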

UPDATE #1

I have pod disruption budget for kube-system pods

kubectl create poddisruptionbudget pdb-event --namespace=kube-system --selector k8s-app=event-exporter --max-unavailable 1 
kubectl create poddisruptionbudget pdb-fluentd --namespace=kube-system --selector k8s-app=fluentd-gcp-scaler --max-unavailable 1
kubectl create poddisruptionbudget pdb-heapster --namespace=kube-system --selector k8s-app=heapster --max-unavailable 1 
kubectl create poddisruptionbudget pdb-dns --namespace=kube-system --selector k8s-app=kube-dns --max-unavailable 1 
kubectl create poddisruptionbudget pdb-dnsauto --namespace=kube-system --selector k8s-app=kube-dns-autoscaler --max-unavailable 1 
kubectl create poddisruptionbudget pdb-glbc --namespace=kube-system --selector k8s-app=glbc --max-unavailable 1 
kubectl create poddisruptionbudget pdb-metadata --namespace=kube-system --selector app=metadata-agent-cluster-level --max-unavailable 1 
kubectl create poddisruptionbudget pdb-kubeproxy --namespace=kube-system --selector component=kube-proxy --max-unavailable 1 
kubectl create poddisruptionbudget pdb-metrics --namespace=kube-system --selector k8s-app=metrics-server --max-unavailable 1
#source: https://gist.github.com/kenthua/fc06c6ea52a25a51bc07e70c8f781f8f

UPDATE #2

I figured out that the 3rd node is not always live: k8s scales down to 2 nodes, but after a few minutes it scales up to 3 nodes again, then back down to 2, over and over.

$ kubectl describe nodes

# Node 1
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1010m (52%)   1300m (67%)
  memory                      3040Mi (90%)  3940Mi (117%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
# Node 2
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1060m (54%)   1850m (95%)
  memory                      3300Mi (98%)  4200Mi (125%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0

UPDATE #3

I0608 11:03:21.965642       1 static_autoscaler.go:192] Starting main loop
I0608 11:03:21.965976       1 utils.go:590] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0608 11:03:21.965996       1 filter_out_schedulable.go:65] Filtering out schedulables
I0608 11:03:21.966120       1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966164       1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966175       1 filter_out_schedulable.go:90] No schedulable pods
I0608 11:03:21.966202       1 static_autoscaler.go:334] No unschedulable pods
I0608 11:03:21.966257       1 static_autoscaler.go:381] Calculating unneeded nodes
I0608 11:03:21.966336       1 scale_down.go:437] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I0608 11:03:21.966359       1 scale_down.go:468] Node ip-*-93.eu-central-1.compute.internal - memory utilization 0.909449
I0608 11:03:21.966411       1 scale_down.go:472] Node ip-*-93.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.909449)
I0608 11:03:21.966460       1 scale_down.go:468] Node ip-*-115.eu-central-1.compute.internal - memory utilization 0.987231
I0608 11:03:21.966469       1 scale_down.go:472] Node ip-*-115.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.987231)
I0608 11:03:21.966551       1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0608 11:03:21.966578       1 static_autoscaler.go:453] Starting scale down
I0608 11:03:21.966667       1 scale_down.go:785] No candidates for scale down

UPDATE #4

According to the autoscaler logs, it was ignoring ip-*-145.eu-central-1.compute.internal for scale-down for some reason. I wondered what would happen, so I terminated the instance directly from the EC2 console. These lines then appeared in the autoscaler logs:

I0608 11:10:43.747445       1 scale_down.go:517] Finding additional 1 candidates for scale down.
I0608 11:10:43.747477       1 cluster.go:93] Fast evaluation: ip-*-145.eu-central-1.compute.internal for removal
I0608 11:10:43.747540       1 cluster.go:248] Evaluation ip-*-115.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747549       1 cluster.go:248] Evaluation ip-*-93.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747557       1 cluster.go:129] Fast evaluation: node ip-*-145.eu-central-1.compute.internal is not suitable for removal: failed to find place for default/app2-848db65964-9nr2m
I0608 11:10:43.747569       1 scale_down.go:554] 1 nodes found to be unremovable in simulation, will re-check them at 2020-06-08 11:15:43.746773707 +0000 UTC m=+151098.489673532
I0608 11:10:43.747596       1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false

As far as I can see, the node is not being scaled down because no other node can fit "app2". But the app's memory request is 700Mi, and at the moment the other nodes appear to have enough memory for app2:

$ kubectl top nodes
NAME                                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-0-93.eu-central-1.compute.internal    386m         20%    920Mi           27%
ip-10-0-1-115.eu-central-1.compute.internal   298m         15%    794Mi           23%

I still have no idea why the autoscaler is not moving app2 to one of the other available nodes and scaling down ip-*-145.

-- Eray
amazon-eks
amazon-web-services
kubernetes

1 Answer

6/9/2020

How Pods with resource requests are scheduled.

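A request is the amount guaranteed for the container, so the scheduler will not place a pod on a node whose unreserved capacity is smaller than the pod's request. Note that kubectl top nodes reports actual usage, while the scheduler and the cluster autoscaler work from requests. In your case, the memory requests on the 2 existing nodes already take up nearly all of their capacity (90% and 98% in the kubectl describe nodes output), so ip-*-145 cannot be scaled down; otherwise app2 would have nowhere to go.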
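A rough Python sketch of that fit check, using the numbers from the question. Two things here are assumptions: each node's allocatable memory is inferred by dividing its requested memory by the utilization the autoscaler logged, and the pairing of the "Node 1" / "Node 2" describe output with the -93 / -115 log lines is a guess based on the matching percentages. The function names are illustrative and not part of the scheduler or autoscaler.

```python
# Requested memory (Mi) from `kubectl describe nodes`, paired (by assumption) with
# the memory utilization the cluster autoscaler logged for each node.
nodes = {
    "ip-*-93": {"requested_mi": 3040, "utilization": 0.909449},
    "ip-*-115": {"requested_mi": 3300, "utilization": 0.987231},
}

APP2_REQUEST_MI = 700  # app2's memory request, as stated in the question

def free_by_requests(node):
    """Memory (Mi) not yet reserved by requests; allocatable is inferred, not measured."""
    allocatable_mi = node["requested_mi"] / node["utilization"]
    return allocatable_mi - node["requested_mi"]

def fits(pod_request_mi, node):
    """A pod fits only if its request is within the node's unreserved memory."""
    return pod_request_mi <= free_by_requests(node)

for name, node in nodes.items():
    print(f"{name}: ~{free_by_requests(node):.0f}Mi unreserved, "
          f"app2 fits: {fits(APP2_REQUEST_MI, node)}")
```

Under these assumptions neither node has 700Mi left unreserved, which matches the "Insufficient memory" predicate mismatch in the autoscaler log, even though actual usage on both nodes is under 1Gi.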

-- Ken Chen
Source: StackOverflow