GKE autoscaler is not scaling nodes up past 15 nodes (the former limit)
I've changed the Min and Max values for the cluster's node pool to 17-25. However, the node count is stuck at 14-15 and is not going up. Right now my cluster is full; no more pods can fit in it, so every new deployment should trigger a node scale-up and schedule itself onto the new node, which is not happening.
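For reference, a sketch of the equivalent gcloud command (the cluster name my-cluster is a placeholder; the node pool name and zone are inferred from the instance group URL in the status dump below):

gcloud container clusters update my-cluster \
    --zone europe-west4-b \
    --node-pool adjust-scope \
    --enable-autoscaling --min-nodes 17 --max-nodes 25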
When I create a deployment, its pod is stuck in the Pending state with this message:
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max cluster cpu, memory limit reached
"Max cluster cpu, memory limit reached" sounds like the maximum node count is somehow still 14-15. How is that possible? Why is it not triggering a node scale-up?
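The status below comes from the cluster-autoscaler status ConfigMap; assuming a standard setup, it can be read with:

kubectl describe configmap cluster-autoscaler-status -n kube-system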
ClusterAutoscalerStatus:
apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2020-03-10 10:35:39.899329642 +0000 UTC:
    Cluster-wide:
      Health:    Healthy (ready=14 unready=0 notStarted=0 longNotStarted=0 registered=14 longUnregistered=0)
                 LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                 LastTransitionTime: 2020-03-10 09:49:11.965623459 +0000 UTC m=+4133.007827509
      ScaleUp:   NoActivity (ready=14 registered=14)
                 LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                 LastTransitionTime: 2020-03-10 08:40:47.775200087 +0000 UTC m=+28.817404126
      ScaleDown: NoCandidates (candidates=0)
                 LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                 LastTransitionTime: 2020-03-10 09:49:49.580623718 +0000 UTC m=+4170.622827779

    NodeGroups:
      Name:      https://content.googleapis.com/compute/v1/projects/project/zones/europe-west4-b/instanceGroups/adjust-scope-bff43e09-grp
      Health:    Healthy (ready=14 unready=0 notStarted=0 longNotStarted=0 registered=14 longUnregistered=0 cloudProviderTarget=14 (minSize=17, maxSize=25))
                 LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                 LastTransitionTime: 2020-03-10 09:46:19.45614781 +0000 UTC m=+3960.498351857
      ScaleUp:   NoActivity (ready=14 cloudProviderTarget=14)
                 LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                 LastTransitionTime: 2020-03-10 09:46:19.45614781 +0000 UTC m=+3960.498351857
      ScaleDown: NoCandidates (candidates=0)
                 LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                 LastTransitionTime: 2020-03-10 09:49:49.580623718 +0000 UTC m=+4170.622827779
The deployment is very small (200m CPU, 256Mi memory), so it would easily fit if a new node were added. Note that the node group above reports minSize=17, maxSize=25, yet cloudProviderTarget is stuck at 14. This looks like a bug in the node pool/autoscaler: 15 was my former node count limit, and somehow the autoscaler still seems to think 15 is the ceiling.
EDIT: I created a new node pool with bigger machines and autoscaling turned on in GKE, and after some time the same issue appears, even though the nodes have free resources. Output of kubectl top nodes:
NAME                                                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-infrastructure-n-autoscaled-node--0816b9c6-fm5v   805m         41%    4966Mi          88%
gke-infrastructure-n-autoscaled-node--0816b9c6-h98f   407m         21%    2746Mi          48%
gke-infrastructure-n-autoscaled-node--0816b9c6-hr0l   721m         37%    3832Mi          67%
gke-infrastructure-n-autoscaled-node--0816b9c6-prfw   1020m        52%    5102Mi          90%
gke-infrastructure-n-autoscaled-node--0816b9c6-s94x   946m         49%    3637Mi          64%
gke-infrastructure-n-autoscaled-node--0816b9c6-sz5l   2000m        103%   5738Mi          101%
gke-infrastructure-n-autoscaled-node--0816b9c6-z6dv   664m         34%    4271Mi          75%
gke-infrastructure-n-autoscaled-node--0816b9c6-zvbr   970m         50%    3061Mi          54%
And yet I still get the message 1 max cluster cpu, memory limit reached. This also happens when updating a deployment: the new version's pods sometimes get stuck in Pending because they won't trigger the scale-up.
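The failed scale-up also shows up as events on the pending pods; assuming the standard event reason emitted by the cluster autoscaler, they can be listed with:

kubectl get events --all-namespaces --field-selector reason=NotTriggerScaleUp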
EDIT 2: While describing the cluster with the gcloud command, I found the autoscaling block below.
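The invocation is something like this (the cluster name is a placeholder):

gcloud container clusters describe my-cluster --zone europe-west4-b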
autoscaling:
  autoprovisioningNodePoolDefaults:
    oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccount: default
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '5'
    minimum: '1'
    resourceType: cpu
  - maximum: '5'
    minimum: '1'
    resourceType: memory
How does this interact with autoscaling turned on? Does it refuse to trigger a scale-up once those limits are reached? (The cluster's totals are already above them.)
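If these cluster-wide resource limits are indeed the cap, I assume they could be raised with something like the following (cluster name is a placeholder, limit values are illustrative):

gcloud container clusters update my-cluster \
    --zone europe-west4-b \
    --enable-autoprovisioning \
    --min-cpu 1 --max-cpu 100 \
    --min-memory 1 --max-memory 1000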
Can you please check whether you've reached your project quotas? For example, try to manually create a VM. If it's not quota-related, can you specify the GKE version you use?
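For the quota check, one option (assuming the region from the instance group URL) is:

gcloud compute regions describe europe-west4

which lists per-region quota usage and limits in its quotas section.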