Google Kubernetes: worker pool not scaling down to zero

12/6/2019

I'm setting up a cluster on Google Kubernetes Engine (GKE) to run some heavy jobs. I have a render-pool of big machines that I want to autoscale from 0 to N (using the cluster autoscaler). My default-pool is a cheap g1-small to run the system pods (those never go away, so the default pool can't autoscale to 0, too bad).

My problem is that the render-pool doesn't want to scale down to 0. It has some system pods running on it; are those the problem? The default pool has plenty of resources to run all of them as far as I can tell. I've read the autoscaler FAQ, and it looks like it should delete my node after 10 min of inactivity. I've waited an hour though.
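
For what it's worth, this is roughly how I've been checking which pods are sitting on the render node (RENDER_NODE_NAME is a placeholder for the actual node name):

$ kubectl get pods --all-namespaces -o wide \
   --field-selector spec.nodeName=RENDER_NODE_NAME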

I created the render pool like this:

gcloud container node-pools create render-pool-1 --cluster=test-zero-cluster-2 \
 --disk-size=60 --machine-type=n2-standard-8 --image-type=COS \
 --disk-type=pd-standard --preemptible --num-nodes=1 --max-nodes=3 --min-nodes=0 \
 --enable-autoscaling

The cluster-autoscaler-status configmap says ScaleDown: NoCandidates and it is probing the pool frequently, as it should.
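
(For reference, I'm reading that status with something like the following:)

$ kubectl -n kube-system describe configmap cluster-autoscaler-status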

What am I doing wrong, and how do I debug it? Can I see why the autoscaler doesn't think it can delete the node?

-- GaryO
google-kubernetes-engine
kubernetes

1 Answer

12/10/2019

As pointed out in the comments, some pods, under specific circumstances, will prevent the CA (cluster autoscaler) from scaling a node down.

In GKE, you have logging pods (fluentd), kube-dns, monitoring, etc., all considered system pods. This means that any node where they're scheduled will not be a candidate for scale-down.

Considering this, it all boils down to creating a scenario where all the conditions for scale-down are met.

Since you only want to scale down a specific node pool, I'd use taints and tolerations to keep the system pods in the default pool.

For GKE specifically, you can pick each app by its k8s-app label, for instance:

$ kubectl taint nodes GPU-NODE k8s-app=heapster:NoSchedule

This will prevent Heapster from being scheduled on the tainted nodes. Keep in mind, though, that any workload you do want on those nodes (your render jobs) will then need a matching toleration, as sketched below.
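
As a rough sketch (the key/value here just mirror the taint above; adjust them to whatever taint you actually use), the render jobs would carry something like this in their pod spec:

tolerations:
- key: "k8s-app"
  operator: "Equal"
  value: "heapster"
  effect: "NoSchedule"

Also, if I remember correctly, a taint applied with kubectl only affects the existing node; to have newly autoscaled nodes come up already tainted, GKE lets you set it at the node-pool level with the --node-taints flag when creating the pool.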

It's not recommended, but you can go broader and try to catch all the GKE system pods using the kubernetes.io/cluster-service label instead:

$ kubectl taint nodes GPU-NODE kubernetes.io/cluster-service=true:NoSchedule

Just be careful, as the scope of this label is broader and you'll have to keep track of upcoming changes, since this label may be deprecated someday.
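
To get an idea of what that taint would affect, you can list the system pods currently carrying the label (a quick check, assuming the label is still in use in your cluster version):

$ kubectl get pods -n kube-system -l kubernetes.io/cluster-service=true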

Another thing you might want to consider is using Pod Disruption Budgets (PDBs). This tends to be more effective for stateless workloads, but setting it very tight is likely to cause instability.

The idea of a PDB is to tell GKE the minimum number of pods that must remain available at any given time, allowing the CA to evict the rest. It can be applied to system pods like below:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dns-pdb
  namespace: kube-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns

This tells GKE that, although there are usually 3 replicas of kube-dns, the application can take 2 disruptions and sustain itself temporarily on only 1 replica, allowing the CA to evict these pods and reschedule them on other nodes. Note the PDB has to live in the same namespace as the pods it covers, kube-system in this case.

As you probably noticed, this will put stress on DNS resolution in the cluster (in this particular example), so be careful.
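
A quick way to apply it and sanity-check the allowed disruptions (the filename is just illustrative):

$ kubectl apply -f dns-pdb.yaml
$ kubectl get pdb -n kube-system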

Finally, regarding how to debug the CA: for now, consider that GKE is a managed version of Kubernetes where you don't really have direct access to tweak some features (for better or worse). You cannot set flags on the CA, and access to its logs would have to go through GCP support. The idea is to protect the workloads running in the cluster rather than to optimize for cost.

Scaling down in GKE is more about combining different Kubernetes features until the CA's conditions for scale-down are met.

-- yyyyahir
Source: StackOverflow