Auto-provisioned node pool is not getting cleaned up

11/20/2019

I have a Kubernetes cluster with auto-provisioning enabled on GKE.

gcloud beta container clusters create "some-name" --zone "us-central1-a" \
  --no-enable-basic-auth --cluster-version "1.13.11-gke.14" \
  --machine-type "n1-standard-1" --image-type "COS" \
  --disk-type "pd-standard" --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias \
  --network "projects/default-project/global/networks/default" \
  --subnetwork "projects/default-project/regions/us-central1/subnetworks/default" \
  --default-max-pods-per-node "110" \
  --enable-autoscaling --min-nodes "0" --max-nodes "8" \
  --addons HorizontalPodAutoscaling,KubernetesDashboard \
  --enable-autoupgrade --enable-autorepair \
  --enable-autoprovisioning --min-cpu 1 --max-cpu 40 --min-memory 1 --max-memory 64
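
As a sanity check, the auto-provisioning limits can be read back from the cluster afterwards; the yaml(autoscaling) projection below is just one way to narrow the output and may need adjusting:

gcloud container clusters describe "some-name" --zone "us-central1-a" \
  --format "yaml(autoscaling)"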

I ran a deployment that wouldn't fit on the existing node (which has only 1 CPU).

kubectl run say-lol --image ubuntu:18.04 --requests cpu=4 -- bash -c 'echo lolol && sleep 30'

The auto-provisioner correctly detected that a new node pool was needed; it created one and started running the deployment on it. However, it did not delete the node pool after it was no longer needed.

kubectl delete deployment say-lol

Even after all pods were gone, the new node has been sitting idle for more than 20 hours.
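
The pool still shows up when listing node pools directly (and the node is still there in kubectl, shown below):

gcloud container node-pools list --cluster "some-name" --zone "us-central1-a"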

$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
gke-some-name-default-pool-5003d6ff-pd1p        Ready    <none>   21h   v1.13.11-gke.14
gke-some-name-nap-n1-highcpu-8--585d94be-vbxw   Ready    <none>   21h   v1.13.11-gke.14

$ kubectl get deployments
No resources found in default namespace.

$ kubectl get events
No resources found in default namespace.
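
To see what is still scheduled on the auto-provisioned node (node name from the kubectl get nodes output above), something like this should do:

kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName=gke-some-name-nap-n1-highcpu-8--585d94be-vbxw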

Why isn't it cleaning up the expensive node pool?

-- Andy Carlson
google-cloud-platform
google-kubernetes-engine
kubernetes

3 Answers

11/21/2019

I reproduced this on two of my clusters and found that the culprit is the kube-dns pod. On cluster 1 there was no kube-dns pod on the scaled-up node, and scale-down occurred after deleting say-lol. On cluster 2, kube-dns had been scheduled onto the new node, and because of that pod the node did not scale down.
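
You can see which node kube-dns landed on with something like:

kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide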

Following the cluster autoscaler FAQ entry “How to set PDBs to enable CA to move kube-system pods?”, I created the following PodDisruptionBudget:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns

This PDB allows disruption of the kube-dns pod, which in turn allows the node to scale down. You can check whether disruptions are allowed by running

kubectl get pdb -n kube-system

The ALLOWED DISRUPTIONS column should show a non-zero value for scale-down to work:

NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-dns-pdb   N/A             1                 1                     28m
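
The PDB above can be applied in the usual way (the filename is arbitrary):

kubectl apply -f kube-dns-pdb.yaml

Once disruptions are allowed, the autoscaler can evict kube-dns and the idle node scales down.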
-- dany L
Source: StackOverflow

12/4/2019

In addition to the accepted answer, there is an approach using taints. If the unschedulable pod has any tolerations, the auto-provisioner will create the nodes in the new node pool with matching taints (see the docs). Because the new nodes are tainted, other pods such as kube-dns will not be scheduled onto them, so nothing prevents the pool from scaling down. I find this approach simpler and easier to understand than the PDB approach.
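
For example, here is a sketch of what the say-lol workload could look like with a toleration; the dedicated=say-lol key and value are arbitrary, and the rest mirrors the original kubectl run command:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: say-lol
spec:
  replicas: 1
  selector:
    matchLabels:
      app: say-lol
  template:
    metadata:
      labels:
        app: say-lol
    spec:
      # Arbitrary example toleration; auto-provisioning creates a node pool
      # whose nodes carry the matching dedicated=say-lol:NoSchedule taint.
      tolerations:
      - key: dedicated
        operator: Equal
        value: say-lol
        effect: NoSchedule
      containers:
      - name: say-lol
        image: ubuntu:18.04
        command: ["bash", "-c", "echo lolol && sleep 30"]
        resources:
          requests:
            cpu: "4"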

-- Andy Carlson
Source: StackOverflow

11/21/2019

“When scaling down, cluster autoscaler honors a graceful termination period of 10 minutes for rescheduling the node's Pods onto a different node before forcibly terminating the node.

Occasionally, cluster autoscaler cannot scale down completely and an extra node exists after scaling down. This can occur when required system Pods are scheduled onto different nodes, because there is no trigger for any of those Pods to be moved to a different node.”
Please check the cluster autoscaler FAQ entry “I have a couple of nodes with low utilization, but they are not scaled down. Why?”. To work around this limitation, you can configure a Pod disruption budget for the blocking system pods.

-- Ahmad P
Source: StackOverflow