I have a Kubernetes cluster with auto-provisioning enabled on GKE.
gcloud beta container clusters create "some-name" --zone "us-central1-a" \
--no-enable-basic-auth --cluster-version "1.13.11-gke.14" \
--machine-type "n1-standard-1" --image-type "COS" \
--disk-type "pd-standard" --disk-size "100" \
--metadata disable-legacy-endpoints=true \
--scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
--num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias \
--network "projects/default-project/global/networks/default" \
--subnetwork "projects/default-project/regions/us-central1/subnetworks/default" \
--default-max-pods-per-node "110" \
--enable-autoscaling --min-nodes "0" --max-nodes "8" \
--addons HorizontalPodAutoscaling,KubernetesDashboard \
--enable-autoupgrade --enable-autorepair \
--enable-autoprovisioning --min-cpu 1 --max-cpu 40 --min-memory 1 --max-memory 64
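A quick way to verify what the auto-provisioner has created (not part of the original setup, just a check) is to list the cluster's node pools; the cluster name and zone below match the create command above.
# Auto-provisioned pools show up with a "nap-" prefix in their name.
gcloud container node-pools list --cluster "some-name" --zone "us-central1-a"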
I ran a deployment which wouldn't fit on the existing node (which has 1 CPU).
kubectl run say-lol --image ubuntu:18.04 --requests cpu=4 -- bash -c 'echo lolol && sleep 30'
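Until the new node is ready, the pod sits in Pending; describing it is a simple way (not in the original question) to see the scheduling failure and the autoscaler's reaction:
# kubectl run on 1.13 creates a Deployment whose pods carry the label run=say-lol.
kubectl get pods -l run=say-lol
# The events typically show "Insufficient cpu" followed by a TriggeredScaleUp event.
kubectl describe pod -l run=say-lol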
The auto-provisioner correctly detected that a new node pool was needed, created one, and started running the new deployment on it. However, it did not delete the node pool after it was no longer needed.
kubectl delete deployment say-lol
All pods are gone, yet the new node has been sitting idle for more than 20 hours.
$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
gke-some-name-default-pool-5003d6ff-pd1p        Ready    <none>   21h   v1.13.11-gke.14
gke-some-name-nap-n1-highcpu-8--585d94be-vbxw   Ready    <none>   21h   v1.13.11-gke.14
$ kubectl get deployments
No resources found in default namespace.
$ kubectl get events
No resources found in default namespace.
Why isn't it cleaning up the expensive node pool?
I reproduced this on two of my clusters and found that the culprit was the kube-dns pod. On cluster 1, the scaled-up node had no kube-dns pod on it, and scale-down occurred after deleting say-lol. On cluster 2, a kube-dns pod was running on the secondary node, and that node did not scale down.
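To check where the kube-dns replicas are running (the selector uses the standard k8s-app=kube-dns label, the same one targeted by the manifest below), list them with the node column:
# -o wide adds a NODE column, showing whether a kube-dns pod sits on the nap- node.
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide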
Following the cluster autoscaler FAQ entry "How to set PDBs to enable CA to move kube-system pods?", I used the following manifest:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
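Assuming the manifest above is saved as kube-dns-pdb.yaml (the file name is arbitrary), apply it with:
kubectl apply -f kube-dns-pdb.yaml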
This PDB allows disruption of the kube-dns pod, which in turn allows the node to be drained and scaled down. You can check whether disruptions are allowed by running
kubectl get pdb -n kube-system
ALLOWED DISRUPTIONS must be non-zero for scale-down to work.
NAME           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-dns-pdb   N/A             1                 1                     28m
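Once disruptions are allowed, the autoscaler should drain and remove the idle node after it has been unneeded for about 10 minutes (the default). Watching the node list is a simple way to confirm:
# The gke-some-name-nap-... node should disappear once scale-down completes.
kubectl get nodes -w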
In addition to the accepted answer, there is an approach using taints. If the unschedulable pod has any tolerations, the auto-provisioner will create nodes in the new node pool with matching taints (see the docs). Because the new nodes are tainted, other pods such as kube-dns will not be scheduled onto them, so nothing prevents the node pool from scaling down. I find this approach simpler and easier to understand than the PDB approach; a sketch follows below.
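As a rough sketch of the taint-based approach (the taint key/value dedicated=say-lol is made up for illustration, not something GKE requires), giving the workload a toleration causes the auto-provisioner to create the new node pool with a matching NoSchedule taint, which keeps system pods like kube-dns off those nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: say-lol
spec:
  replicas: 1
  selector:
    matchLabels:
      app: say-lol
  template:
    metadata:
      labels:
        app: say-lol
    spec:
      tolerations:
      # Hypothetical taint; the auto-provisioner taints the new nodes to match.
      - key: dedicated
        operator: Equal
        value: say-lol
        effect: NoSchedule
      containers:
      - name: say-lol
        image: ubuntu:18.04
        command: ["bash", "-c", "echo lolol && sleep 30"]
        resources:
          requests:
            cpu: "4"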
“When scaling down, cluster autoscaler honors a graceful termination period of 10 minutes for rescheduling the node's Pods onto a different node before forcibly terminating the node.
Occasionally, cluster autoscaler cannot scale down completely and an extra node exists after scaling down. This can occur when required system Pods are scheduled onto different nodes, because there is no trigger for any of those Pods to be moved to a different node.”
See the cluster autoscaler FAQ entry “I have a couple of nodes with low utilization, but they are not scaled down. Why?”. To work around this limitation, you can configure a Pod disruption budget.