How to scale a Kubernetes cluster while limiting costs on GCP

5/6/2020

We have a GKE cluster set up on Google Cloud Platform.

We have an activity that requires 'bursts' of computing power.

Imagine that we usually do 100 computations an hour on average, then suddenly we need to be able to process 100,000 in less than two minutes. Most of the time, however, everything is close to idle.

We do not want to pay for idle servers 99% of the time, and want to scale the cluster depending on actual use (no data persistence is needed, so servers can be deleted afterwards). I looked up the Kubernetes documentation on autoscaling, both for adding more pods with HPA and for adding more nodes with the cluster autoscaler.

However, it doesn't seem like either of these solutions would actually reduce our costs or improve performance, because they do not seem to scale past the GCP plan:

Imagine that we have a Google plan with 8 CPUs. My understanding is that if we add more nodes with the cluster autoscaler, then instead of having e.g. 2 nodes using 4 CPUs each we will just have 4 nodes using 2 CPUs each, but the total available computing power will still be 8 CPUs. The same reasoning goes for HPA, with more pods instead of more nodes.

If we have the 8 CPU payment plan but only use 4 of them, my understanding is we still get billed for 8 so scaling down is not really useful.

What we want is for autoscaling to change our payment plan temporarily (imagine from n1-standard-8 to n1-standard-16) and get actual new computing power.

I can't believe we are the only ones with this use case, but I cannot find any documentation on this anywhere! Did I misunderstand something?

-- Xavier Burckel
google-cloud-platform
google-kubernetes-engine
kubernetes

1 Answer

5/8/2020

TL;DR:


GKE Pricing:

  • From GKE Pricing:

    Starting June 6, 2020, GKE will charge a cluster management fee of $0.10 per cluster per hour. The following conditions apply to the cluster management fee:

    • One zonal cluster per billing account is free.
    • The fee is flat, irrespective of cluster size and topology.
    • Billing is computed on a per-second basis for each cluster. The total amount is rounded to the nearest cent, at the end of each month.
  • From Pricing for Worker Nodes:

    GKE uses Compute Engine instances for worker nodes in the cluster. You are billed for each of those instances according to Compute Engine's pricing, until the nodes are deleted. Compute Engine resources are billed on a per-second basis with a one-minute minimum usage cost.

  • Enter the Cluster Autoscaler:

    automatically resize your GKE cluster’s node pools based on the demands of your workloads. When demand is high, cluster autoscaler adds nodes to the node pool. When demand is low, cluster autoscaler scales back down to a minimum size that you designate. This can increase the availability of your workloads when you need it, while controlling costs.
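To get a feel for what per-second billing with a one-minute minimum means for a burst workload, here is a minimal shell sketch. The function name burst_cost and the 0.40 USD hourly rate are placeholders made up for illustration, not real prices; it also ignores disks, network traffic and the cluster management fee.

```shell
# burst_cost <nodes> <seconds_per_node> <hourly_rate_usd>
# Prints an estimated Compute Engine charge in USD for a group of identical
# nodes. The hourly rate is an argument on purpose: look up the real figure
# for your machine type and region before trusting the result.
burst_cost() {
  awk -v n="$1" -v s="$2" -v r="$3" 'BEGIN {
    if (s > 0 && s < 60) s = 60   # one-minute minimum per instance
    printf "%.4f\n", n * (s / 3600) * r
  }'
}

# Hypothetical rate of 0.40 USD per node-hour:
burst_cost 4 600 0.40       # four nodes alive for ten minutes each
burst_cost 4 2592000 0.40   # the same four nodes left running for 30 days
```

With these made-up numbers the burst costs well under one dollar, while keeping the same four nodes up for a month costs over a thousand. That gap is what scaling the pool back to zero is meant to recover.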


  • Cluster Autoscaler cannot scale the entire cluster to zero, at least one node must always be available in the cluster to run system pods.
  • Since you already have a persistent workload, this won't be a problem. What we will do is create a new node pool:

    A node pool is a group of nodes within a cluster that all have the same configuration. Every cluster has at least one default node pool, but you can add other node pools as needed.

  • For this example I'll create two node pools:

    • A default node pool with a fixed size of one node with a small instance size (emulating the cluster you already have).
    • A second node pool with more compute power to run the jobs (I'll call it power-pool).
      • Choose the machine type with the power you need to run your AI jobs; for this example I'll create an n1-standard-8.
      • This power-pool will have autoscaling set to allow max 4 nodes, minimum 0 nodes.
      • If you would like to add GPUs, check this great guide: Scale to almost zero + GPUs.

Taints and Tolerations:

  • Only the jobs related to the AI workload will run on the power-pool, for that use a node selector in the job pods to make sure they run in the power-pool nodes.
  • Set an anti-affinity rule to ensure that two of your training pods cannot be scheduled on the same node (optimizing the price-performance ratio; this is optional depending on your workload).
  • Add a taint to the power-pool to prevent other workloads (and system resources) from being scheduled on the autoscalable pool.
  • Add the tolerations to the AI Jobs to let them run on those nodes.

Reproduction:

  • Create the Cluster with the persistent default-pool:
PROJECT_ID="YOUR_PROJECT_ID"  
GCP_ZONE="CLUSTER_ZONE"  
GKE_CLUSTER_NAME="CLUSTER_NAME"  
AUTOSCALE_POOL="power-pool"  

gcloud container clusters create ${GKE_CLUSTER_NAME} \
--machine-type="n1-standard-1" \
--num-nodes=1 \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}
  • Create the auto-scale pool:
gcloud container node-pools create ${AUTOSCALE_POOL} \
--cluster=${GKE_CLUSTER_NAME} \
--machine-type=n1-standard-8 \
--node-labels=load=on-demand \
--node-taints=reserved-pool=true:NoSchedule \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=4 \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}
  • Note about parameters:

    • --node-labels=load=on-demand: Add a label to the nodes in the power pool to allow selecting them in our AI job using a node selector.
    • --node-taints=reserved-pool=true:NoSchedule: Add a taint to the nodes to prevent any other workload from accidentally being scheduled in this node pool.
  • Here you can see the two pools we created, the static pool with 1 node and the autoscalable pool with 0-4 nodes.

[screenshot]

Since there is no workload running on the autoscalable node pool, it shows 0 nodes running (and there is no charge for the pool while it has no nodes).
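On the original worry that autoscaling only reshuffles a fixed amount of CPU: each node the autoscaler adds is a new VM, so the pool's ceiling is the per-node vCPU count times --max-nodes. A small sketch of that arithmetic follows. The helper name pool_max_vcpus is made up, and it assumes the machine type name ends in its vCPU count, which holds for names like n1-standard-8 but not for shared-core or custom machine types.

```shell
# pool_max_vcpus <machine_type> <max_nodes>
# Assumption: the machine type name ends in "-<vCPU count>".
pool_max_vcpus() (
  vcpus="${1##*-}"        # "n1-standard-8" -> "8"
  echo $(( vcpus * $2 ))
)

pool_max_vcpus n1-standard-8 4   # the power-pool created above
```

For the power-pool in this example that gives 32 vCPUs at full scale and 0 when idle, in addition to the single n1-standard-1 node in the default pool.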

  • Now we'll create a job that creates 4 parallel pods that run for 5 minutes.
    • This job will have the following parameters to differentiate it from normal pods:
    • parallelism: 4: to use all 4 nodes to enhance performance.
    • nodeSelector.load: on-demand: to assign the pods to the nodes with that label.
    • podAntiAffinity: to declare that we do not want two pods with the same label app: greedy-app running on the same node (optional).
    • tolerations: to match the taint that we attached to the nodes, so these pods are allowed to be scheduled on these nodes.
apiVersion: batch/v1  
kind: Job  
metadata:  
  name: greedy-job  
spec:  
  parallelism: 4  
  template:  
    metadata:  
      name: greedy-job  
      labels: 
        app: greedy-app  
    spec:  
      containers:  
      - name: busybox  
        image: busybox  
        args:  
        - sleep  
        - "300"  
      nodeSelector: 
        load: on-demand 
      affinity:  
        podAntiAffinity:  
          requiredDuringSchedulingIgnoredDuringExecution:  
          - labelSelector:  
              matchExpressions:  
              - key: app  
                operator: In  
                values:  
                - greedy-app  
            topologyKey: "kubernetes.io/hostname"  
      tolerations:  
      - key: reserved-pool  
        operator: Equal  
        value: "true"  
        effect: NoSchedule  
      restartPolicy: OnFailure
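Because the required anti-affinity rule allows only one of these pods per node, a parallelism value above the pool's --max-nodes would leave the extra pods Pending until another pod finishes. The sketch below is one way to catch that before applying the manifest. The function name check_parallelism is made up, and it assumes the manifest contains a single "parallelism:" line like the one above.

```shell
# check_parallelism <manifest.yaml> <max_nodes>
# Compares the Job's parallelism with the node pool's autoscaling maximum.
check_parallelism() (
  p=$(awk '$1 == "parallelism:" { print $2; exit }' "$1")
  if [ -z "$p" ]; then
    echo "no parallelism field found in $1"
    exit 2
  fi
  if [ "$p" -le "$2" ]; then
    echo "ok: $p pods fit on at most $2 nodes"
  else
    echo "warning: $p pods but only $2 nodes; $((p - $2)) will wait in Pending"
    exit 1
  fi
)

# Usage (not run here, the file name is the one used in this answer):
#   check_parallelism greedyjob.yaml 4
```

If you drop the anti-affinity rule, several pods can share a node and this check no longer applies.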
  • Now that our cluster is in standby we will use the job yaml we just created (I'll call it greedyjob.yaml). This job will run four processes that will run in parallel and that will complete after about 5 minutes.
$ kubectl get nodes
NAME                                                  STATUS   ROLES    AGE   VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready    <none>   42m   v1.14.10-gke.27

$ kubectl get pods
No resources found in default namespace.

$ kubectl apply -f greedyjob.yaml 
job.batch/greedy-job created

$ kubectl get pods
NAME               READY   STATUS    RESTARTS   AGE
greedy-job-2xbvx   0/1     Pending   0          11s
greedy-job-72j8r   0/1     Pending   0          11s
greedy-job-9dfdt   0/1     Pending   0          11s
greedy-job-wqct9   0/1     Pending   0          11s
  • Our job was applied, but its pods are Pending. Let's see what's going on in those pods:
$ kubectl describe pod greedy-job-2xbvx
...
Events:
  Type     Reason            Age                From                Message
  ----     ------            ----               ----                -------
  Warning  FailedScheduling  28s (x2 over 28s)  default-scheduler   0/1 nodes are available: 1 node(s) didn't match node selector.
  Normal   TriggeredScaleUp  23s                cluster-autoscaler  pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/owilliam/zones/us-central1-b/instanceGroups/gke-autoscale-to-zero-clus-power-pool-564148fd-grp 0->1 (max: 4)}]
  • The pod can't be scheduled on the current node due to the rules we defined, which triggers a scale-up on our power-pool. This is a very dynamic process: after about 90 seconds the first node is up and running:
$ kubectl get pods
NAME               READY   STATUS              RESTARTS   AGE
greedy-job-2xbvx   0/1     Pending             0          93s
greedy-job-72j8r   0/1     ContainerCreating   0          93s
greedy-job-9dfdt   0/1     Pending             0          93s
greedy-job-wqct9   0/1     Pending             0          93s

$ kubectl get nodes
NAME                                                  STATUS   ROLES    AGE   VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready    <none>   44m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw   Ready    <none>   11s   v1.14.10-gke.27
  • Since we set pod anti-affinity rules, the second pod can't be scheduled on the node that was just brought up, so it triggers the next scale-up. Take a look at the events on that pod:
$ kubectl describe pod greedy-job-2xbvx
...
Events:
  Type     Reason            Age                  From                Message
  ----     ------            ----                 ----                -------
  Normal   TriggeredScaleUp  2m45s                cluster-autoscaler  pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/owilliam/zones/us-central1-b/instanceGroups/gke-autoscale-to-zero-clus-power-pool-564148fd-grp 0->1 (max: 4)}]
  Warning  FailedScheduling  93s (x3 over 2m50s)  default-scheduler   0/1 nodes are available: 1 node(s) didn't match node selector.
  Warning  FailedScheduling  79s (x3 over 83s)    default-scheduler   0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taints that the pod didn't tolerate.
  Normal   TriggeredScaleUp  62s                  cluster-autoscaler  pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/owilliam/zones/us-central1-b/instanceGroups/gke-autoscale-to-zero-clus-power-pool-564148fd-grp 1->2 (max: 4)}]
  Warning  FailedScheduling  3s (x3 over 68s)     default-scheduler   0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.
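The event list is noisy, so here is a small sketch that keeps only the scale-up steps. The function name scale_up_steps is made up, and it assumes the event message format shown above ("0->1 (max: 4)"), which may differ between GKE versions.

```shell
# scale_up_steps: reads `kubectl describe pod` output on stdin and prints one
# "from->to (max: N)" line per TriggeredScaleUp event.
scale_up_steps() {
  grep 'TriggeredScaleUp' | grep -o '[0-9][0-9]*->[0-9][0-9]* (max: [0-9][0-9]*)'
}

# Usage (not run here):
#   kubectl describe pod greedy-job-2xbvx | scale_up_steps
```

Run against the events above, this would print the 0->1 and 1->2 transitions and drop the FailedScheduling warnings.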
  • The same process repeats until all requirements are satisfied:
$ kubectl get pods
NAME               READY   STATUS    RESTARTS   AGE
greedy-job-2xbvx   0/1     Pending   0          3m39s
greedy-job-72j8r   1/1     Running   0          3m39s
greedy-job-9dfdt   0/1     Pending   0          3m39s
greedy-job-wqct9   1/1     Running   0          3m39s

$ kubectl get nodes
NAME                                                  STATUS   ROLES    AGE     VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready    <none>   46m     v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw   Ready    <none>   2m16s   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-sf6q   Ready    <none>   28s     v1.14.10-gke.27

$ kubectl get pods
NAME               READY   STATUS    RESTARTS   AGE
greedy-job-2xbvx   0/1     Pending   0          5m19s
greedy-job-72j8r   1/1     Running   0          5m19s
greedy-job-9dfdt   1/1     Running   0          5m19s
greedy-job-wqct9   1/1     Running   0          5m19s

$ kubectl get nodes
NAME                                                  STATUS   ROLES    AGE     VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready    <none>   48m     v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2   Ready    <none>   63s     v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw   Ready    <none>   4m8s    v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-sf6q   Ready    <none>   2m20s   v1.14.10-gke.27

$ kubectl get pods
NAME               READY   STATUS    RESTARTS   AGE
greedy-job-2xbvx   1/1     Running   0          6m12s
greedy-job-72j8r   1/1     Running   0          6m12s
greedy-job-9dfdt   1/1     Running   0          6m12s
greedy-job-wqct9   1/1     Running   0          6m12s

$ kubectl get nodes
NAME                                                  STATUS   ROLES    AGE     VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready    <none>   48m     v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2   Ready    <none>   113s    v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv   Ready    <none>   26s     v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw   Ready    <none>   4m58s   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-sf6q   Ready    <none>   3m10s   v1.14.10-gke.27

[screenshots]

Here we can see that all nodes are now up and running (and therefore being billed per second).

  • Now all the pods are running. After a few minutes they complete their tasks:
$ kubectl get pods
NAME               READY   STATUS      RESTARTS   AGE
greedy-job-2xbvx   1/1     Running     0          7m22s
greedy-job-72j8r   0/1     Completed   0          7m22s
greedy-job-9dfdt   1/1     Running     0          7m22s
greedy-job-wqct9   1/1     Running     0          7m22s

$ kubectl get pods
NAME               READY   STATUS      RESTARTS   AGE
greedy-job-2xbvx   0/1     Completed   0          11m
greedy-job-72j8r   0/1     Completed   0          11m
greedy-job-9dfdt   0/1     Completed   0          11m
greedy-job-wqct9   0/1     Completed   0          11m
  • Once the task is completed, the autoscaler starts downsizing the cluster.
  • You can learn more about the rules for this process here: GKE Cluster AutoScaler
$ while true; do kubectl get nodes ; sleep 60; done
NAME                                                  STATUS   ROLES    AGE     VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready    <none>   54m     v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2   Ready    <none>   7m26s   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv   Ready    <none>   5m59s   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw   Ready    <none>   10m     v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-sf6q   Ready    <none>   8m43s   v1.14.10-gke.27

NAME                                                  STATUS     ROLES    AGE   VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready      <none>   62m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2   Ready      <none>   15m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv   Ready      <none>   14m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw   Ready      <none>   18m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-sf6q   NotReady   <none>   16m   v1.14.10-gke.27
  • Once the conditions are met, the autoscaler flags nodes as NotReady and starts removing them:
NAME                                                  STATUS     ROLES    AGE   VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready      <none>   64m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2   NotReady   <none>   17m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv   NotReady   <none>   16m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw   Ready      <none>   20m   v1.14.10-gke.27

NAME                                                  STATUS     ROLES    AGE   VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready      <none>   65m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2   NotReady   <none>   18m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv   NotReady   <none>   17m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw   NotReady   <none>   21m   v1.14.10-gke.27

NAME                                                  STATUS     ROLES    AGE   VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready      <none>   66m   v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv   NotReady   <none>   18m   v1.14.10-gke.27

NAME                                                  STATUS   ROLES    AGE   VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb   Ready    <none>   67m   v1.14.10-gke.27
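The watch loop above runs forever. If you want a script to block until the pool is back to zero, a bounded variant could look like the sketch below. The function name wait_until_empty is made up, and it assumes that kubectl prints nothing on stdout when no nodes match (the "No resources found" message going to stderr), which is worth verifying on your kubectl version.

```shell
# wait_until_empty <max_checks> <interval_seconds> <command...>
# Re-runs <command...> until it prints nothing on stdout, or gives up.
wait_until_empty() {
  checks="$1"; interval="$2"; shift 2
  i=0
  while [ "$i" -lt "$checks" ]; do
    if [ -z "$("$@")" ]; then
      echo "empty after $i retries"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "still not empty after $checks checks"
  return 1
}

# Usage (not run here): poll once a minute for up to 30 minutes, using the
# load=on-demand label that was set on the power-pool nodes.
#   wait_until_empty 30 60 kubectl get nodes -l load=on-demand --no-headers
```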

  • Here is the confirmation that the nodes were removed from GKE and from the VM list (remember that every node is a virtual machine billed as Compute Engine):

Compute Engine: (note that gke-cluster-1-default-pool is from another cluster; I added it to the screenshot to show you that there is no node from cluster gke-autoscale-to-zero other than the default persistent one.) [screenshot]

GKE: [screenshot]


Final Thoughts:

When scaling down, cluster autoscaler respects scheduling and eviction rules set on Pods. These restrictions can prevent a node from being deleted by the autoscaler; the GKE Cluster AutoScaler documentation linked above lists the Pod conditions that block a node's deletion.

An application's PodDisruptionBudget can also prevent autoscaling; if deleting nodes would cause the budget to be exceeded, the cluster does not scale down.

Note that the process is really fast: in our example it took around 90 seconds to bring up a node and about 5 minutes to finish removing an idle one, which is a HUGE improvement in your billing compared with keeping those nodes running all the time.

  • Preemptible VMs can reduce your bill even further, but you will have to consider the kind of workload you are running:

Preemptible VMs are Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. Preemptible VMs are priced lower than standard Compute Engine VMs and offer the same machine types and options.
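If your jobs tolerate being interrupted and restarted, the burst pool can be created with preemptible nodes by adding the --preemptible flag to the node pool command used earlier. The sketch below only prints the command so you can review it before running it. The function name preemptible_pool_cmd is made up, and the machine type, label, taint and limits are simply copied from the example above.

```shell
# preemptible_pool_cmd <pool> <cluster> <zone> <project>
# Prints (does not run) a gcloud command for a preemptible autoscaling pool.
preemptible_pool_cmd() {
  cat <<EOF
gcloud container node-pools create $1 \\
  --cluster=$2 \\
  --machine-type=n1-standard-8 \\
  --preemptible \\
  --node-labels=load=on-demand \\
  --node-taints=reserved-pool=true:NoSchedule \\
  --enable-autoscaling \\
  --min-nodes=0 \\
  --max-nodes=4 \\
  --zone=$3 \\
  --project=$4
EOF
}

# Usage (not run here): review the output, then pipe it to sh to execute it.
#   preemptible_pool_cmd power-pool "$GKE_CLUSTER_NAME" "$GCP_ZONE" "$PROJECT_ID"
```

The job above already uses restartPolicy: OnFailure, but a preempted node takes its pod with it, so the work itself should be safe to re-run.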

I know you are still considering the best architecture for your app.

App Engine and AI Platform are good options as well, but since you are currently running your workload on GKE I wanted to show you an example as requested.

If you have any further questions let me know in the comments.

-- willrof
Source: StackOverflow