GKE control plane warm-up

5/17/2021

I'm using Terraform v0.15.0 to deploy a zonal GKE cluster with 3 separate node pools, which are created right after the cluster. The initial node pool is created with 1 node and is deleted once the cluster is ready. Then I use the Helm provider to deploy workloads: first I deploy the Nginx Ingress chart. These two steps work fine. Right after the Nginx Ingress chart, I deploy our application, which consists of 16 Deployments/StatefulSets and related Services, PVCs, PVs... That step is marked as finished after 10s and everything looks good, but there are no workloads in the cluster except for the Nginx Ingress deployment. Instead, I see a REPAIR_CLUSTER operation in the running state:

% gcloud container operations list
NAME                              TYPE              LOCATION           TARGET                    STATUS_MESSAGE  STATUS  START_TIME                      END_TIME
operation-1621007628555-9d3ea30c  REPAIR_CLUSTER    europe-west1       <project_name_here>                       RUNNING    2021-05-14T15:53:48.555309083Z

Cluster status is RECONCILING:

% gcloud container clusters list
NAME                      LOCATION           MASTER_VERSION    MASTER_IP       MACHINE_TYPE   NODE_VERSION        NUM_NODES  STATUS
<cluster_name_here>       europe-west1-b     1.17.17-gke.4900  <ip_here>       e2-standard-2  1.16.15-gke.7800    6          RECONCILING

I already tried:

  • setting auto_repair and auto_upgrade to false;
  • deploying a regional cluster;
  • using the UNSPECIFIED release channel, which keeps the Kubernetes version fixed (no auto-updates).

Once the REPAIR_CLUSTER operation finishes, the cluster is ready to handle my workload, but in the meantime it makes my pipeline fail. So my question is: how can I avoid that operation, or how can I warm up Google's resources in advance?
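
A possible stopgap, sketched here under assumptions rather than as a confirmed fix: a pipeline step between the Nginx Ingress release and the application release could poll the cluster status until it goes from RECONCILING back to RUNNING. The function names, the cluster name, and the retry and delay values below are hypothetical placeholders. The only gcloud call used is `gcloud container clusters describe` with `--format="value(status)"`. This does not prevent REPAIR_CLUSTER; it only delays the next step until the operation is over.

```shell
# Sketch only: wait until the GKE cluster reports RUNNING before deploying.
# Function names, arguments, and timing values are illustrative placeholders.

get_cluster_status() {
  # Prints only the cluster's status field (e.g. RUNNING, RECONCILING).
  gcloud container clusters describe "$1" --zone "$2" --format="value(status)"
}

wait_for_cluster() {
  # Usage: wait_for_cluster CLUSTER ZONE [MAX_ATTEMPTS] [DELAY_SECONDS]
  wfc_cluster="$1"
  wfc_zone="$2"
  wfc_attempts="${3:-60}"
  wfc_delay="${4:-30}"
  wfc_i=1
  while [ "$wfc_i" -le "$wfc_attempts" ]; do
    wfc_status="$(get_cluster_status "$wfc_cluster" "$wfc_zone")"
    if [ "$wfc_status" = "RUNNING" ]; then
      echo "cluster $wfc_cluster is RUNNING"
      return 0
    fi
    echo "attempt $wfc_i/$wfc_attempts: status is ${wfc_status:-unknown}, retrying in ${wfc_delay}s"
    sleep "$wfc_delay"
    wfc_i=$((wfc_i + 1))
  done
  echo "timed out waiting for cluster $wfc_cluster" >&2
  return 1
}

# Example call (placeholder names):
#   wait_for_cluster my-cluster europe-west1-b 60 30
```

Under Terraform, a step like this would have to run between the two Helm releases, for example as a separate pipeline stage that splits the apply in two. Whether that fits depends on how the pipeline is structured.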

-- Vi Kan
google-cloud-platform
google-kubernetes-engine
kubernetes
kubernetes-helm
terraform

0 Answers