(How) do node pool autoupgrades in GKE actually work?

2/26/2019

We have a fairly large Kubernetes deployment on GKE, and we wanted to make our lives a little easier by enabling auto-upgrades. The documentation on the topic tells you how to enable it, but not how it actually works.

We enabled the feature on a test cluster, but no nodes were ever upgraded (although the UI kept nagging us that "upgrades are available").

The docs say it would be updated to the "latest stable" version and that this occurs "at regular intervals at the discretion of the GKE team", neither of which is terribly helpful.

The UI always says: "Next auto-upgrade: Not scheduled"

Has anyone used this feature in production who can shed some light on what it actually does?

What I did:

  • I enabled the feature on the nodepools (not the cluster itself)
  • I set up a maintenance window
  • Cluster version was 1.11.7-gke.3
  • Nodepools had version 1.11.5-gke.X
  • The newest available version was 1.11.7-gke.6
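
For reference, the setup listed above roughly corresponds to the gcloud calls below. This is only a sketch and not necessarily the commands used for this question: the function names are made up for illustration, and the cluster, node pool, and zone arguments are placeholders you would supply yourself.

```shell
# Hypothetical helpers; arguments are: cluster name, node pool name, zone.

# Turn on auto-upgrade for a single node pool.
enable_pool_autoupgrade() {
  gcloud container node-pools update "$2" \
    --cluster "$1" --zone "$3" --enable-autoupgrade
}

# Set the daily maintenance window start time on the cluster.
# Arguments: cluster name, zone, start time as HH:MM.
set_maintenance_window() {
  gcloud container clusters update "$1" \
    --zone "$2" --maintenance-window "$3"
}

# Print the node pool's current version and its auto-upgrade setting.
show_pool_state() {
  gcloud container node-pools describe "$2" \
    --cluster "$1" --zone "$3" \
    --format "value(version,management.autoUpgrade)"
}
```

Running show_pool_state after enabling the feature is a quick way to confirm that the setting was applied to the node pool and to see which version it currently runs.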

What I expected:

  • The nodepool would be updated to either 1.11.7-gke.3 (the default cluster version) or 1.11.7-gke.6 (the most recent version)
  • The update would happen in the next maintenance window
  • The update would otherwise work like a "manual" update

What actually happened:

  • Nothing
  • The nodepools remained on 1.11.5-gke.X for more than a week

My questions

  • Is the nodepool version supposed to update?
  • If so, at what time?
  • If so, to what version?
-- averell
google-kubernetes-engine
kubernetes

3 Answers

4/22/2020

I wanted to share two other possibilities as to why a node-pool may not be auto-upgrading or scheduled to upgrade.

One of our projects was having a similar issue where the master version had auto-upgraded to 1.14.10-gke.27 but our node-pool stayed stuck at 1.14.10-gke.24 for over a month.

Reaching a node quota

The node-pool upgrade might be failing due to a node quota (although I'm not sure the web console would say Next auto-upgrade: Not scheduled in that case). The node upgrades documentation suggests running the following to view any failed upgrade operations:

gcloud container operations list --filter="STATUS=DONE AND TYPE=UPGRADE_NODES AND targetLink:https://container.googleapis.com/v1/projects/[PROJECT_ID]/zones/[ZONE]/clusters/[CLUSTER_NAME]"
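
To avoid editing the bracketed placeholders by hand, the same command can be wrapped in small helper functions. This is a sketch: the function names are invented for illustration, and the filter string is simply the one from the documentation snippet above with the placeholders filled in from arguments.

```shell
# Build the documented filter for one cluster.
# Arguments: project ID, zone, cluster name.
upgrade_ops_filter() {
  printf 'STATUS=DONE AND TYPE=UPGRADE_NODES AND targetLink:https://container.googleapis.com/v1/projects/%s/zones/%s/clusters/%s' \
    "$1" "$2" "$3"
}

# List finished node upgrade operations for that cluster.
list_upgrade_ops() {
  gcloud container operations list \
    --filter="$(upgrade_ops_filter "$1" "$2" "$3")"
}

# Show the details of one operation from the list.
# Arguments: operation name (NAME column), zone.
describe_upgrade_op() {
  gcloud container operations describe "$1" --zone "$2"
}
```

Describing a single operation is usually where an error such as an exhausted quota would show up, if the upgrade was attempted at all.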

Automatic node upgrades are for minor+ versions only

After exhausting my troubleshooting steps, I reached out to GCP Support and opened a case (Case 23113272 for anyone working at Google). They told me the following:

Automatic node upgrade: The node version could not necessary upgrade automatically, let me explain, exists three upgrades in a node: Minor versions (1.X), Patch releases (1.X.Y) and Security updates and bug fixes (1.X.Y-gke.N), please take a look at this documentation [2] the automatic node upgrade works from a minor version and in your case the upgrade was a security update that can't upgrade automatically.

I replied and they confirmed that automatic node upgrades will only happen for minor versions and above. I have asked them to submit a request to update their documentation because (at the time of this response) this is not outlined anywhere in their node auto-upgrade documentation.
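
To make the three levels from the support reply concrete, here is a small sketch that classifies the difference between two GKE version strings as a minor version change (1.X), a patch release (1.X.Y), or a -gke.N revision. The function name is hypothetical, and the code only compares strings; it does not claim to reproduce how GKE itself decides what to upgrade.

```shell
# Classify how two GKE version strings differ.
# Prints one of: same, minor, patch, gke-revision.
classify_upgrade() {
  from="$1"
  to="$2"
  # "1.14.10-gke.24" -> "1.14" (major.minor)
  from_minor="$(echo "$from" | cut -d. -f1,2)"
  to_minor="$(echo "$to" | cut -d. -f1,2)"
  # "1.14.10-gke.24" -> "1.14.10" (everything before the first "-")
  from_patch="${from%%-*}"
  to_patch="${to%%-*}"
  if [ "$from" = "$to" ]; then
    echo same
  elif [ "$from_minor" != "$to_minor" ]; then
    echo minor
  elif [ "$from_patch" != "$to_patch" ]; then
    echo patch
  else
    echo gke-revision
  fi
}
```

With the versions from our case, classify_upgrade 1.14.10-gke.24 1.14.10-gke.27 prints gke-revision, which is the kind of change support said would not be applied automatically.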

-- Kyle
Source: StackOverflow

4/16/2019

I'll finally answer this myself. The auto-upgrade does work, though it took several days to a week until the version was upgraded.

There is no indication of the planned upgrade date, or any feedback other than the version updating.

It will upgrade to the current master version of the cluster.

Addition: It still doesn't work reliably, and there is still no way to debug it when it doesn't. One piece of information I got was that the mechanism does not work if you initially provided a specific version for the node pool. As it is not possible to deduce the inner workings of the auto-upgrades, we had to resort to manually checking the status again.
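
The manual check can be scripted along these lines. This is a rough sketch rather than an official health check: the function name is made up, and it only compares the master version with the node pool version as reported by gcloud.

```shell
# Report whether a node pool is on the same version as the cluster master.
# Arguments: cluster name, node pool name, zone.
pool_version_report() {
  master="$(gcloud container clusters describe "$1" \
    --zone "$3" --format "value(currentMasterVersion)")"
  pool="$(gcloud container node-pools describe "$2" \
    --cluster "$1" --zone "$3" --format "value(version)")"
  if [ "$master" = "$pool" ]; then
    echo "up to date: $pool"
  else
    echo "behind: pool=$pool master=$master"
  fi
}
```

Run from a scheduled job, something like this at least tells you when a pool has been lagging behind the master, even though it cannot tell you why.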

-- averell
Source: StackOverflow

2/27/2019

This feature replaces the VMs (Kubernetes nodes) in your node pool running the "old" Kubernetes version with VMs running the "new" version.

The node pool "upgrade" operation is done in a rolling fashion: It's not like GKE deletes all your VMs and recreates them simultaneously (except when you have only 1 node in your cluster). By default, the nodes are replaced with newer nodes one-by-one (although this might change).

GKE internally uses mostly the features of managed instance groups to manage operations on node pools.

You can find documentation on how to schedule node upgrades by specifying certain "maintenance windows" so you are impacted minimally. (This article also gives a bit more insight into how upgrades happen.)

That said, you can disable auto-upgrades and upgrade your cluster manually (although this is not recommended). Some GKE users have thousands of nodes, so for them, upgrading VMs one by one is not feasible.

For that GKE offers an option that lets you choose "how many nodes are upgraded at a time":

gcloud container clusters upgrade CLUSTER_NAME \
    --concurrent-node-count=CONCURRENT_NODE_COUNT

Documentation of this flag says:

The number of nodes to upgrade concurrently. Valid values are [1, 20]. It is a recommended best practice to set this value to no higher than 3% of your cluster size.
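
As a worked example of that guidance, the sketch below computes 3% of the cluster size (rounded down) and clamps the result to the valid range [1, 20]. The function name is hypothetical, and rounding down is an assumption made here; the flag documentation only says "no higher than 3%".

```shell
# Suggest a value for --concurrent-node-count.
# Argument: total number of nodes in the cluster.
recommended_concurrency() {
  n=$(( $1 * 3 / 100 ))   # 3% of the cluster, rounded down
  if [ "$n" -lt 1 ]; then
    n=1                    # the flag does not accept values below 1
  fi
  if [ "$n" -gt 20 ]; then
    n=20                   # the flag does not accept values above 20
  fi
  echo "$n"
}
```

For a 200-node cluster, recommended_concurrency 200 prints 6. For clusters smaller than 34 nodes it prints 1, since 3% rounds down to zero and the minimum valid value is 1.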

-- AhmetB - Google
Source: StackOverflow