How to scale up a Kubernetes cluster with Terraform while avoiding downtime?

2/9/2022

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one Nodepool with 3 nodes which are fairly low on resources because we still don't have that many active users/requests simultaneously.

Our backend API app is running on three pods, one on each node. I was told I will need to increase resources soon (I'm thinking more memory or even replacing the VMs of the nodes with better ones).

We structured everything Kubernetes-related using Terraform, and I know that replacing the VMs of a node pool is a destructive action, meaning the cluster will have to be replaced, and the new config and all deployments, services, etc. will have to be reapplied.

I am fairly new to the Kubernetes and Terraform world, meaning I can do the basics to get an application up and running, but I would like to learn the best practices when it comes to scaling and performance. How can I perform such an increase in resources without any downtime of our services?

I'm wondering if having an extra Nodepool would help while I replace the VMs of the other one (I might be absolutely wrong here).

If there's any link, course, tutorial you can point me to it's highly appreciated.

-- everspader
azure
azure-aks
kubernetes
terraform

1 Answer

2/9/2022

(Moved from comments)

In Azure, when you perform a cluster upgrade, there's a parameter called "max surge", which is equal to 1 by default. It means that when you update your cluster or node configuration, AKS first creates one extra node with the updated configuration, and only then safely drains and removes one of the old ones. More on this here: Azure - Node Surge Upgrade
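
Since the question uses Terraform, here is a minimal sketch of how the surge setting can be declared with the azurerm provider. The resource names (`apps`, `example`), the VM size, and the node count are assumptions for illustration only and do not come from the question; `azurerm_kubernetes_cluster.example` is assumed to be defined elsewhere in your configuration.

```hcl
# Hypothetical node pool; names and sizes are placeholders.
resource "azurerm_kubernetes_cluster_node_pool" "apps" {
  name                  = "apps"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.example.id # assumed existing cluster
  vm_size               = "Standard_D4s_v3"                     # assumed VM size
  node_count            = 3

  upgrade_settings {
    # Number (or percentage, e.g. "33%") of extra nodes created
    # during an upgrade before old nodes are drained and removed.
    max_surge = "1"
  }
}
```

It is worth running `terraform plan` before applying any change, to see whether Terraform intends to update the pool in place or to replace it.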

-- Vorgashor
Source: StackOverflow