Azure AKS - This container service is in a failed state

5/17/2021

I recently upgraded my AKS cluster from 1.16.x to 1.18.17 (a jump of two versions). I did the upgrade using the Azure Portal, not the CLI.

The upgrade itself worked: I can see my cluster is now on version 1.18.17, and at first glance everything seems to be working as expected. However, this message is displayed at the top of the Overview panel:

This container service is in a failed state. Click here to go to diagnose and solve problems.

With the cluster in this state I can't scale, or upgrade, as I get an error telling me the operation isn't available whilst the cluster is upgrading, or in a failed state.

The supporting page the error links to doesn't give me any useful information. It doesn't even mention the fact my cluster is in a failed state.

I've seen this error once before when I was approaching the limit of our VM Compute quota. At the moment though, I am only using 10%, and I don't have enough pods and nodes to push it over. The only other quotas which are maxed are network watchers and I don't think that's related.

The scaling operation links to this support document: aka.ms/aks-cluster-failed, and the suggestion there is about quota sizes, which I have already tried.

I'm really scratching my head with this one. I can't find any useful support documents, blog posts or other questions, so any help would be greatly appreciated!

-- Jim
azure
azure-aks
kubernetes

1 Answer

5/17/2021

Answering my own question in the hope it can help others, or myself in the future.

I managed to get more information about the error by running the upgrade with the Azure CLI; see "Upgrade an Azure Kubernetes Service (AKS) cluster".
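For anyone wanting to try the same thing, here is a rough sketch of what that can look like. The resource group and cluster names are placeholders (not from my setup), and the helper function names are just made up for illustration; `az aks show` and `az aks upgrade` are the actual Azure CLI commands. I'm assuming you are already logged in with `az login` and on the right subscription.

```shell
# Hypothetical names: replace with your own resource group and cluster.
RESOURCE_GROUP="my-resource-group"
CLUSTER_NAME="my-aks-cluster"

# Print the cluster's provisioning state as a single word.
aks_state() {
  az aks show \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --query provisioningState \
    --output tsv
}

# Run the upgrade to the version passed as the first argument.
# If the operation fails, the CLI reports the error in the terminal.
aks_upgrade() {
  az aks upgrade \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --kubernetes-version "$1"
}
```

Usage would be something like `aks_state` to confirm what the cluster reports, then `aks_upgrade 1.18.17` to run the upgrade and read whatever error comes back.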

You can also use the CLI to check for available upgrades; see "Check for available AKS cluster upgrades".
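A minimal sketch of that check, again with placeholder names for the resource group and cluster and a made-up helper function name; `az aks get-upgrades` is the real command:

```shell
# Hypothetical names: replace with your own resource group and cluster.
RESOURCE_GROUP="my-resource-group"
CLUSTER_NAME="my-aks-cluster"

# List the Kubernetes versions this cluster can be upgraded to.
aks_available_upgrades() {
  az aks get-upgrades \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --output table
}
```

Calling `aks_available_upgrades` prints a table of the current version and the versions available as upgrade targets.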

The CLI seems to be a bit more informative than the Portal when troubleshooting.

-- Jim
Source: StackOverflow