How to determine the cause of an AKS Kubernetes cluster failure

1/18/2019

I have a production AKS Kubernetes cluster hosted in uk-south that has become unstable and unresponsive:

image 1

From the image, you can see that several pods are in varying unready states (i.e. terminating/unknown), and the ones that report as running are inaccessible.

I can see from the Insights grid that the issue started at around 9.50pm last night:

image 2

I've scoured the logs in the AKS service itself and the Kibana logs for the apps running on the cluster around the time of the failure, but I am struggling to see anything that looks to have caused this.

Luckily I have two clusters serving production under a traffic manager, so I have routed all traffic to the healthy one. My worry is that I need to understand what caused this, especially in case the same happens on the other cluster, as there would be production downtime while I spin up a new one.

My question is: am I missing any obvious places to look for information on what caused the issue? Are there any event logs that may point to what the problem is?

-- Declan McNulty
azure
azure-aks
azure-kubernetes
kubernetes

2 Answers

1/25/2019

Just a hunch, but check https://github.com/Azure/AKS/issues/305; there are steps there to identify and correct this.

-- user10891134
Source: StackOverflow

1/18/2019

I would suggest examining the K8s event log around the time your nodes went "not ready".

Try opening the "Insights" Nodes tab and choosing a timeframe (at the top) around the time when things went wrong. See what the node statuses are. Any pressures? You can see that in the property panel to the right of the node list. The property panel also contains a link to the event logs for that timeframe. Note, though, that the event log link on the node's property panel constructs a complicated query that shows only events tagged with that node.

You can get this information with simpler queries (and run more fun queries as well) in Logs. Open the "Logs" tab in the left menu on the cluster and execute a query similar to this one (change the time interval to the one you need):

let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeEvents_CL
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| order by TimeGenerated desc
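
If the full event list is too noisy, a narrowed query along these lines may help. This is only a sketch under assumptions: it targets a table named KubeEvents with KubeEventType, Reason, ObjectKind, Name and Computer columns, which is how newer Container insights workspaces expose events. The KubeEvents_CL table used above may name its columns differently, so adjust the table and column names to match your workspace.

```kusto
// Sketch: group Warning events by object and reason to see what failed first.
// Assumption: table KubeEvents with columns KubeEventType, Reason, ObjectKind, Name, Computer.
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeEvents
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| where KubeEventType == "Warning"
| summarize Occurrences = count(), FirstLogged = min(TimeGenerated) by Computer, ObjectKind, Name, Reason
| order by FirstLogged asc
```

Sorting by the earliest occurrence puts the first warnings at the top, which is usually where the root cause shows up rather than its knock-on effects.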

See if you have events indicating what went wrong. It is also worth looking at the node inventory for your cluster. Nodes report their K8s status. It was "Ready" prior to the problem, then something went wrong, so what is the status now? Out of disk, by any chance?

let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeNodeInventory
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| order by TimeGenerated desc
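
To jump straight to the problem records, the same query can be narrowed to nodes that reported anything other than "Ready". This is a sketch, assuming KubeNodeInventory exposes Computer and Status columns and that a healthy node reports exactly "Ready" as its status:

```kusto
// Sketch: per node, show each non-Ready status and when it was first and last reported.
// Assumption: KubeNodeInventory has Computer and Status columns; healthy nodes report "Ready".
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeNodeInventory
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| where Status != "Ready"
| summarize FirstReported = min(TimeGenerated), LastReported = max(TimeGenerated), Records = count() by Computer, Status
| order by FirstReported asc
```

The FirstReported value for each node gives a timestamp to line up against the event log query above.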

-- Vitaly
Source: StackOverflow