I'm trying to alert on Kubernetes pods stacking within a single availability zone (AZ). By combining two metrics, I can already see how many pods of an application are running in each AZ. However, because the applications scale, I want the alert to be percentage-based, so that it fires when more than a given percentage of an application's pods (e.g. over 70%) are running in one AZ.
My current query:
sum(count(kube_pod_info{namespace="somenamespace", created_by_kind="StatefulSet"}) by (created_by_name, node) * on (node) group_left(az_info) kube_node_labels) by (created_by_name, az_info)
And some selected output:
{created_by_name="some-db-1",az_info="az1"} 1
{created_by_name="some-db-1",az_info="az2"} 4
{created_by_name="some-db-2",az_info="az1"} 2
{created_by_name="some-db-2",az_info="az2"} 3
For example, in the above output 4 of the 5 some-db-1 pods are stacked on az2, with only 1 pod on az1. In this scenario we would want to alert, as 80% of the some-db-1 pods are in a single AZ.
As the output covers multiple applications across multiple AZs, it feels like it may be difficult to get the percentage with a single Prometheus query, but I wondered whether anyone with more experience could offer a solution.
Thanks!
your_expression
/ ignoring(az_info) group_left
sum without(az_info)(your_expression)
will give you each AZ's share of the total for every created_by_name, and then you can do > 0.7 on that. Note that > 0.8 would not fire for the some-db-1 example, which sits at exactly 0.8.
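For reference, here is a sketch of what the full expression might look like with the query from the question substituted in. It assumes the goal is each AZ's share of one application's pods, so the denominator is the per-application total and az_info is the label dropped when matching. The 0.7 threshold comes from the "over 70%" in the question, and az_info is taken as-is from the question's query. I have not run this against a live Prometheus.

```promql
# Numerator: pods per application per AZ (the query from the question).
sum by (created_by_name, az_info) (
  count by (created_by_name, node) (
    kube_pod_info{namespace="somenamespace", created_by_kind="StatefulSet"}
  )
  * on (node) group_left(az_info) kube_node_labels
)
# Many-to-one match: several AZ series on the left share one total on the right.
/ ignoring(az_info) group_left
# Denominator: the same expression with the AZ dimension summed away.
sum by (created_by_name) (
  count by (created_by_name, node) (
    kube_pod_info{namespace="somenamespace", created_by_kind="StatefulSet"}
  )
  * on (node) group_left(az_info) kube_node_labels
)
# Keep only the series where one AZ holds more than 70% of the pods.
> 0.7
```

With the sample output above, this should return only {created_by_name="some-db-1",az_info="az2"} with a value of 0.8 (4 of 5 pods). The largest share for some-db-2 is 3 of 5 = 0.6, which is below the threshold.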