Azure AKS Prometheus-operator double metrics

7/10/2020

I'm running Azure AKS Cluster 1.15.11 with prometheus-operator 8.15.6 installed as a helm chart and I'm seeing some different metrics displayed by Kubernetes Dashboard compared to the ones provided by prometheus Grafana.

An application pod which is being monitored has three containers in it. Kubernetes-dashboard shows that the memory consumption for this pod is ~250MB, standard prometheus-operator dashboard is displaying almost exactly double value for the memory consumption ~500MB.

At first we thought that there might be some misconfiguration on our monitoring setup. Since prometheus-operator is installed as standard helm chart, Daemon Set for node exporter ensures that every node has exactly one exporter deployed so duplicate exporters shouldn't be the reason. However, after migrating our cluster to different node pools I've noticed that when our application is running on user node pool instead of system node pool metrics does match exactly on both tools. I know that system node pool is running CoreDNS and tunnelfront but I assume these are running as separate components also I'm aware that overall it's not the best choice to run infrastructure and applications in the same node pool.

However, I'm still wondering why running application under system node pool causes metrics by prometheus to be doubled?

-- efomo
azure
kubernetes
monitoring
prometheus
prometheus-operator

1 Answer

9/4/2020

I ran into a similar problem (aks v1.14.6, prometheus-operator v0.38.1) where all my values were multiplied by a factor of 3. Turns out you have to remember to remove the extra endpoints called prometheus-operator-kubelet that are created in the kube-system-namespace during install before you remove / reinstall prometheus-operator since Prometheus aggregates the metric types collected for each endpoint.

Log in to the Prometheus-pod and check the status page. There should be as many endpoints as there are nodes in the cluster, otherwise you may have a surplus of endpoints:

Prometheus status page

-- Samir D
Source: StackOverflow