We are running a hosted Kubernetes cluster on Google Cloud (GKE) and scraping it with Prometheus.
My question is similar to this one, but I'd like to know which metrics in the K8s cluster are the most important to watch and possibly alert on.
This is more a K8s than a Prometheus question, but I'd really appreciate some hints. Please let me know if my question is too vague, so I can refine it.
etcd is the foundation of Kubernetes, so having a good set of alerts for it is important. We wrote this blog post on creating alerting rules for it and provided a base set at the end.
Further sources of important metrics in the Prometheus format are the Kubelet and cAdvisor, the API servers, and the fairly new kube-state-metrics. For those, unfortunately, I'm not aware of any public alerting rule sets comparable to the one for etcd.
Generally, you want to ensure that the components work flawlessly as applications, e.g. that all of them are running (determined via the up metric).
Then there's the Kubernetes business logic aspect.
That's no drop-in solution unfortunately, but writing alerting rules roughly covering the scope of the above examples should get you quite far.
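As a rough starting point, the two aspects above could be sketched as alerting rules like the following. This is a minimal sketch, not a vetted rule set: it assumes the Prometheus 2.x YAML rule-file format, and the group name, alert names, `for` durations, and `severity` label are hypothetical choices. The second rule additionally assumes kube-state-metrics is being scraped and exposes `kube_deployment_spec_replicas` and `kube_deployment_status_replicas_available`.

```yaml
groups:
  - name: kubernetes-basics            # hypothetical group name
    rules:
      # Component health: Prometheus sets up to 0 for a target
      # whenever a scrape of that target fails.
      - alert: TargetDown              # hypothetical alert name
        expr: up == 0
        for: 5m                        # assumed duration, tune to taste
        labels:
          severity: critical           # assumed label convention
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"

      # Business logic: a deployment has fewer available replicas
      # than its spec asks for (metrics from kube-state-metrics).
      - alert: DeploymentReplicasMismatch   # hypothetical alert name
        expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
        for: 15m                       # assumed duration, tune to taste
        labels:
          severity: warning            # assumed label convention
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} is below its desired replica count"
```

The `for` clause keeps short blips, such as a rolling update or a single missed scrape, from firing an alert. The right durations depend on how quickly you need to react.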