How to alert on Kubernetes cluster health?

9/7/2016

We are running a hosted Kubernetes cluster on Google Cloud (GKE) and scraping it with Prometheus.

My question is similar to this one, but I'd like to know which metrics in the K8s cluster are the most important to watch and possibly alert on.

This is more of a K8s than a Prometheus question, but I'd really appreciate some hints. Please let me know if my question is too vague, so I can refine it.

-- tex
google-kubernetes-engine
kubernetes
prometheus

1 Answer

10/17/2016

etcd is the foundation of Kubernetes, so having a good set of alerts for it is important. We wrote this blog post on creating alerting rules for it and provided a base set at the end.

Further sources of important metrics in the Prometheus format are the Kubelet and cAdvisor, the API servers, and the fairly new kube-state-metrics. Unfortunately, I'm not aware of any public alerting rule sets for those like the one for etcd.

Generally, you want to ensure that the components themselves work flawlessly as applications, e.g.:

  • Are my kubelets/API servers running/reachable? (up metric)
  • Are their response latency and error rates within bounds?
  • Can the API servers reach etcd?
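
As a rough illustration of the first two checks, here is a minimal Python sketch of the conditions an alerting rule would encode. Only the `up` metric comes from the list above. The `ComponentSample` type, the field names, the job and instance labels, and the thresholds are all hypothetical, and in a real setup Prometheus would evaluate these conditions itself from scraped metrics.

```python
from dataclasses import dataclass


@dataclass
class ComponentSample:
    """One scraped component instance, summarized over an evaluation window."""
    job: str              # e.g. "kubelet" or "apiserver" (label values are assumptions)
    instance: str
    up: int               # value of the Prometheus `up` metric: 1 if the scrape succeeded
    requests: float       # requests served during the window
    errors: float         # requests that failed during the window
    p99_latency_s: float  # 99th percentile response latency in seconds


def component_alerts(samples, max_error_ratio=0.01, max_p99_latency_s=1.0):
    """Return (job, instance, reason) tuples for every violated condition."""
    alerts = []
    for s in samples:
        if s.up != 1:
            # An unreachable target has no trustworthy latency or error data.
            alerts.append((s.job, s.instance, "down"))
            continue
        if s.requests > 0 and s.errors / s.requests > max_error_ratio:
            alerts.append((s.job, s.instance, "high_error_ratio"))
        if s.p99_latency_s > max_p99_latency_s:
            alerts.append((s.job, s.instance, "high_latency"))
    return alerts


# Made-up sample data for illustration only.
samples = [
    ComponentSample("kubelet", "node-1", up=1, requests=200, errors=0, p99_latency_s=0.2),
    ComponentSample("kubelet", "node-2", up=0, requests=0, errors=0, p99_latency_s=0.0),
    ComponentSample("apiserver", "master-1", up=1, requests=1000, errors=50, p99_latency_s=2.5),
]
for alert in component_alerts(samples):
    print(alert)
```

The thresholds (1% errors, 1 second p99) are placeholders; sensible bounds depend on the cluster and should be tuned against its normal behavior.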

Then there's the Kubernetes business logic aspect, e.g.:

  • Are there pods that have been stuck in a non-ready/crashloop state for a long time?
  • Do I have enough CPU/memory capacity in my cluster?
  • Are my deployment replica expectations fulfilled?
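
The three questions above can be sketched the same way. This is an assumption-laden illustration in Python, not a ready-made rule set: the dictionary keys, the 15-minute not-ready window, and the 90% CPU threshold are hypothetical, and the inputs stand in for values you would derive from sources such as kube-state-metrics and cAdvisor.

```python
def cluster_alerts(pods, deployments, cpu_requested, cpu_allocatable,
                   max_not_ready_s=900, max_cpu_ratio=0.9):
    """Return (reason, object_name) tuples for cluster-level problems."""
    alerts = []
    for pod in pods:
        # Short not-ready periods are normal during startup, so only flag
        # pods that have been not ready for longer than the window.
        if not pod["ready"] and pod["not_ready_s"] > max_not_ready_s:
            alerts.append(("pod_not_ready", pod["name"]))
    for d in deployments:
        if d["available"] < d["desired"]:
            alerts.append(("deployment_replicas_mismatch", d["name"]))
    # Capacity check: requested CPU as a share of what the nodes can allocate.
    if cpu_allocatable > 0 and cpu_requested / cpu_allocatable > max_cpu_ratio:
        alerts.append(("cpu_capacity_low", "cluster"))
    return alerts


# Made-up cluster state for illustration only.
pods = [
    {"name": "web-1", "ready": True, "not_ready_s": 0},
    {"name": "web-2", "ready": False, "not_ready_s": 3600},
    {"name": "job-1", "ready": False, "not_ready_s": 30},
]
deployments = [
    {"name": "web", "desired": 3, "available": 2},
    {"name": "api", "desired": 2, "available": 2},
]
for alert in cluster_alerts(pods, deployments, cpu_requested=19.0, cpu_allocatable=20.0):
    print(alert)
```

A memory check would follow the same pattern as the CPU one. In a Prometheus alerting rule, the "longer than the window" logic is normally expressed with the rule's hold duration rather than computed by hand.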

Unfortunately, that's no drop-in solution, but writing alerting rules that roughly cover the scope of the examples above should get you quite far.

-- fabxc
Source: StackOverflow