If I start up a fresh clean empty minikube and helm install
the latest stable/prometheus-operator
with strictly default settings, I see four active Prometheus alarms.
In this super simplified scenario, where a clean fresh minikube is running absolutely nothing other than Prometheus, there should be no problems and no alarms. Are these alarms bogus or broken? Is something wrong with my setup, or should I submit a bug report and disable these alarms for the time being?
Here are my basic setup steps:
minikube delete
# Any lower memory/cpu settings will experience problems
minikube start --memory 10240 --cpus 4 --kubernetes-version v1.12.2
eval $(minikube docker-env)
helm init
helm repo update
# wait a minute for Helm Tiller to start up.
helm install --name my-prom stable/prometheus-operator
Wait several minutes for everything to start up, then run port forwarding on Prometheus server and on Grafana:
kubectl port-forward service/my-prom-prometheus-operato-prometheus 9090:9090
kubectl port-forward service/my-prom-grafana 8080:80
Then go to http://localhost:9090/alerts
and see:
DeadMansSwitch (1 active)
KubeControllerManagerDown (1 active)
KubeSchedulerDown (1 active)
TargetDown (1 active)
Are these bogus? Is something genuinely wrong? Should I disable these?
Two of these alarms are missing metrics:
absent(up{job="kube-controller-manager"} == 1)
absent(up{job="kube-scheduler"} == 1)
In http://localhost:9090/config
, I don't see either job configured, but I do see two closely related jobs with job_name
values of default/my-prom-prometheus-operato-kube-controller-manager/0
and default/my-prom-prometheus-operato-kube-scheduler/0
. This suggests that job_name
values are supposed to match and there is a bug where they do not match. I also don't see any collected metrics for either job. Are slashes allowed in job names?
The other two alarms: the first is DeadMansSwitch, whose expression is vector(1); I have no idea what this is. The second is TargetDown, firing for up{job="kubelet"}, which has two metric values: one up with a value of 1.0 and one down with a value of 0.0. The up value is for endpoint="http-metrics" and the down value is for endpoint="cadvisor". Is that latter endpoint supposed to be up? Why wouldn't it be?
If I go to http://localhost:9090/graph and run sum(up) by (job), I see 1.0 values for all of:
{job="node-exporter"}
{job="my-prom-prometheus-operato-prometheus"}
{job="my-prom-prometheus-operato-operator"}
{job="my-prom-prometheus-operato-alertmanager"}
{job="kubelet"}
{job="kube-state-metrics"}
{job="apiserver"}
fyi, kubectl version
shows:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:16Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:43:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
The Watchdog alert (formerly named DeadMansSwitch) is:
An alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver.
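A quick end-to-end check (assuming the port-forward to localhost:9090 from the question is still running) is to ask Prometheus for its currently firing alerts; depending on the chart version the name shown is DeadMansSwitch or Watchdog:
# Should print the always-firing alert's name if the rule is loaded and firing.
curl -s http://localhost:9090/api/v1/alerts | grep -oE 'DeadMansSwitch|Watchdog'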
In Minikube, the kube-controller-manager
and kube-scheduler
listen by default on 127.0.0.1, so Prometheus cannot scrape metrics from them. You need to start Minikube with these components listening on all interfaces:
minikube start --kubernetes-version v1.12.2 \
--bootstrapper=kubeadm \
--extra-config=scheduler.address=0.0.0.0 \
--extra-config=controller-manager.address=0.0.0.0
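After restarting with these flags, a rough sanity check is that the metrics endpoints are reachable on the node IP rather than only on loopback. The insecure metrics ports below (10251 for the scheduler, 10252 for the controller-manager) are the Kubernetes 1.12 defaults; adjust if your version differs:
# Both endpoints should return Prometheus-format metrics if bound to 0.0.0.0.
curl -s http://$(minikube ip):10251/metrics | head -n 3   # kube-scheduler
curl -s http://$(minikube ip):10252/metrics | head -n 3   # kube-controller-manager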
Another cause of TargetDown
is that the default service selectors created by the Prometheus Operator Helm chart don't match the labels used by the Minikube components. You need to make them match by setting the kubeControllerManager.selector
and kubeScheduler.selector
Helm parameters.
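For example, something along these lines; the exact value paths and label values here are an assumption, so check the chart's values.yaml and the labels on your kube-system pods before relying on them:
# Hypothetical selector overrides -- verify the real labels first with:
#   kubectl -n kube-system get pods --show-labels
helm upgrade my-prom stable/prometheus-operator \
  --set kubeControllerManager.selector.component=kube-controller-manager \
  --set kubeScheduler.selector.component=kube-scheduler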
Take a look at this article: Trying Prometheus Operator with Helm + Minikube. It addresses all of these problems, how to solve them, and much more.
The DeadMansSwitch alarm is vector(1), an expression that always returns a value, so the alert always fires; it is generally used to test whether your Alertmanager pipeline is working.
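If you want to see it arriving on the Alertmanager side as well, you can port-forward the Alertmanager service the chart created (the service name below is inferred from the job list in the question; adjust it to whatever kubectl get svc shows) and list its alerts:
# Forward Alertmanager locally, then list the alerts it currently holds;
# the always-firing DeadMansSwitch/Watchdog alert should be among them.
kubectl port-forward service/my-prom-prometheus-operato-alertmanager 9093:9093 &
curl -s http://localhost:9093/api/v1/alerts | grep -oE 'DeadMansSwitch|Watchdog'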
You are possibly hitting this issue: https://github.com/coreos/prometheus-operator/issues/1001
Hope this helps.