We have set up a full Prometheus stack - Prometheus/Grafana/Alertmanager/Node Exporter/Blackbox Exporter - using the community Helm charts in our Kubernetes cluster. The monitoring stack is deployed in its own namespace, and our main software, composed of microservices, is deployed in the default namespace. Alerting works fine, but the Blackbox Exporter is (I guess) not scraping metrics correctly and regularly fires false positive alerts. We use it to probe our microservices' HTTP liveness/readiness endpoints.
My configuration (in values.yaml) related to the issue looks like:
- alert: InstanceDown
  expr: up == 0
  for: 5m
  annotations:
    title: 'Instance {{ $labels.instance }} down'
    description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
- alert: ExporterIsDown
  expr: up{job="prometheus-blackbox-exporter"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Blackbox exporter is down"
    description: "Blackbox exporter is down or not being scraped correctly"
...
...
...
extraScrapeConfigs: |
  - job_name: 'prometheus-blackbox-exporter'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://service1.default.svc.cluster.local:8082/actuator/health/liveness
        - http://service2.default.svc.cluster.local:8081/actuator/health/liveness
        - http://service3.default.svc.cluster.local:8080/actuator/health/liveness
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: prometheus-blackbox-exporter:9115
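For reference, with these relabel rules each static target ends up as the target query parameter and the actual scrape is redirected to the exporter, so with this config Prometheus should effectively be requesting URLs like the following (first target from the list above):

curl 'http://prometheus-blackbox-exporter:9115/probe?module=http_2xx&target=http://service1.default.svc.cluster.local:8082/actuator/health/liveness'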
These two alerts fire every hour, even though the endpoints are 100% reachable at those times.
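As a side note: for a blackbox job, up == 0 only means Prometheus failed to scrape the exporter itself; whether the probed endpoint is healthy is reported by probe_success. An alert on endpoint health would therefore usually look something like the sketch below (not part of our actual config):

- alert: EndpointDown
  expr: probe_success == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    description: 'Probe of {{ $labels.instance }} has been failing for more than 5 minutes.'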
We're using the default prometheus-blackbox-exporter/values.yaml file:
config:
  modules:
    http_2xx:
      prober: http
      timeout: 5s
      http:
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        no_follow_redirects: false
        preferred_ip_protocol: "ip4"
The corresponding emails look like this:
[5] Firing
Labels
alertname = InstanceDown
instance = http://service1.default.svc.cluster.local:8082/actuator/health/liveness
job = prometheus-blackbox-exporter
severity = critical
And another type of email:
Labels
alertname = ExporterIsDown
instance = http://service1.default.svc.cluster.local:8082/actuator/health/liveness
job = prometheus-blackbox-exporter
severity = warning
Annotations
description = Blackbox exporter is down or not being scraped correctly
summary = Blackbox exporter is down
Another odd thing I noticed is that in the Prometheus UI I don't see any probe_* metrics, as shown here: https://lapee79.github.io/en/article/monitoring-http-using-blackbox-exporter/ I'm not sure what we're doing wrong or missing, but it's very annoying to get hundreds of false positive emails.
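For anyone debugging something similar: one way to check whether the exporter can actually reach a target is to port-forward its Kubernetes Service and call the /probe endpoint by hand. The commands below are a sketch using a placeholder service name and the first target from my config:

kubectl port-forward svc/<blackbox-exporter-service> 9115:9115   # replace <blackbox-exporter-service> with the actual Service name
curl 'http://localhost:9115/probe?module=http_2xx&target=http://service1.default.svc.cluster.local:8082/actuator/health/liveness'

A working probe returns the probe_* metrics in the response, including probe_success 1 when the endpoint is reachable.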
Answering my own question. It turns out I had typed:
replacement: prometheus-blackbox-exporter:9115
but instead it must be the service name:
replacement: stage-prometheus-blackbox-exporter:9115
According to the documentation:
replacement: localhost:9115 # The blackbox exporter’s real hostname:port. For Windows and macOS replace with - host.docker.internal:9115
For Kubernetes, though, it should be the blackbox exporter's Kubernetes Service name, which is not well documented - or at least I haven't found it documented anywhere.
To get the service:
kubectl get svc -l app.kubernetes.io/name=prometheus-blackbox-exporter
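With that service name (which in our case is prefixed with the Helm release name, stage), the relabel block in extraScrapeConfigs becomes the following; only the replacement value changes. Since Prometheus and the exporter run in the same namespace the short name resolves; otherwise the namespace-qualified form stage-prometheus-blackbox-exporter.<namespace>.svc.cluster.local:9115 would be needed, <namespace> being whatever the monitoring namespace is called:

relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  # point the actual scrape at the exporter's Service (name taken from kubectl get svc above)
  - target_label: __address__
    replacement: stage-prometheus-blackbox-exporter:9115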