prometheus-blackbox-exporter is firing false positive alerts

1/22/2021

We have set up a full Prometheus stack (Prometheus, Grafana, Alertmanager, Node Exporter, Blackbox Exporter) using the community Helm charts in our Kubernetes cluster. The monitoring stack is deployed in its own namespace, and our main software, composed of microservices, is deployed in the default namespace. Alerting works fine; however, the Blackbox Exporter does not seem to be scraping metrics correctly (I guess) and regularly FIRES false positive alerts. We use it to probe our microservices' HTTP liveness/readiness endpoints.

The relevant part of my configuration (in values.yaml) looks like this:

- alert: InstanceDown
  expr: up == 0
  for: 5m
  annotations:
    title: 'Instance {{ $labels.instance }} down'
    description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
- alert: ExporterIsDown
  expr: up{job="prometheus-blackbox-exporter"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Blackbox exporter is down"
    description: "Blackbox exporter is down or not being scraped correctly"
...
...
...
extraScrapeConfigs: |
  - job_name: 'prometheus-blackbox-exporter'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://service1.default.svc.cluster.local:8082/actuator/health/liveness
        - http://service2.default.svc.cluster.local:8081/actuator/health/liveness
        - http://service3.default.svc.cluster.local:8080/actuator/health/liveness
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: prometheus-blackbox-exporter:9115
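
With these relabel rules, Prometheus never scrapes the services directly; each scrape goes to the exporter's /probe endpoint, with the original address passed along as the target parameter. The first target above, for example, effectively becomes:

http://prometheus-blackbox-exporter:9115/probe?module=http_2xx&target=http://service1.default.svc.cluster.local:8082/actuator/health/liveness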

These two alerts fire every hour, yet at those times the endpoints are 100% reachable.

We're using the default prometheus-blackbox-exporter/values.yaml file:

config:
  modules:
    http_2xx:
      prober: http
      timeout: 5s
      http:
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        no_follow_redirects: false
        preferred_ip_protocol: "ip4"
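
To sanity-check the http_2xx module by hand, the exporter service can be port-forwarded and /probe called directly (the service name here is an assumption; check yours with kubectl get svc):

kubectl port-forward svc/prometheus-blackbox-exporter 9115:9115
curl 'http://localhost:9115/probe?module=http_2xx&target=http://service1.default.svc.cluster.local:8082/actuator/health/liveness'

The response should end with probe_success 1 when the probe passes and probe_success 0 when it fails.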

The alert emails look like this:

[5] Firing
Labels
alertname = InstanceDown
instance = http://service1.default.svc.cluster.local:8082/actuator/health/liveness
job = prometheus-blackbox-exporter
severity = critical

The other type of email:

Labels
alertname = ExporterIsDown
instance = http://service1.default.svc.cluster.local:8082/actuator/health/liveness
job = prometheus-blackbox-exporter
severity = warning
Annotations
description = Blackbox exporter is down or not being scraped correctly
summary = Blackbox exporter is down

Another odd thing I noticed is that in the Prometheus UI I don't see any probe_* metrics, as shown here: https://lapee79.github.io/en/article/monitoring-http-using-blackbox-exporter/ I'm not sure what we're doing wrong or missing, but it's very annoying to get hundreds of false positive emails.
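
For reference, once the probe_* metrics do show up, the usual pattern is to alert on the probe result itself rather than only on up; a minimal sketch of such a rule:

- alert: ProbeFailed
  expr: probe_success == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'Probe failed for {{ $labels.instance }}'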

-- Joro
kubernetes
prometheus
prometheus-blackbox-exporter

1 Answer

4/8/2021

Answering my own question. It seems that I had typed:

replacement: prometheus-blackbox-exporter:9115

but instead it must be the service name:

replacement: stage-prometheus-blackbox-exporter:9115

According to the documentation:

replacement: localhost:9115 # The blackbox exporter’s real hostname:port. For Windows and macOS replace with - host.docker.internal:9115

For Kubernetes, though, it should be the blackbox exporter's Kubernetes Service name, which is not well documented; at least I haven't found it documented anywhere.
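
Putting it together, the relabel block that works for me (stage- comes from my Helm release name, so adjust it to whatever your release is called):

relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: stage-prometheus-blackbox-exporter:9115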

To get the service:

kubectl get svc -l app.kubernetes.io/name=prometheus-blackbox-exporter
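
In my case this returned something like (name, IP, and age will differ per release):

NAME                                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
stage-prometheus-blackbox-exporter   ClusterIP   10.x.x.x     <none>        9115/TCP   30d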
-- Joro
Source: StackOverflow