Prometheus up metric shows 0 even though the endpoint is reachable

10/13/2021

I have a simple pod with an nginx container that returns the text healthy on path /. Prometheus is configured to scrape port 80 on path /. When I ran up == 0 in the Prometheus dashboard, this pod was listed, which suggests it is not healthy. But when I logged into the container, nginx was running fine, and the nginx log showed Prometheus requesting / and getting a 200 response. Any idea why?

deployment.yml

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    metadata:
      labels:
        ...
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/"
        prometheus.io/port: "80"
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - name: nginx-conf
              mountPath: /etc/nginx
              readOnly: true
          ports:
            - containerPort: 80
      volumes:
        - name: nginx-conf
          configMap:
            name: nginx-conf
            items:
              - key: nginx.conf
                path: nginx.conf

nginx.conf

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    http {
      server {
        listen 80;

        location / {
          return 200 'healthy\n';
        }
      }
    }

nginx access log

192.168.88.81 - - [xxx +0000] "GET / HTTP/1.1" 200 8 "-" "Prometheus/2.26.0"
192.168.88.81 - - [xxx +0000] "GET / HTTP/1.1" 200 8 "-" "Prometheus/2.26.0"
192.168.88.81 - - [xxx +0000] "GET / HTTP/1.1" 200 8 "-" "Prometheus/2.26.0"
-- user3908406
kubernetes
prometheus

2 Answers

10/13/2021

When you add these annotations to a pod, Prometheus expects the given path to return metrics in a format it can parse. The body 'healthy\n' is not valid Prometheus metrics output, so the scrape fails and up is 0 even though the HTTP status is 200.

      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/"
        prometheus.io/port: "80"
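
To illustrate what a parseable response would look like, here is a minimal sketch of the ConfigMap from the question, changed so that / returns one sample in the Prometheus text exposition format. It is only an illustration, not the fix recommended below. The metric name nginx_healthy is hypothetical, and the events block is an addition because nginx expects one in a complete nginx.conf.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    events {}
    http {
      server {
        listen 80;

        location / {
          # "nginx_healthy" is a hypothetical metric name. The body is a
          # single "<name> <value>" sample line terminated by a newline.
          return 200 'nginx_healthy 1\n';
        }
      }
    }
```

With a body like this, the scrape should parse and up should report 1 for the pod.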

Recommended Fix:

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    metadata:
      labels:
        ...
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "9113"
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - name: nginx-conf
              mountPath: /etc/nginx
              readOnly: true
          ports:
            - containerPort: 80
        - name: nginx-exporter
          args:
          # URL of the nginx stub_status page; nginx.conf must expose a stub_status location for this to work
          - "-nginx.scrape-uri=http://localhost:80/stub_status"
          image: nginx/nginx-prometheus-exporter:0.9.0
          ports:
            - containerPort: 9113
      volumes:
        - name: nginx-conf
          configMap:
            name: nginx-conf
            items:
              - key: nginx.conf
                path: nginx.conf
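
Note that the exporter sidecar above reads http://localhost:80/stub_status, which the nginx.conf from the question does not serve. Below is a minimal sketch of a ConfigMap that adds it. It assumes the nginx image is built with the stub_status module (the official image normally is) and that restricting the page to localhost suits your setup. The events block is also an addition, since nginx expects one in a complete nginx.conf.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    events {}
    http {
      server {
        listen 80;

        location / {
          return 200 'healthy\n';
        }

        # Status page read by the nginx-prometheus-exporter sidecar.
        location /stub_status {
          stub_status;
          allow 127.0.0.1;  # the sidecar shares the pod's network namespace
          deny all;
        }
      }
    }
```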

Now try querying nginx_up in Prometheus. The nginx-prometheus-exporter also ships with a Grafana dashboard that you can try.

-- Kamol Hasan
Source: StackOverflow

10/13/2021

When Prometheus scrapes an endpoint it expects metrics. Typical metrics look like this:

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.3234e-05
go_gc_duration_seconds{quantile="0.25"} 1.7335e-05

A bare "healthy" does not follow this format, so Prometheus fails to parse the response and marks the target as down. For this kind of check there is the blackbox exporter, which is designed to check endpoints from the user's perspective (this is what black-box monitoring is). The exporter performs HTTP requests and turns the results into metrics. For example, it can check whether the response code was 200 or whether the response body contains certain text. Here are sample metrics returned by this exporter (note probe_success, which plays the same role for the probed endpoint as up does for a scrape target):

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.026007318
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.550007522
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length -1
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.098082009
probe_http_duration_seconds{phase="processing"} 0.154402544
probe_http_duration_seconds{phase="resolve"} 0.038066771
probe_http_duration_seconds{phase="tls"} 0.209702302
probe_http_duration_seconds{phase="transfer"} 0.047839785
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 1
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 1
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 87617
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 2
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 8.57979034e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.639030838e+09
# HELP probe_ssl_last_chain_expiry_timestamp_seconds Returns last SSL chain expiry in timestamp seconds
# TYPE probe_ssl_last_chain_expiry_timestamp_seconds gauge
probe_ssl_last_chain_expiry_timestamp_seconds 1.639030838e+09
# HELP probe_ssl_last_chain_info Contains SSL leaf certificate information
# TYPE probe_ssl_last_chain_info gauge
probe_ssl_last_chain_info{fingerprint_sha256="ef4eaeb464efb33f5332b365a350b2b06588ea71837af27f83d45b726d19af2a"} 1
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# HELP probe_tls_version_info Contains the TLS version used
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.2"} 1
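
As a rough sketch of how this could be wired up for the pod in the question, a blackbox exporter module and a matching Prometheus scrape job might look like the following. The module name http_healthy, the target URL http://nginx.default.svc:80/ and the exporter address blackbox-exporter:9115 are assumptions for illustration; adjust them to your cluster.

```yaml
# blackbox exporter configuration (for example blackbox.yml)
modules:
  http_healthy:                # hypothetical module name
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
      # Fail the probe unless the body contains the expected text.
      fail_if_body_not_matches_regexp:
        - "healthy"
---
# Prometheus scrape job (part of prometheus.yml)
scrape_configs:
  - job_name: blackbox-nginx
    metrics_path: /probe
    params:
      module: [http_healthy]
    static_configs:
      - targets:
          - http://nginx.default.svc:80/   # assumed URL of the nginx endpoint
    relabel_configs:
      # Pass the original target to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the blackbox exporter (assumed address).
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

With a setup along these lines, querying probe_success == 0 would list endpoints whose probe failed, much like up == 0 does for regular scrape targets.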
-- anemyte
Source: StackOverflow