I deployed prometheus server (+ kube-state-metrics + node-exporter + alertmanager) through the prometheus helm chart using the chart's default values, including the chart's default scrape_configs. The problem is that I expect certain metrics to come from a particular job, but they are instead coming from a different one.
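(For context, the deployment was essentially the following, with no value overrides; the release name and chart version show up in the metric labels further down.)

helm install --name get-prometheus stable/prometheus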
For example, node_cpu_seconds_total is being provided by the kubernetes-service-endpoints job, but I expect it to come from the kubernetes-nodes job, i.e. node-exporter. The returned metric's values are accurate, but the problem is that I don't have the labels that would normally come from kubernetes-nodes (since the kubernetes-nodes job has role: node vs. role: endpoint for kubernetes-service-endpoints). I need these missing labels for advanced querying + dashboards.
Output of node_cpu_seconds_total{mode="idle"}:
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="0",heritage="Tiller",instance="10.80.20.46:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 423673.44
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="0",heritage="Tiller",instance="10.80.20.52:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 417097.16
There are no errors in the logs, and I do have other kubernetes-nodes metrics such as up and storage_operation_errors_total, so node-exporter is getting scraped. I also verified manually that node-exporter has this particular metric, node_cpu_seconds_total, with curl <node IP>:9100/metrics | grep node_cpu, and it has results.
Does the order of the job definitions matter? Would one job override the other's metrics if they have the same name? Should I be dropping metrics for the kubernetes-service-endpoints job? I'm new to prometheus, so any detailed help is appreciated.
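For what it's worth, I'm guessing that dropping (if that's even the right approach) would look something like the metric_relabel_configs sketch below, where the node_.* regex is just my guess at how to match node-exporter's series:

  - job_name: 'kubernetes-service-endpoints'
    # ...chart defaults...
    metric_relabel_configs:
      # hypothetical: drop node-exporter series from this job
      - source_labels: [__name__]
        regex: node_.*
        action: drop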
From the scrape configs, the kubernetes-nodes job probes https://kubernetes.default.svc:443/api/v1/nodes/${node_name}/proxy/metrics, while the kubernetes-service-endpoints job probes every endpoint of the services that have prometheus.io/scrape: true defined, which includes node-exporter. So with your configs, the node_cpu_seconds_total metric definitely comes from the kubernetes-service-endpoints job.
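For reference, the chart's default kubernetes-nodes job looks roughly like this (from memory, so details may differ between chart versions); every target's address and path get rewritten to the API server's node proxy, so that job only ever returns the kubelet's metrics, never node-exporter's:

- job_name: 'kubernetes-nodes'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    # copy node labels onto the scraped series
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    # send every scrape through the API server's node proxy
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/$1/proxy/metrics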
I was able to figure out how to add the "missing" labels by navigating to the prometheus service-discovery status UI page. This page shows all the "Discovered Labels" that can be processed and kept through relabel_configs; whatever is processed/kept shows next to "Discovered Labels", under "Target Labels". So then it was just a matter of modifying the kubernetes-service-endpoints job config in scrape_configs to add more target labels. Below is exactly what I changed in the chart's scrape_configs. With this new config, I get namespace, service, pod, and node added to all metrics if the metric didn't already have them (see honor_labels).
  - job_name: 'kubernetes-service-endpoints'
+   honor_labels: true
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
-       target_label: kubernetes_namespace
+       target_label: namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
-       target_label: kubernetes_name
+       target_label: service
+     - source_labels: [__meta_kubernetes_pod_name]
+       action: replace
+       target_label: pod
+     - source_labels: [__meta_kubernetes_pod_node_name]
+       action: replace
+       target_label: node
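With these labels in place, node-level aggregations work directly against the node-exporter series, for example (the 5m window is just an arbitrary choice):

sum by (node, mode) (rate(node_cpu_seconds_total{job="kubernetes-service-endpoints"}[5m]))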