Is there a way to monitor a Kubernetes CronJob?
I have a Kubernetes CronJob that runs every 10 minutes on my cluster. Is there a way to collect metrics every time the CronJob fails due to an error, or to be notified when it has not completed after a certain period of time?
You can get the info you want from here.
CronJobs create Jobs on a schedule, so you can simply look at kube_job_status_failed for the Jobs they create. One caveat: the Job name has an epoch time appended to it.
To ensure alerts resolve themselves, I'm using the following query in my alerting rule:
increase(kube_job_status_failed{job=~"mytestjob-.*"}[5m]) > 1
My cron schedule is `*/5 * * * *`, and I set `backoffLimit: 2` to limit the number of failures per run.
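For reference, here is a minimal sketch of where those two settings live in a CronJob manifest. The name `mytestjob` is taken from the selector in the query above; the container name, image, and command are placeholders, and `batch/v1` assumes Kubernetes 1.21 or newer (older clusters use `batch/v1beta1`).

```yaml
apiVersion: batch/v1            # assumption: Kubernetes 1.21+; older clusters use batch/v1beta1
kind: CronJob
metadata:
  name: mytestjob               # Jobs are created as mytestjob-<suffix>, matching "mytestjob-.*"
spec:
  schedule: "*/5 * * * *"       # every 5 minutes
  jobTemplate:
    spec:
      backoffLimit: 2           # retries allowed before the Job is marked as failed
      template:
        spec:
          restartPolicy: Never  # failed pods are replaced rather than restarted in place
          containers:
          - name: task          # placeholder container
            image: busybox      # placeholder image
            command: ["sh", "-c", "echo running the task"]
```

Note that `backoffLimit` belongs to the Job spec under `jobTemplate`, not to the CronJob spec itself.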
I was able to simplify the query from this Medium post (label_replace was not working for me for some reason): https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511
My cron query looks like this (we have "component" labels on all CronJobs instead of "cronjob", but you can use whichever label you prefer):
clamp_max(
  max(
    kube_job_status_start_time
    * ON(job) GROUP_RIGHT()
    kube_job_labels{label_component!=""}
  ) BY (job, label_component)
  == ON(label_component) GROUP_LEFT()
  max(
    kube_job_status_start_time
    * ON(job) GROUP_RIGHT()
    kube_job_labels{label_component!=""}
  ) BY (label_component),
1)
* ON(job) GROUP_LEFT()
kube_job_status_failed
Plug this into the Prometheus expression browser to make sure you get results (1 means the cron failed the last time, 0 means it succeeded or hasn't run yet).
For alerting, append `!= 0` and the query will return only the CronJobs whose last run failed.
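As a rough sketch, the same query with `!= 0` appended could be wrapped in an alerting rule like this. The group name, alert name, `for` duration, and severity are assumptions for illustration, not part of the original setup.

```yaml
groups:
- name: cronjob-last-run          # hypothetical group name
  rules:
  - alert: CronJobLastRunFailed   # hypothetical alert name
    expr: |
      clamp_max(
        max(
          kube_job_status_start_time
          * ON(job) GROUP_RIGHT()
          kube_job_labels{label_component!=""}
        ) BY (job, label_component)
        == ON(label_component) GROUP_LEFT()
        max(
          kube_job_status_start_time
          * ON(job) GROUP_RIGHT()
          kube_job_labels{label_component!=""}
        ) BY (label_component),
      1)
      * ON(job) GROUP_LEFT()
      kube_job_status_failed
      != 0
    for: 1m                       # assumed; tune to your schedule
    labels:
      severity: warning           # assumed
    annotations:
      summary: 'Last run of {{ $labels.label_component }} failed'
```

Because `*` binds more tightly than `!=` in PromQL, the trailing comparison filters the result of the whole product, so no extra parentheses are needed.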
I'm using these rules with kube-state-metrics:
groups:
- name: job.rules
  rules:
  - alert: CronJobRunning
    expr: time() - kube_cronjob_next_schedule_time > 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      description: CronJob {{$labels.namespace}}/{{$labels.cronjob}} is taking more than 1h to complete
      summary: CronJob didn't finish after 1h
  - alert: JobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} is taking more than 1h to complete
      summary: Job {{$labels.job}} didn't complete after 1h
  - alert: JobFailed
    expr: kube_job_status_failed > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} failed to complete
      summary: Job failed
The tricky part here is that the CronJobs themselves have no useful status; you have to match them to the Jobs they create. I've written up an article on how to achieve this:
https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511
The article goes into a bit of detail about how things work, but the alert config is as follows:
groups:
- name: kube-cron
  rules:
  - record: job_cronjob:kube_job_status_start_time:max
    expr: |
      label_replace(
        label_replace(
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (exported_job, label_cronjob)
          == ON(label_cronjob) GROUP_LEFT()
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (label_cronjob),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")
  - record: job_cronjob:kube_job_status_failed:sum
    expr: |
      clamp_max(
        job_cronjob:kube_job_status_start_time:max,
        1)
      * ON(job) GROUP_LEFT()
      label_replace(
        label_replace(
          (kube_job_status_failed != 0),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")
  - alert: CronJobStatusFailed
    expr: |
      job_cronjob:kube_job_status_failed:sum
      * ON(cronjob) GROUP_RIGHT()
      kube_cronjob_labels
      > 0
    for: 1m
    annotations:
      description: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
The jobTemplate must include a label called `cronjob` that matches the name of the CronJob object.
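A minimal sketch of that label requirement, assuming a CronJob named `mytestjob`. The container details are placeholders, and `batch/v1` assumes Kubernetes 1.21 or newer. The label goes under `spec.jobTemplate.metadata.labels` so that it lands on each created Job and is exported by kube-state-metrics as `label_cronjob` on `kube_job_labels`.

```yaml
apiVersion: batch/v1              # assumption: Kubernetes 1.21+; older clusters use batch/v1beta1
kind: CronJob
metadata:
  name: mytestjob                 # hypothetical CronJob name
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    metadata:
      labels:
        cronjob: mytestjob        # must equal metadata.name above
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task            # placeholder container
            image: busybox        # placeholder image
            command: ["sh", "-c", "echo running the task"]
```

Depending on the kube-state-metrics version, Job labels may only be exported if they are explicitly allow-listed (for example via `--metric-labels-allowlist` in 2.x releases), so it is worth checking that `kube_job_labels{label_cronjob!=""}` returns series before relying on the rules.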
The way to monitor CronJobs with Prometheus is to have them push a metric recording the last time they succeeded to the Pushgateway. You can then alert if a CronJob hasn't succeeded recently enough.
The kube-state-metrics exporter also includes various CronJob-related metrics: https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/cronjob-metrics.md, but unfortunately it doesn't seem to include CronJob success/failure.
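The push approach could be sketched roughly as follows. Everything specific here is an assumption for illustration: the metric name `cronjob_last_success_timestamp_seconds`, the Pushgateway address `pushgateway.monitoring.svc:9091`, the image, and the `/app/run-task` entrypoint. The container runs the task and, only if it exits successfully, pushes the current Unix time to the Pushgateway's `/metrics/job/<name>` endpoint.

```yaml
apiVersion: batch/v1              # assumption: Kubernetes 1.21+; older clusters use batch/v1beta1
kind: CronJob
metadata:
  name: mytestjob                 # hypothetical CronJob name
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: example.com/mytestjob:latest   # hypothetical image; needs sh and curl
            command: ["/bin/sh", "-c"]
            args:
            - |
              set -e
              /app/run-task       # hypothetical entrypoint; a failure stops the script here
              # Reached only on success: record the completion time.
              echo "cronjob_last_success_timestamp_seconds $(date +%s)" \
                | curl --fail --silent --data-binary @- \
                  http://pushgateway.monitoring.svc:9091/metrics/job/mytestjob
```

An alert expression along the lines of `time() - cronjob_last_success_timestamp_seconds{job="mytestjob"} > 1800` would then fire when no successful run has been recorded for 30 minutes. The threshold is arbitrary, and the `job` label matches only if the Pushgateway is scraped with `honor_labels: true`.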