I want to get an alert when the most recent job of a CronJob fails. The expression kube_job_status_failed{job_name=~"cronjobname.*"}==1 works most of the time.
But if a failed job is kept, I still get an alert even after the next job succeeds, because Prometheus then holds two series: one for the failed job and one for the successful job.
I found that I can get the latest job timestamp from kube_cronjob_status_last_schedule_time{cronjob="cronjobname"} and then use kube_job_status_failed{job_name="cronjobname-TIMESTAMP"} to query the status of the last job.
Is there a way, in a single query, to build the job name from the result of the first query and use it as a filter in the second? Something like kube_job_status_failed{job_name=string_concatenate("cronjobname-", kube_cronjob_status_last_schedule_time{cronjob="cronjobname"})}
With PromQL, you can't do it the way you describe: a label matcher takes a literal string or a regex, not the result of another query. Moreover, I am not sure the last schedule time is always the same as the job start time, for example if there is a slowdown or a reschedule somewhere.
You can follow the approach described in this article. An alternative is to use the job metrics to determine:
the timestamp of the last failed job per cronjob:
- record: job_cronjob:kube_job_status_start_time:last_failed
  expr: max((kube_job_status_start_time AND kube_job_status_failed == 1)
    * ON(job_name,namespace) GROUP_LEFT(label_cronjob)
    kube_job_labels{label_cronjob!=""}
    ) BY(label_cronjob)
the timestamp of the last successful job per cronjob:
- record: job_cronjob:kube_job_status_start_time:last_suceeded
  expr: max((kube_job_status_start_time AND kube_job_status_succeeded == 1)
    * ON(job_name,namespace) GROUP_LEFT(label_cronjob)
    kube_job_labels{label_cronjob!=""}
    ) BY(label_cronjob)
Then alert if the last failed job is more recent than the last successful one:
- alert: CronJobStatusFailed
  expr: job_cronjob:kube_job_status_start_time:last_failed
    > job_cronjob:kube_job_status_start_time:last_suceeded
  for: 1m
  annotations:
    description: '{{ $labels.label_cronjob }} last run has failed.'
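To make the comparison easier to reason about, here is a minimal Python sketch of the same logic run offline on in-memory data. It is only an illustration, not something Prometheus executes: the `JobSample` type, the function names, and the sample values are all made up for this example. It takes the latest start time of failed and of succeeded jobs per cronjob and flags a cronjob only when the failure is the more recent of the two.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class JobSample:
    """Hypothetical flattened view of one job's metrics."""
    cronjob: str       # value of label_cronjob
    start_time: float  # kube_job_status_start_time (Unix seconds)
    failed: bool       # True when kube_job_status_failed == 1
    succeeded: bool    # True when kube_job_status_succeeded == 1


def last_start_time(samples: Iterable[JobSample], attr: str) -> Dict[str, float]:
    """Mirror of max(...) BY(label_cronjob), restricted to jobs where `attr` is set."""
    latest: Dict[str, float] = {}
    for s in samples:
        if getattr(s, attr):
            latest[s.cronjob] = max(latest.get(s.cronjob, s.start_time), s.start_time)
    return latest


def failing_cronjobs(samples: Iterable[JobSample]) -> List[str]:
    """Cronjobs whose latest failed job started after their latest successful job."""
    samples = list(samples)
    last_failed = last_start_time(samples, "failed")
    last_succeeded = last_start_time(samples, "succeeded")
    # Like `>` between two PromQL vectors, only cronjobs present on BOTH
    # sides are compared; a cronjob with no successful job is skipped.
    return sorted(
        name
        for name, failed_at in last_failed.items()
        if name in last_succeeded and failed_at > last_succeeded[name]
    )


EXAMPLE = [
    JobSample("backup", 1000, failed=True, succeeded=False),
    JobSample("backup", 2000, failed=False, succeeded=True),   # later success clears it
    JobSample("report", 1000, failed=False, succeeded=True),
    JobSample("report", 2000, failed=True, succeeded=False),   # latest run failed
    JobSample("cleanup", 1500, failed=True, succeeded=False),  # never succeeded
]

if __name__ == "__main__":
    print(failing_cronjobs(EXAMPLE))
```

In this made-up data only `report` is flagged. `backup` has an old failure that a later success supersedes, which is the case from the question. `cleanup` has failed but never succeeded, so the comparison has nothing to match against; the alert rule behaves the same way, so that case may need a separate rule.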