Is there a way to filter metric by a string value a part of which comes from the result of another query in prometheus query?

9/19/2019

I want to get an alert when the recent job of a cronjob fails. The expr kube_job_status_failed{job_name=~"cronjobname.*"}==1 works for most of time. But if a job fails and it's kept, even the next job succeeds, I still get an alert because there are two records in prometheus, one of which is the failure record, the other one is the success record.

I found I can get the latest job timestamp from kube_cronjob_status_last_schedule_time{cronjob="cronjobname"}, then use kube_job_status_failed{job_name="cronjobname-TIMESTAMP"} to query the last job status.

I wonder whether we have a way in one query to concatenate the jobname from the result of the first query and filter in the second? like kube_job_status_failed{job_name=string_concatenate("cronjobname-", kube_cronjob_status_last_schedule_time{cronjob="cronjobname"})}

-- Aaron LUO
kubernetes
kubernetes-cronjob
prometheus
promql

1 Answer

9/19/2019

With promql, you won't be able to have something the way you describe it. Moreover, I am not sure the last schedule time is always the same as the job start time; if there is a slowness or a reschedule somewhere by example.

You can follow the approach indicated in this article. An alternative one would be using the job metrics to determine:

the timestamp of the last failed job per cronjob

- record: job_cronjob:kube_job_status_start_time:last_failed
  expr: max((kube_job_status_start_time AND kube_job_status_failed == 1)
            * ON(job,namespace) GROUP_LEFT
            kube_job_labels{label_cronjob!=""}
           ) BY(label_cronjob)

the timestamp of the last successful job per cronjob

- record: job_cronjob:kube_job_status_start_time:last_suceeded
  expr: max((kube_job_status_start_time AND kube_job_status_suceeded == 1)
            * ON(job,namespace) GROUP_LEFT
            kube_job_labels{label_cronjob!=""}
           ) BY(label_cronjob)

And alert if failed one is more recent than successful one:

- alert: CronJobStatusFailed
  expr:   job_cronjob:kube_job_status_start_time:last_failed
        > job_cronjob:kube_job_status_start_time:last_suceeded
  for: 1m
  annotations:
    description: '{{ $labels.label_cronjob}} last run has failed.'
-- Michael Doubez
Source: StackOverflow