Is there a way to monitor Kubernetes CronJobs using Prometheus?

11/17/2017

Is there a way to monitor a Kubernetes CronJob?

I have a CronJob that runs every 10 minutes on my cluster. Is there a way to collect metrics every time the CronJob fails with an error, or to get notified when it has not completed within a certain period of time?

-- user3587892
kubernetes
prometheus

6 Answers

2/28/2018

You can get the info you want from here.

CronJobs create Jobs on a schedule, so you can simply look at kube_job_status_failed for the Jobs that are created. One caveat is that each Job name has an epoch timestamp at the end.

To ensure alerts resolve themselves, I'm using the following query in Alertmanager:

increase(kube_job_status_failed{job=~"mytestjob-.*"}[5m]) > 1

My cron schedule is */5 * * * *, and I set backoffLimit: 2 to limit the number of failures per run.
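For reference, here is a minimal sketch of where those two settings sit in a CronJob manifest. The name mytestjob matches the prefix used in the query above; the image and command are placeholders I made up, and the apiVersion assumes a cluster recent enough to serve batch/v1 CronJobs (older clusters used batch/v1beta1).

```yaml
apiVersion: batch/v1          # assumption: older clusters need batch/v1beta1
kind: CronJob
metadata:
  name: mytestjob             # Jobs are created as mytestjob-<suffix>
spec:
  schedule: "*/5 * * * *"     # run every 5 minutes
  jobTemplate:
    spec:
      backoffLimit: 2         # retry a failed run at most twice
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: example.com/mytestjob:latest   # hypothetical image
            command: ["/app/run-task"]            # hypothetical command
```

With restartPolicy: Never, each retry leaves behind its own failed pod, and failed pods are what kube_job_status_failed counts.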

-- user1751972
Source: StackOverflow

11/28/2018

I was able to simplify the query from this Medium post (label_replace was not working for me for some reason): https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511

My cron query looks like this (we have "component" labels on all cronjobs instead of "cronjob", but you can use your favorite label):

clamp_max(
  max(
    kube_job_status_start_time
    * ON(job) GROUP_RIGHT()
    kube_job_labels{label_component!=""}
  ) BY (job, label_component)
  == ON(label_component) GROUP_LEFT()
  max(
    kube_job_status_start_time
    * ON(job) GROUP_RIGHT()
    kube_job_labels{label_component!=""}
  ) BY (label_component),
1) * ON(job) GROUP_LEFT()
kube_job_status_failed

Plug this into the Prometheus expression browser to make sure you get results (1 means the cron failed the last time it ran, 0 means it succeeded or hasn't run yet).

For alerting, add != 0, and the query will return any cronjob whose last run failed.
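As a rough sketch, the query with != 0 appended could be dropped into a rule file like this. The group name, alert name, "for" duration, and severity are assumptions on my part; only the expression comes from the query above.

```yaml
groups:
- name: cronjob-last-run          # hypothetical group name
  rules:
  - alert: CronJobLastRunFailed   # hypothetical alert name
    expr: |
      clamp_max(
        max(
          kube_job_status_start_time
          * ON(job) GROUP_RIGHT()
          kube_job_labels{label_component!=""}
        ) BY (job, label_component)
        == ON(label_component) GROUP_LEFT()
        max(
          kube_job_status_start_time
          * ON(job) GROUP_RIGHT()
          kube_job_labels{label_component!=""}
        ) BY (label_component),
      1) * ON(job) GROUP_LEFT()
      kube_job_status_failed
      != 0
    for: 1m                       # assumption: tune to taste
    labels:
      severity: warning
    annotations:
      summary: 'Last run for component {{ $labels.label_component }} failed (job {{ $labels.job }})'
```

Since * binds tighter than != in PromQL, the comparison applies to the whole product, so only failed last runs come back.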

-- Lindsay Landry
Source: StackOverflow

1/17/2018

I'm using these rules with kube-state-metrics:

groups:
- name: job.rules
  rules:
  - alert: CronJobRunning
    expr: time() - kube_cronjob_next_schedule_time > 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      description: CronJob {{$labels.namespace}}/{{$labels.cronjob}} is taking more than 1h to complete
      summary: CronJob didn't finish after 1h

  - alert: JobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} is taking more than 1h to complete
      summary: Job {{$labels.job}} didn't complete after 1h

  - alert: JobFailed
    expr: kube_job_status_failed  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} failed to complete
      summary: Job failed

-- Camil
Source: StackOverflow

3/4/2018

The tricky part here is that the cronjobs themselves have no useful status; you have to match them to the jobs they create. I've written up an article on how to achieve this:

https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511

The article goes into a bit of detail as to how things work, but the alert config is as follows:

groups:
- name: kube-cron
  rules:
  - record: job_cronjob:kube_job_status_start_time:max
    expr: |
      label_replace(
        label_replace(
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (exported_job, label_cronjob)
          == ON(label_cronjob) GROUP_LEFT()
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (label_cronjob),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")

  - record: job_cronjob:kube_job_status_failed:sum
    expr: |
      clamp_max(
        job_cronjob:kube_job_status_start_time:max,
      1)
      * ON(job) GROUP_LEFT()
      label_replace(
        label_replace(
          (kube_job_status_failed != 0),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")


  - alert: CronJobStatusFailed
    expr: |
      job_cronjob:kube_job_status_failed:sum
      * ON(cronjob) GROUP_RIGHT()
      kube_cronjob_labels
      > 0
    for: 1m
    annotations:
      description: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'

The jobTemplate must include a label called cronjob that matches the name of the cronjob object.
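A minimal sketch of what that label looks like on the CronJob side. The name nightly-report, the schedule, and the image are made-up placeholders, and the apiVersion assumes a cluster that serves batch/v1 CronJobs.

```yaml
apiVersion: batch/v1              # assumption: older clusters need batch/v1beta1
kind: CronJob
metadata:
  name: nightly-report            # hypothetical CronJob name
spec:
  schedule: "0 2 * * *"           # hypothetical schedule
  jobTemplate:
    metadata:
      labels:
        cronjob: nightly-report   # must equal metadata.name above
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: example.com/nightly-report:latest   # hypothetical image
```

Every Job created from this template carries the label, which kube-state-metrics surfaces as label_cronjob on kube_job_labels. Depending on your kube-state-metrics version, exposing object labels may have to be enabled explicitly, so check the docs for the release you run.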

-- Tristan Colgate
Source: StackOverflow

11/17/2017

The way to monitor cronjobs with Prometheus is to have them push a metric indicating the last time they succeeded to the Pushgateway. You can then alert if the cronjob hasn't succeeded recently enough.
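One way this could look, as a sketch only: the job runs its task and, if that succeeds, pushes the current Unix time to the Pushgateway with curl. The metric name, Pushgateway address, CronJob name, image, and task command are all assumptions; the image also has to ship sh, date, and curl.

```yaml
apiVersion: batch/v1              # assumption: older clusters need batch/v1beta1
kind: CronJob
metadata:
  name: my-cronjob                # hypothetical name
spec:
  schedule: "*/10 * * * *"        # every 10 minutes, as in the question
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: example.com/my-cronjob:latest   # hypothetical image
            command:
            - /bin/sh
            - -c
            # Push the timestamp only when the task exits successfully.
            - |
              /app/run-task && \
              echo "cronjob_last_success_timestamp_seconds $(date +%s)" | \
              curl --silent --fail --data-binary @- \
                http://pushgateway.monitoring.svc:9091/metrics/job/my-cronjob
```

An alert expression along the lines of time() - cronjob_last_success_timestamp_seconds > 1200 would then fire once no successful run has been recorded for 20 minutes. That threshold is just an example, twice the schedule interval.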

-- brian-brazil
Source: StackOverflow

11/17/2017

The kube-state-metrics exporter also includes various CronJob-related metrics: https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/cronjob-metrics.md, but unfortunately it doesn't seem to include CronJob success/failure.
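What those metrics can cover is a CronJob that has stopped being scheduled at all. A minimal sketch, assuming the 10-minute schedule from the question; the group name, alert name, threshold, and "for" duration are my own choices.

```yaml
groups:
- name: cronjob-schedule              # hypothetical group name
  rules:
  - alert: CronJobNotScheduled        # hypothetical alert name
    # Fires when no Job has been scheduled for over 20 minutes
    # (assumed threshold: twice the 10-minute interval).
    expr: time() - kube_cronjob_status_last_schedule_time > 1200
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} has not been scheduled recently'
```

This says nothing about whether the scheduled runs succeeded, so it complements the Job-based alerts rather than replacing them.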

-- tom.wilkie
Source: StackOverflow