How to clean up failed CronJob-spawned Jobs once a more recent Job passes

10/20/2019

I am running management tasks using Kubernetes CronJobs and have Prometheus alerting on when one of the spawned Jobs fails using kube-state-metrics:

kube_job_status_failed{job="kube-state-metrics"}  > 0
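
For context, a minimal Prometheus alerting-rule sketch wrapping that query might look like the following (the group name, alert name, duration, and annotations are illustrative, not from the original setup):

```yaml
groups:
  - name: cronjob-alerts
    rules:
      - alert: KubeJobFailed
        # Fires while any Job tracked by kube-state-metrics reports failed pods
        expr: kube_job_status_failed{job="kube-state-metrics"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job_name }} in namespace {{ $labels.namespace }} has failed"
```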

I want it so that when a more recent Job passes, the failed ones are cleaned up and the alert stops firing.

Does the CronJob resource support this behaviour on its own?

Workarounds would be to make the Job clean up failed ones as its last step, or to create a much more complicated alert rule that treats the most recent Job as the definitive status, but neither is a particularly nice solution IMO.

Kubernetes version: v1.15.1

-- dippynark
kube-state-metrics
kubernetes
kubernetes-cronjob
prometheus

2 Answers

10/20/2019

As a workaround, the following query shows CronJobs whose most recently started Job has failed:

(
  max by (owner_name, namespace) (
    kube_job_status_start_time
    * on(job_name) group_left(owner_name)
      (
        (kube_job_status_succeeded / kube_job_status_succeeded == 1)
        + on(job_name) group_left(owner_name)
          (0 * kube_job_owner{owner_is_controller="true",owner_kind="CronJob"})
      )
  )
)
< bool
(
  max by (owner_name, namespace) (
    kube_job_status_start_time
    * on(job_name) group_left(owner_name)
      (
        (kube_job_status_failed / kube_job_status_failed == 1)
        + on(job_name) group_left(owner_name)
          (0 * kube_job_owner{owner_is_controller="true",owner_kind="CronJob"})
      )
  )
) == 1
-- dippynark
Source: StackOverflow

10/21/2019

There's a great Kubernetes guide on cleaning up jobs.

Specifically, the ttlSecondsAfterFinished field defined in the JobSpec API.

This should do what you're asking, i.e. if a bunch of failed Jobs accumulate, they will all be removed automatically once their TTL expires after finishing (this applies to every finished Job, whether it failed or succeeded).
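
As a sketch (the CronJob name, schedule, and container are placeholders), the field goes on the Job template inside the CronJob spec. Note that on v1.15 this was an alpha feature requiring the TTLAfterFinished feature gate to be enabled:

```yaml
apiVersion: batch/v1beta1   # batch/v1 on newer clusters
kind: CronJob
metadata:
  name: management-task
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      # Finished Jobs (failed or succeeded) are deleted 1 hour after they
      # complete, so failed Jobs eventually stop feeding the alert.
      ttlSecondsAfterFinished: 3600
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: busybox
              command: ["sh", "-c", "echo running management task"]
```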

-- Dandy
Source: StackOverflow