I am running management tasks using Kubernetes CronJobs and have Prometheus alerting on when one of the spawned Jobs fails using kube-state-metrics:
kube_job_status_failed{job="kube-state-metrics"} > 0
I want failed Jobs to be cleaned up once a more recent Job succeeds, so that the alert stops firing.
Does the CronJob resource support this behaviour on its own?
Workarounds would be to have each Job delete older failed Jobs as its last step, or to write a much more complicated alert rule that treats only the most recent Job as the definitive status, but neither is a particularly nice solution IMO.
Kubernetes version: v1.15.1
As a workaround, the following query shows CronJobs whose most recently finished Job failed:
(max by(owner_name, namespace) (
  kube_job_status_start_time
  * on(job_name) group_left(owner_name)
    ((kube_job_status_succeeded / kube_job_status_succeeded == 1)
     + on(job_name) group_left(owner_name)
       (0 * kube_job_owner{owner_is_controller="true",owner_kind="CronJob"}))
))
< bool
(max by(owner_name, namespace) (
  kube_job_status_start_time
  * on(job_name) group_left(owner_name)
    ((kube_job_status_failed / kube_job_status_failed == 1)
     + on(job_name) group_left(owner_name)
       (0 * kube_job_owner{owner_is_controller="true",owner_kind="CronJob"}))
)) == 1
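As a sketch of how this query could be wired into an alert, here is a minimal Prometheus rule file (the group name, alert name, duration, and labels are placeholders to adapt to your setup):

```yaml
groups:
  - name: cronjobs                       # placeholder group name
    rules:
      - alert: CronJobLastRunFailed      # placeholder alert name
        # Fires when the most recently started failed Job of a CronJob
        # is newer than its most recently started succeeded Job.
        expr: |
          (max by(owner_name, namespace) (kube_job_status_start_time * on(job_name) group_left(owner_name) ((kube_job_status_succeeded / kube_job_status_succeeded == 1) + on(job_name) group_left(owner_name) (0 * kube_job_owner{owner_is_controller="true",owner_kind="CronJob"}))))
          < bool
          (max by(owner_name, namespace) (kube_job_status_start_time * on(job_name) group_left(owner_name) ((kube_job_status_failed / kube_job_status_failed == 1) + on(job_name) group_left(owner_name) (0 * kube_job_owner{owner_is_controller="true",owner_kind="CronJob"})))) == 1
        for: 5m                          # placeholder grace period
        labels:
          severity: warning
        annotations:
          summary: "Last Job of CronJob {{ $labels.owner_name }} in {{ $labels.namespace }} failed"
```

Because the comparison keys on start time per CronJob, the alert resolves on its own as soon as a newer Job succeeds, without anything having to delete the failed Jobs.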
There's a great Kubernetes guide on cleaning up jobs.
Specifically, ttlSecondsAfterFinished,
defined in the JobSpec API.
This should do what you're asking: each finished Job, including failed ones, is automatically deleted once ttlSecondsAfterFinished seconds have passed since it finished, so stale failures don't keep the alert firing.
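As a minimal sketch, ttlSecondsAfterFinished goes in the CronJob's jobTemplate spec (the name, schedule, image, and command below are placeholders; note that on v1.15 this field is alpha and requires the TTLAfterFinished feature gate to be enabled):

```yaml
apiVersion: batch/v1beta1        # CronJob API group on Kubernetes v1.15
kind: CronJob
metadata:
  name: example-task             # placeholder name
spec:
  schedule: "*/15 * * * *"       # placeholder schedule
  jobTemplate:
    spec:
      # Each finished Job (succeeded or failed) is deleted this many
      # seconds after it completes, clearing its kube-state-metrics series.
      ttlSecondsAfterFinished: 3600
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: busybox     # placeholder image
              command: ["sh", "-c", "echo running management task"]
```

Keep in mind the TTL is a fixed delay per Job rather than "delete on next success", so a failed Job's metric can still fire the alert until its TTL expires.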