How can I programmatically determine if a job has failed for good and will not retry any more? I've seen the following on failed jobs:
status:
  conditions:
  - lastProbeTime: 2018-04-25T22:38:34Z
    lastTransitionTime: 2018-04-25T22:38:34Z
    message: Job has reach the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
However, the documentation doesn't explain why conditions is a list. Can there be multiple conditions? If so, which one do I rely on? Is it a guarantee that there will only be one with status: "True"?
If so, which one do I rely on?
You might not have to choose, considering commit dd84bba64:
When a job is complete, the controller will indefinitely update its conditions with a Complete condition.
This change makes the controller exit the reconciliation as soon as the job is already found to be marked as complete.
JobConditions is similar to PodConditions. You can read about PodConditions in the official docs.
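Since conditions is a list, one way to answer "which one do I rely on" is to scan it for a terminal condition type whose status is "True". The following is a minimal sketch, not the controller's own logic: JobCondition here is a hypothetical local stand-in for the real batch/v1 type, reduced to two fields, and terminalCondition is an illustrative helper name. With the real client library you would compare against batchv1.JobComplete, batchv1.JobFailed and corev1.ConditionTrue instead of string literals.

```go
package main

import "fmt"

// JobCondition is a hypothetical local stand-in for the Kubernetes
// batch/v1 JobCondition type, reduced to the two fields used here.
type JobCondition struct {
	Type   string // e.g. "Complete" or "Failed"
	Status string // "True", "False", or "Unknown"
}

// terminalCondition returns the type of the first terminal condition
// ("Complete" or "Failed") whose status is "True", or "" if none is set.
func terminalCondition(conds []JobCondition) string {
	for _, c := range conds {
		if c.Status != "True" {
			continue // a condition only counts while its status is "True"
		}
		if c.Type == "Complete" || c.Type == "Failed" {
			return c.Type
		}
	}
	return ""
}

func main() {
	// Mirrors the status shown in the question.
	conds := []JobCondition{{Type: "Failed", Status: "True"}}
	fmt.Println(terminalCondition(conds)) // prints "Failed"
}
```

An empty result means the job has not reached a terminal state yet, so the caller should keep polling or watching.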
Anyway, to determine whether a job has succeeded or failed, I take a different approach. Let's look at it.
There are two relevant fields in the Job spec.
One is spec.completions (default value 1), which says:
Specifies the desired number of successfully finished pods the job should be run with.
The other is spec.backoffLimit (default value 6), which says:
Specifies the number of retries before marking this job failed.
JobStatus also has two relevant fields, Succeeded and Failed. Succeeded is the number of pods that completed successfully, and Failed is the number of pods that reached phase Failed.
If Succeeded is equal to or greater than spec.completions, the job becomes completed.
If Failed is equal to or greater than spec.backoffLimit, the job becomes failed.
So, the logic is:
// Completions and BackoffLimit are *int32 and are assumed to be non-nil here.
if job.Status.Succeeded >= *job.Spec.Completions {
	return "completed"
} else if job.Status.Failed >= *job.Spec.BackoffLimit {
	return "failed"
}
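For reference, here is a self-contained sketch of the same comparison that can be run without the Kubernetes client library. JobSpec, JobStatus and jobState are hypothetical local stand-ins, not the real API types. Falling back to the defaults quoted above (1 and 6) when a pointer is nil is an assumption of this sketch: a real Job may leave completions unset on purpose, in which case this count-based check may not apply.

```go
package main

import "fmt"

// JobSpec and JobStatus are hypothetical local stand-ins for the
// batch/v1 types, reduced to the fields discussed above.
type JobSpec struct {
	Completions  *int32
	BackoffLimit *int32
}

type JobStatus struct {
	Succeeded int32
	Failed    int32
}

// jobState applies the count-based check from the answer. Nil pointers
// fall back to the quoted defaults (an assumption of this sketch).
func jobState(spec JobSpec, status JobStatus) string {
	completions := int32(1)
	if spec.Completions != nil {
		completions = *spec.Completions
	}
	backoffLimit := int32(6)
	if spec.BackoffLimit != nil {
		backoffLimit = *spec.BackoffLimit
	}
	switch {
	case status.Succeeded >= completions:
		return "completed"
	case status.Failed >= backoffLimit:
		return "failed"
	default:
		return "running"
	}
}

func main() {
	limit := int32(2)
	state := jobState(JobSpec{BackoffLimit: &limit}, JobStatus{Failed: 2})
	fmt.Println(state) // prints "failed"
}
```

The extra "running" result covers the case the fragment above leaves open, where neither threshold has been reached yet.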