How to determine if a job is failed

4/27/2018

How can I programmatically determine whether a job has failed for good and will not retry any more? I've seen the following on failed jobs:

status:
  conditions:
  - lastProbeTime: 2018-04-25T22:38:34Z
    lastTransitionTime: 2018-04-25T22:38:34Z
    message: Job has reach the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed

However, the documentation doesn't explain why conditions is a list. Can there be multiple conditions? If so, which one do I rely on? Is it a guarantee that there will only be one with status: "True"?

-- rcorre
kubernetes

2 Answers

4/27/2018

If so, which one do I rely on?

You might not have to choose, considering commit dd84bba64

When a job is complete, the controller will indefinitely update its conditions with a Complete condition.
This change makes the controller exit the reconciliation as soon as the job is already found to be marked as complete.

-- VonC
Source: StackOverflow

4/27/2018

JobConditions are similar to PodConditions. You can read about PodConditions in the official docs.

Anyway, to determine whether a job has succeeded or failed, I take a different approach. Let's look at it.
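
For readers who do want to rely on the conditions list, here is a minimal sketch of scanning it for a given type with status "True", which is how the Failed entry in the question would be detected. The JobCondition struct below is a simplified, hypothetical stand-in for the real batch/v1 type, and hasCondition is an illustrative helper name, not part of any Kubernetes library.

```go
package main

import "fmt"

// JobCondition is a simplified, hypothetical stand-in for the batch/v1
// JobCondition type; only the fields used in this sketch are included.
type JobCondition struct {
	Type   string // e.g. "Complete" or "Failed"
	Status string // "True", "False", or "Unknown"
	Reason string // e.g. "BackoffLimitExceeded"
}

// hasCondition reports whether conds contains an entry of the given type
// whose status is "True". Entries with any other status are ignored.
func hasCondition(conds []JobCondition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	// Mirrors the status block shown in the question.
	conds := []JobCondition{
		{Type: "Failed", Status: "True", Reason: "BackoffLimitExceeded"},
	}
	fmt.Println("failed:", hasCondition(conds, "Failed"))
	fmt.Println("complete:", hasCondition(conds, "Complete"))
}
```

Run against the conditions from the question, this prints "failed: true" and "complete: false". Matching on both type and status means the check still works if the list ever holds more than one entry.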


There are two relevant fields in the Job spec.

One is spec.completions (default value 1), which says,

Specifies the desired number of successfully finished pods the job should be run with.

The other is spec.backoffLimit (default value 6), which says,

Specifies the number of retries before marking this job failed.


Now, in JobStatus

There are two relevant fields in JobStatus as well: Succeeded and Failed. Succeeded is the number of pods that completed successfully, and Failed is the number of pods that reached phase Failed.

  • Once Succeeded is equal to or greater than spec.completions, the job becomes completed.
  • Once Failed is equal to or greater than spec.backoffLimit, the job becomes failed.

So the logic looks like this:

// Both spec fields are pointers (*int32) in the batch/v1 Go types.
if job.Status.Succeeded >= *job.Spec.Completions {
    return "completed"
} else if job.Status.Failed >= *job.Spec.BackoffLimit {
    return "failed"
}
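
A self-contained version of the comparison above might look like the following sketch. JobSpec and JobStatus here are simplified, hypothetical stand-ins for the real batch/v1 types, valueOr and jobOutcome are illustrative names, and the "running" result is an assumed label for the case the snippet above leaves unhandled. The fallback values 1 and 6 are the defaults quoted earlier in this answer. The counters are only a heuristic; the Failed condition on the job remains the controller's own verdict.

```go
package main

import "fmt"

// JobSpec and JobStatus are simplified, hypothetical stand-ins for the
// batch/v1 types; only the fields used in this sketch are included.
type JobSpec struct {
	Completions  *int32
	BackoffLimit *int32
}

type JobStatus struct {
	Succeeded int32
	Failed    int32
}

// valueOr returns *p, or def when p is nil, so unset spec fields do not
// cause a nil-pointer dereference.
func valueOr(p *int32, def int32) int32 {
	if p == nil {
		return def
	}
	return *p
}

// jobOutcome applies the counter comparison described in the answer.
// "running" is an assumed label for jobs that match neither branch.
func jobOutcome(spec JobSpec, status JobStatus) string {
	completions := valueOr(spec.Completions, 1)   // default quoted in the answer
	backoffLimit := valueOr(spec.BackoffLimit, 6) // default quoted in the answer
	switch {
	case status.Succeeded >= completions:
		return "completed"
	case status.Failed >= backoffLimit:
		return "failed"
	default:
		return "running"
	}
}

func main() {
	completions, limit := int32(1), int32(6)
	spec := JobSpec{Completions: &completions, BackoffLimit: &limit}
	fmt.Println(jobOutcome(spec, JobStatus{Failed: 6}))
	fmt.Println(jobOutcome(spec, JobStatus{Succeeded: 1}))
	fmt.Println(jobOutcome(spec, JobStatus{Failed: 2}))
}
```

With the values in main, this prints "failed", "completed", and "running" in that order.
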
-- Abdullah Al Maruf - Tuhin
Source: StackOverflow