Understanding backoffLimit in Kubernetes Jobs

2/22/2019

I’ve created a CronJob in Kubernetes with the schedule 8 * * * * (minute 8 of every hour), with the Job’s backoffLimit left at its default of 6 and the pod’s restartPolicy set to Never; the pods are deliberately configured to fail. As I understand it (for a pod spec with restartPolicy: Never), the Job controller will try to create backoffLimit pods and then mark the Job as Failed, so I expected to see 6 pods in the Error state.
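For reference, a minimal manifest matching this description might look as follows (the name, image and failing command are placeholders of mine; backoffLimit is left unset so it defaults to 6):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: failing-cron              # placeholder name
spec:
  schedule: "8 * * * *"           # minute 8 of every hour
  jobTemplate:
    spec:                         # backoffLimit unset, so it defaults to 6
      template:
        spec:
          containers:
          - name: failing-cron
            image: busybox            # placeholder image
            command: ["/bin/false"]   # always exits non-zero, so the pod fails
          restartPolicy: Never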

This is the actual Job’s status:

status:
  conditions:
  - lastProbeTime: 2019-02-20T05:11:58Z
    lastTransitionTime: 2019-02-20T05:11:58Z
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 5

Why were there only 5 failed pods instead of 6? Or is my understanding of backoffLimit incorrect?

-- goutham
kubernetes

2 Answers

2/14/2020

Use spec.backoffLimit to specify the number of retries before the Job is considered failed. The back-off limit defaults to 6.
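For example, in a standalone Job the field sits directly under spec (a minimal sketch; name, image and command are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: retry-demo                # placeholder name
spec:
  backoffLimit: 4                 # fail the Job after 4 retries instead of the default 6
  template:
    spec:
      containers:
      - name: retry-demo
        image: busybox            # placeholder image
        command: ["/bin/false"]   # always fails
      restartPolicy: Never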

-- Nilay Tiwari
Source: StackOverflow

2/26/2019

In short: you might not be seeing all of the created pods because the schedule period of the CronJob is too short.

As described in the documentation:

Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s …) capped at six minutes. The back-off count is reset if no new failed Pods appear before the Job’s next status check.

If a new job is scheduled before the Job controller has had a chance to recreate a pod (keeping in mind the delay after the previous failure), the Job controller starts counting from one again. Note that the exponential delays add up quickly (10 + 20 + 40 + 80 + 160 seconds is already more than five minutes), so a short schedule period can easily overlap a job that is still retrying.
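You can watch these delays in the cluster's event stream, for example (a sketch, assuming kubectl access to the namespace the jobs run in):

# list events oldest-first and keep only the Job controller's entries
kubectl get events --sort-by=.lastTimestamp | grep job-controller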

I reproduced your issue on GKE using the following .yaml:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hellocron
spec:
  schedule: "*/3 * * * *" #Runs every 3 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hellocron
            image: busybox
            args:
            - /bin/cat
            - /etc/os
          restartPolicy: Never
      backoffLimit: 6
  suspend: false

This job will fail because the file /etc/os doesn't exist.
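To reproduce, apply the manifest and let it run for a few schedule periods (these commands assume the file is saved as hellocron.yaml):

kubectl apply -f hellocron.yaml
# after a while, inspect one of the spawned jobs
kubectl get jobs
kubectl describe job <job-name>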

And here is the output of kubectl describe for one of the jobs:

Name:           hellocron-1551194280
Namespace:      default
Selector:       controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
Labels:         controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
                job-name=hellocron-1551194280
Annotations:    <none>
Controlled By:  CronJob/hellocron
Parallelism:    1
Completions:    1
Start Time:     Tue, 26 Feb 2019 16:18:07 +0100
Pods Statuses:  0 Running / 0 Succeeded / 6 Failed
Pod Template:
  Labels:  controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
           job-name=hellocron-1551194280
  Containers:
   hellocron:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Args:
      /bin/cat
      /etc/os
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-4lf6h
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-85khk
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-wrktb
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-6942s
  Normal   SuccessfulCreate      25m   job-controller  Created pod: hellocron-1551194280-662zv
  Normal   SuccessfulCreate      22m   job-controller  Created pod: hellocron-1551194280-6c6rh
  Warning  BackoffLimitExceeded  17m   job-controller  Job has reached the specified backoff limit

Note the delay between the creation of pods hellocron-1551194280-662zv and hellocron-1551194280-6c6rh.
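The gaps are easier to read if you list the pods of that job in creation order (hypothetical invocation for the job above):

kubectl get pods -l job-name=hellocron-1551194280 --sort-by=.metadata.creationTimestamp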

-- MWZ
Source: StackOverflow