How do I make sure my cronjob job does NOT retry on failure?

4/22/2020

I have a Kubernetes Cronjob that runs on GKE and runs Cucumber JVM tests. In case a Step fails due to assertion failure, some resource being unavailable, etc., Cucumber rightly throws an exception which leads the Cronjob job to fail and the Kubernetes pod's status changes to ERROR. This leads to creation of a new pod that tries to run the same Cucumber tests again, which fails again and retries again.

I don't want any of these retries to happen. If a Cronjob job fails, I want it to remain in the failed status and not retry at all. Based on this, I have already tried setting backoffLimit: 0 in combination with restartPolicy: Never in combination with concurrencyPolicy: Forbid, but it still retries by creating new pods and running the tests again.

What am I missing? Here's my kube manifest for the Cronjob:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: quality-apatha
  namespace: default
  labels:
    app: quality-apatha
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          containers:
            - name: quality-apatha
              image: FOO-IMAGE-PATH
              imagePullPolicy: "Always"
              resources:
                limits:
                  cpu: 500m
                  memory: 512Mi
              env:
                - name: FOO
                  value: BAR
              volumeMounts:
                - name: FOO
                  mountPath: BAR
              args:
                - java
                - -cp
                - qe_java.job.jar:qe_java-1.0-SNAPSHOT-tests.jar
                - org.junit.runner.JUnitCore
                - com.liveramp.qe_java.RunCucumberTest
          restartPolicy: Never
          volumes:
            - name: FOO
              secret:
                secretName: BAR

Is there any other Kubernetes Kind I can use to stop the retrying?

Thank you!

-- Core_Dumped
cucumber-jvm
google-kubernetes-engine
kubernetes
kubernetes-cronjob
kubernetes-pod

1 Answer

4/22/2020

To keep things as simple as possible, I tested this using the CronJob example from the official Kubernetes documentation, with minor modifications to illustrate what really happens in different scenarios.

I can confirm that when backoffLimit is set to 0 and restartPolicy to Never, everything works exactly as expected and there are no retries. Note that each scheduled run of your Job, which in your example happens every 60 seconds (schedule: "*/1 * * * *"), IS NOT considered a retry.

Let's take a closer look at the following example (based on the example YAML from the official documentation):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - non-existing-command
          restartPolicy: Never

It spawns a new Job every 60 seconds according to the schedule, regardless of whether the previous one failed or succeeded. In this particular example it is configured to fail, because we are trying to run non-existing-command.

You can check what's happening by running:

$ kubectl get pods
NAME                     READY   STATUS              RESTARTS   AGE
hello-1587558720-pgqq9   0/1     Error               0          61s
hello-1587558780-gpzxl   0/1     ContainerCreating   0          1s

As you can see, there are no retries. Although the first Pod failed, the new one is spawned exactly 60 seconds later, according to our schedule. I'd like to emphasize it again: this is not a retry.

On the other hand, when we modify the above example and set backoffLimit: 3, we can observe the retries. As you can see below, new Pods are now created much more often than every 60 seconds. These are retries.

$ kubectl get pods
NAME                     READY   STATUS   RESTARTS   AGE
hello-1587565260-7db6j   0/1     Error    0          106s
hello-1587565260-tcqhv   0/1     Error    0          104s
hello-1587565260-vnbcl   0/1     Error    0          94s
hello-1587565320-7nc6z   0/1     Error    0          44s
hello-1587565320-l4p8r   0/1     Error    0          14s
hello-1587565320-mjnb6   0/1     Error    0          46s
hello-1587565320-wqbm2   0/1     Error    0          34s

What we can see above are 3 Pod creation attempts related to the hello-1587565260 Job and 4 attempts (the original first try, which is not counted in backoffLimit: 3, plus 3 retries) related to the hello-1587565320 Job.
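One way to tell retries from scheduled runs in a listing like the one above is to count Pods per Job: more than one Pod with the same Job prefix means that Job was retried. The following is only a sketch. It embeds the sample output from above in a variable (the variable names pods and counts are illustrative), and on a live cluster you would pipe kubectl get pods --no-headers into the same awk program instead.

```shell
# Sample Pod listing copied from the output above (header line omitted).
pods='hello-1587565260-7db6j   0/1     Error    0          106s
hello-1587565260-tcqhv   0/1     Error    0          104s
hello-1587565260-vnbcl   0/1     Error    0          94s
hello-1587565320-7nc6z   0/1     Error    0          44s
hello-1587565320-l4p8r   0/1     Error    0          14s
hello-1587565320-mjnb6   0/1     Error    0          46s
hello-1587565320-wqbm2   0/1     Error    0          34s'

# Strip the random Pod suffix to recover the Job name, then count Pods per Job.
counts=$(printf '%s\n' "$pods" \
  | awk '{ sub(/-[^-]+$/, "", $1); n[$1]++ } END { for (j in n) print j, n[j] }' \
  | sort)

# Print one line per Job: "<job-name> <number of Pods>".
printf '%s\n' "$counts"
```

With backoffLimit: 0, every Job in such a summary should show a count of 1.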

As you can see, the Jobs themselves are still run according to the schedule, at 60-second intervals:

$ kubectl get jobs
NAME               COMPLETIONS   DURATION   AGE
hello-1587565260   0/1           2m12s      2m12s
hello-1587565320   0/1           72s        72s
hello-1587565380   0/1           11s        11s

However, because backoffLimit is set to 3 this time, each Job whose Pod fails is retried up to 3 more times before the Job is marked as failed.

I hope this helps to dispel any confusion about running CronJobs in Kubernetes.

If you are instead interested in running something just once, not at regular intervals, take a look at a simple Job instead of a CronJob.
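As a rough sketch of what such a one-off Job could look like, here is the same failing busybox container wrapped in a plain Job with the same no-retry settings. The name hello-once is made up for illustration, and the rest mirrors the CronJob example above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-once               # hypothetical name, not from the original example
spec:
  backoffLimit: 0                # no retries after the first failed Pod
  template:
    spec:
      containers:
      - name: hello
        image: busybox
        args:
        - /bin/sh
        - -c
        - non-existing-command   # fails on purpose, as in the CronJob example
      restartPolicy: Never       # do not restart the failed container in place
```

A Job like this runs once when it is applied and is not started again on its own.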

Also consider changing your cron schedule if you still want to run this particular job on a regular basis but, let's say, once every 24 hours rather than every minute.
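For instance, a daily run could be expressed with a schedule like the one in this fragment. The time of day (midnight) is an arbitrary assumption, and only the schedule field differs from the manifests above.

```yaml
spec:
  schedule: "0 0 * * *"   # minute 0, hour 0, every day: one run per 24 hours (assumed time)
  jobTemplate:
    spec:
      backoffLimit: 0     # still no retries within each daily run
```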

-- mario
Source: StackOverflow