Kubernetes CronJob Stops Scheduling Jobs

4/23/2019

Not sure what I am doing wrong, but I am experiencing an issue where CronJobs stop scheduling new Jobs. It seems this happens only after a couple of failed attempts to launch a new Job. In my specific case, Jobs could not start due to an inability to pull the container image.

I'm not really finding any settings that would lead to this, but I'm no expert on Kubernetes CronJobs. Configuration below:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  labels:
    app.kubernetes.io/instance: cron-deal-report
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: cron
    helm.sh/chart: cron-0.1.0
  name: cron-deal-report
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        spec:
          containers:
          - args:
            - -c
            - npm run script
            command:
            - /bin/sh
            env:
            image: <redacted>
            imagePullPolicy: Always
            name: cron
            resources: {}
            securityContext:
              runAsUser: 1000
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: 0/15 * * * *
  successfulJobsHistoryLimit: 3
  suspend: false
status: {}
-- Randy L
kubernetes

1 Answer

4/24/2019

How Kubernetes Jobs handle failures

As per Jobs - Run to Completion - Handling Pod and Container Failures:

An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node (node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the .spec.template.spec.restartPolicy = "Never". When a Pod fails, then the Job controller starts a new Pod.

Your jobTemplate uses restartPolicy: Never, so the Pod backoff failure policy applies. From the same page:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. The back-off count is reset if no new failed Pods appear before the Job’s next status check.

The .spec.backoffLimit is not defined in your jobTemplate, so it uses the default of 6.
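For example, you can set the retry budget explicitly in the jobTemplate (the value 10 below is purely illustrative):

```yaml
spec:
  jobTemplate:
    spec:
      backoffLimit: 10  # Pod retries before the Job is marked failed (default: 6)
```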

Next, as per Job Termination and Cleanup:

By default, a Job will run uninterrupted unless a Pod fails, at which point the Job defers to the .spec.backoffLimit described above. Another way to terminate a Job is by setting an active deadline. Do this by setting the .spec.activeDeadlineSeconds field of the Job to a number of seconds.

That's your case: if your Pods fail to pull the image six consecutive times, the Job is considered failed.
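If you would rather bound a Job by wall-clock time instead of by retry count, the activeDeadlineSeconds field quoted above goes in the same place (the value here is illustrative):

```yaml
spec:
  jobTemplate:
    spec:
      activeDeadlineSeconds: 600  # terminate the Job and mark it failed after 10 minutes
```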


CronJobs

As per Cron Job Limitations:

A cron job creates a job object about once per execution time of its schedule [...]. The Cronjob is only responsible for creating Jobs that match its schedule, and the Job in turn is responsible for the management of the Pods it represents.

This means that all Pod/container failures should be handled by the Job controller (i.e., by adjusting the jobTemplate).

"Retrying" a Job:

You do not need to recreate a CronJob if one of its Jobs fails. You only need to wait for the next scheduled run.

If you want to run a new Job before the next scheduled run, you can create one manually from the CronJob template:

kubectl create job --from=cronjob/my-cronjob-name my-manually-job-name

What you should do:

If your containers consistently fail to pull the image, you have the following options:

  • Explicitly set backoffLimit and tune it to a higher value.
  • Use restartPolicy: OnFailure for your Pods, so the Pod stays on the node and only the container is re-run.
  • Consider using imagePullPolicy: IfNotPresent. If you are not retagging your images, there is no need to force a re-pull for every job start.
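Applied to the manifest in the question, the relevant changes might look like this sketch (the backoffLimit value is illustrative, and the container image stays as in your config):

```yaml
spec:
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 10  # illustrative: allow more retries before the Job is marked failed
      template:
        spec:
          containers:
          - name: cron
            image: <redacted>
            imagePullPolicy: IfNotPresent  # skip the forced re-pull if you do not retag images
            command:
            - /bin/sh
            args:
            - -c
            - npm run script
          restartPolicy: OnFailure  # re-run only the container instead of failing the Pod
  schedule: "0/15 * * * *"
```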
-- Eduardo Baitello
Source: StackOverflow