Not sure what I am doing wrong, but I am experiencing an issue where CronJobs stop scheduling new Jobs. This seems to happen only after a couple of failures to launch a new Job. In my specific case, the Jobs could not start due to an inability to pull the container image.
I can't find any settings that would explain this, but I'm no expert on Kubernetes CronJobs. Configuration below:
```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  labels:
    app.kubernetes.io/instance: cron-deal-report
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: cron
    helm.sh/chart: cron-0.1.0
  name: cron-deal-report
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        spec:
          containers:
          - args:
            - -c
            - npm run script
            command:
            - /bin/sh
            env:
            image: <redacted>
            imagePullPolicy: Always
            name: cron
            resources: {}
            securityContext:
              runAsUser: 1000
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: 0/15 * * * *
  successfulJobsHistoryLimit: 3
  suspend: false
status: {}
```
As per Jobs - Run to Completion - Handling Pod and Container Failures:

> An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node (node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the `.spec.template.spec.restartPolicy = "Never"`. When a Pod fails, then the Job controller starts a new Pod.
You are using `restartPolicy: Never` in your `jobTemplate`, so see the next quote on Pod backoff failure policy:
> There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set `.spec.backoffLimit` to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. The back-off count is reset if no new failed Pods appear before the Job's next status check.
`.spec.backoffLimit` is not defined in your `jobTemplate`, so it uses the default (`6`).
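If you want the Job to tolerate more pull failures before giving up, you could set the limit explicitly in the Job spec. A minimal sketch (the value `10` is just an illustration, not a recommendation):

```yaml
jobTemplate:
  spec:
    backoffLimit: 10   # default is 6; raise it to allow more retries before the Job is marked failed
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: cron
          image: <redacted>
```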
Following, as per Job Termination and Cleanup:

> By default, a Job will run uninterrupted unless a Pod fails, at which point the Job defers to the `.spec.backoffLimit` described above. Another way to terminate a Job is by setting an active deadline. Do this by setting the `.spec.activeDeadlineSeconds` field of the Job to a number of seconds.
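If you wanted a wall-clock cutoff instead of (or in addition to) a retry count, a deadline could be sketched like this (600 seconds is an arbitrary example value):

```yaml
jobTemplate:
  spec:
    activeDeadlineSeconds: 600   # terminate the Job if it runs longer than 10 minutes, regardless of retries
    backoffLimit: 6
```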
That's your case: if your containers fail to pull the image six consecutive times, the Job will be considered failed.
As per Cron Job Limitations:

> A cron job creates a job object about once per execution time of its schedule [...]. The Cronjob is only responsible for creating Jobs that match its schedule, and the Job in turn is responsible for the management of the Pods it represents.
This means that all pod/container failures should be handled by the Job controller (i.e., by adjusting the `jobTemplate`).
"Retrying" a Job:

You do not need to recreate a CronJob if one of its Jobs fails; you only need to wait for the next schedule.

If you want to run a new Job before the next scheduled time, you can use the CronJob template to create a Job manually:

```
kubectl create job --from=cronjob/my-cronjob-name my-manually-job-name
```
If your containers constantly fail to download the image, you have the following options:

- Set `backoffLimit` to a higher value.
- Use `restartPolicy: OnFailure` for your containers, so the Pod will stay on the node and only the container will be re-run.
- Set `imagePullPolicy: IfNotPresent`. If you are not retagging your images, there is no need to force a re-pull on every Job start.
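Put together, the relevant part of the `jobTemplate` could look like the sketch below (the `backoffLimit` value is illustrative, and the container fields are copied from the question's manifest):

```yaml
jobTemplate:
  spec:
    backoffLimit: 10                     # option 1: more retries before the Job is marked failed
    template:
      spec:
        restartPolicy: OnFailure         # option 2: re-run the container in place instead of replacing the Pod
        containers:
        - name: cron
          image: <redacted>
          imagePullPolicy: IfNotPresent  # option 3: skip the pull if the image is already on the node
          command: ["/bin/sh"]
          args: ["-c", "npm run script"]
```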