Kubernetes: change backoffLimit default value

10/21/2021

Is it possible to configure backoffLimit globally (for example, to change the default limit from 6 to 2 for all Jobs in the cluster, without specifying backoffLimit: 2 on each Job)?

-- Natalia Efimtseva
kubernetes

2 Answers

10/21/2021

It seems that the default values, including .spec.backoffLimit, are hardcoded directly into the Kubernetes source.

From apis/batch/v1/defaults.go

func SetDefaults_Job(obj *batchv1.Job) {
	// For a non-parallel job, you can leave both `.spec.completions` and
	// `.spec.parallelism` unset.  When both are unset, both are defaulted to 1.
	if obj.Spec.Completions == nil && obj.Spec.Parallelism == nil {
		obj.Spec.Completions = utilpointer.Int32Ptr(1)
		obj.Spec.Parallelism = utilpointer.Int32Ptr(1)
	}
	if obj.Spec.Parallelism == nil {
		obj.Spec.Parallelism = utilpointer.Int32Ptr(1)
	}
	if obj.Spec.BackoffLimit == nil {
		obj.Spec.BackoffLimit = utilpointer.Int32Ptr(6)
	}
	labels := obj.Spec.Template.Labels
	if labels != nil && len(obj.Labels) == 0 {
		obj.Labels = labels
	}
	if utilfeature.DefaultFeatureGate.Enabled(features.IndexedJob) && obj.Spec.CompletionMode == nil {
		mode := batchv1.NonIndexedCompletion
		obj.Spec.CompletionMode = &mode
	}
	if utilfeature.DefaultFeatureGate.Enabled(features.SuspendJob) && obj.Spec.Suspend == nil {
		obj.Spec.Suspend = utilpointer.BoolPtr(false)
	}
}

So, at the moment, I don't think the default can be changed without modifying the code.
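Since there is no cluster-wide setting, the limit has to be set on each Job spec individually, as the question suggests. A minimal sketch of what that looks like (the Job name, image, and command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job        # placeholder name
spec:
  backoffLimit: 2          # overrides the hardcoded default of 6
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox     # placeholder image
        command: ["sh", "-c", "exit 0"]
```

Tooling such as a templating or patching layer could stamp this field onto every Job manifest before it reaches the cluster, but the API server itself will always default an unset backoffLimit to 6.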

-- AndD
Source: StackOverflow

10/21/2021

No, it's not possible, since backoffLimit is configured per Job in its spec, as per the official documentation:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.

-- Jakub Siemaszko
Source: StackOverflow