Kubernetes Pod backoff failure policy
From the k8s documentation:
There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6.
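For reference, here is a minimal sketch of a plain Kubernetes Job manifest that sets this field directly. The Job name, image, and command are hypothetical placeholders, not anything generated by SCDF; it only shows where backoffLimit sits in the Job spec.

```yaml
# Hypothetical Job for illustration only; name, image and command are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: fail-fast-example
spec:
  backoffLimit: 1            # number of retries before the Job is considered failed (default: 6)
  template:
    spec:
      restartPolicy: Never   # failed Pods are not restarted in place; retries are counted at the Job level
      containers:
        - name: task
          image: busybox
          command: ["sh", "-c", "exit 1"]   # always fails, to exercise the back-off limit
```

Note that the field lives under the Job's top-level spec, not under the Pod template, so it has to be set by whatever creates the Job object.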
Spring Cloud Data Flow:
When a job has failed, we actually don't want a retry. In other words, we want to set backoffLimit: 1 in our Spring Cloud Data Flow config file.
We have tried to set it like the following:
deployer.kubernetes.spec.backoffLimit: 1
or even
deployer.kubernetes.backoffLimit: 1
But neither is transmitted to our Kubernetes cluster.
After 6 tries, we see the following message:
status:
  conditions:
    - lastProbeTime: '2019-10-22T17:45:46Z'
      lastTransitionTime: '2019-10-22T17:45:46Z'
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: 'True'
      type: Failed
  failed: 6
  startTime: '2019-10-22T17:33:01Z'
Actually, we want to fail fast (1 or 2 tries maximum).
Question: How can we properly set this property so that every task triggered by SCDF fails at most once on Kubernetes?
Update (23.10.2019)
We have also tried the property:
deployer:
  kubernetes:
    maxCrashLoopBackOffRestarts: Never # No retry for failed tasks
But the jobs are still failing 6 times instead of 1.
Update (26.10.2019)
For completeness' sake, here is the YAML config snippet taken from the running pod:
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
In the official documentation, it says:
`maxCrashLoopBackOffRestarts` - Maximum allowed restarts for app that is in a CrashLoopBackOff. Values are `Always`, `IfNotPresent`, `Never`
But maxCrashLoopBackOffRestarts takes an integer, so I guess the documentation is not accurate.
The pod is then restarted 6 times.
I have tried setting the following properties, without success:
spring.cloud.dataflow.task.platform.kubernetes.accounts.defaults.maxCrashLoopBackOffRestarts: 0
spring.cloud.deployer.kubernetes.maxCrashLoopBackOffRestarts: 0
spring.cloud.scheduler.kubernetes.maxCrashLoopBackOffRestarts: 0
None of those has worked.
Any idea?
To override the default restart limit, you'd have to use SCDF's maxCrashLoopBackOffRestarts deployer property. All of the supported properties are documented in the ref. guide.
You can override this property "globally" in SCDF, or override it individually at each stream/task deployment level. More info here.
Thanks to ilayaperumalg it's much clearer why it's not working:
It looks like the property maxCrashLoopBackOffRestarts is applicable for determining the status of the runtime application instance while the property you refer to as backoffLimit is applicable to the JobSpec which is currently not being supported. We can add this as a feature to support your case.