Spring Cloud Dataflow with Kubernetes: BackoffLimit

10/22/2019

Kubernetes Pod backoff failure policy

From the k8s documentation:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6.
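To make this concrete, a minimal Job manifest that sets the field explicitly might look like the following sketch (the job name, image, and command are placeholders for illustration, not taken from our setup):

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: fail-fast-demo              # placeholder name
  spec:
    backoffLimit: 1                   # fail the Job after a single failed attempt
    template:
      spec:
        restartPolicy: Never          # let the Job controller count pod failures
        containers:
        - name: main
          image: busybox              # placeholder image
          command: ["sh", "-c", "exit 1"]   # always fails, to exercise the limit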

Spring Cloud Dataflow:

When a job fails, we actually don't want a retry. In other words, we want to set backoffLimit: 1 in our Spring Cloud Dataflow config file.

We have tried to set it like the following:

deployer.kubernetes.spec.backoffLimit: 1

or even

deployer.kubernetes.backoffLimit: 1

But neither is transmitted to our Kubernetes cluster.

After 6 tries, we see the following message:

  status:
    conditions:
    - lastProbeTime: '2019-10-22T17:45:46Z'
      lastTransitionTime: '2019-10-22T17:45:46Z'
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: 'True'
      type: Failed
    failed: 6
    startTime: '2019-10-22T17:33:01Z'

Actually, we want to fail fast (one or two tries at most).

Question: How can we properly set this property so that all tasks triggered by SCDF fail at most once on Kubernetes?

Update (23.10.2019)

We have also tried the property:

deployer:
   kubernetes:
      maxCrashLoopBackOffRestarts: Never # No retry for failed tasks

But the jobs are still failing 6 times instead of 1.

Update (26.10.2019)

For completeness' sake:

  1. I am scheduling a task in SCDF.
  2. The task is triggered on Kubernetes (more specifically, OpenShift).
  3. When I check the configuration on the K8s platform, I see that it still has a backoffLimit of 6 instead of 1:

YAML config snippet taken from the running pod:

  spec:
    backoffLimit: 6
    completions: 1
    parallelism: 1
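(This spec can be retrieved with a command along the lines of `kubectl get job <job-name> -o yaml`; the exact job name is generated by SCDF.)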

In the official documentation, it says:

`maxCrashLoopBackOffRestarts` - Maximum allowed restarts for app that is in a CrashLoopBackOff. Values are `Always`, `IfNotPresent`, `Never`

But `maxCrashLoopBackOffRestarts` takes an integer, so I guess the documentation is not accurate.

The pod is then restarted 6 times.

I have tried to set the following properties, unsuccessfully:

spring.cloud.dataflow.task.platform.kubernetes.accounts.defaults.maxCrashLoopBackOffRestarts: 0
spring.cloud.deployer.kubernetes.maxCrashLoopBackOffRestarts: 0
spring.cloud.scheduler.kubernetes.maxCrashLoopBackOffRestarts: 0

None of those has worked.
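For reference, the first of those properties expressed in yaml form in the SCDF server configuration would look like the sketch below; the account name `default` is an assumption and may differ in our setup:

  spring:
    cloud:
      dataflow:
        task:
          platform:
            kubernetes:
              accounts:
                default:
                  maxCrashLoopBackOffRestarts: 0   # assumed account name 'default'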

Any ideas?

-- KeyMaker00
kubernetes
spring
spring-cloud-dataflow

2 Answers

10/22/2019

To override the default restart limit, you'd have to use SCDF's `maxCrashLoopBackOffRestarts` deployer property. All of the supported properties are documented in the reference guide.

You can override this property "globally" in SCDF, or individually at each stream/task deployment level. More info here.
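As a sketch of the per-deployment variant (the task name `timestamp` is only an example), the property can be passed as a deployment property when launching the task, e.g. via the SCDF shell:

  task launch timestamp --properties "deployer.timestamp.kubernetes.maxCrashLoopBackOffRestarts=1"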

-- Sabby Anandan
Source: StackOverflow

11/1/2019

Thanks to ilayaperumalg, it's much clearer why it's not working:

It looks like the property `maxCrashLoopBackOffRestarts` is applicable for determining the status of the runtime application instance, while the property you refer to, `backoffLimit`, applies to the JobSpec, which is currently not supported. We can add this as a feature to support your case.

GitHub link

-- KeyMaker00
Source: StackOverflow