Kubernetes Spark Operator: Configuring Automatic Application Restart and Failure Handling

3/8/2021

With the Spark Operator on Kubernetes, a restart policy can be configured with the optional field .spec.restartPolicy, which can be set to type: OnFailure. I read here that there is a cap of 300 seconds (5 minutes) on the exponential back-off delay before a failed pod is restarted.

My first point of confusion: does this 300-second cap apply only to the default configuration, or does it also affect a configuration like the one below?

I am also wondering whether increasing the number of retries, for example onFailureRetries: 6 with onFailureRetryInterval: 9 (keeping the 300-second cap in mind), makes sense given the pressure it puts on cluster resources. Is there a resource that helps decide which configuration is best, or does this come down to experience and trying values to see what makes sense for my cluster?

restartPolicy:
  type: OnFailure
  onFailureRetries: 3
  onFailureRetryInterval: 10
  onSubmissionFailureRetries: 5
  onSubmissionFailureRetryInterval: 20
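
For context, this block sits under spec in a SparkApplication manifest. The following is only a minimal sketch, assuming the operator's v1beta2 API; the name, namespace, image, main class, and jar path are hypothetical placeholders, and the driver and executor sections are left out for brevity.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2   # assumed API version
kind: SparkApplication
metadata:
  name: example-app                        # hypothetical name
  namespace: default                       # hypothetical namespace
spec:
  type: Scala
  mode: cluster
  image: example/spark:3.0.0               # hypothetical image
  mainClass: org.example.Main              # hypothetical main class
  mainApplicationFile: local:///opt/spark/jars/example.jar   # hypothetical path
  sparkVersion: "3.0.0"
  # Application-level restart policy handled by the operator.
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3                    # retries after a failed run
    onFailureRetryInterval: 10             # seconds between those retries
    onSubmissionFailureRetries: 5          # retries after a failed submission
    onSubmissionFailureRetryInterval: 20   # seconds between those retries
  # driver and executor sections omitted from this sketch
```

Note that this restartPolicy is an object on the SparkApplication resource, which is a different field from the plain-string restartPolicy (Always, OnFailure, Never) on a Pod spec.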
-- Azeem
kubernetes
spark-operator

0 Answers