Spark property that kills the Spark application after a certain number of failed executors

11/24/2020

I have a Spark application running with a driver and two executors. The executors keep failing, and new ones are created endlessly. I'm looking for a way to tell the Spark Operator (through a Spark property, maybe?) to stop retrying and to permanently fail the Spark application after a few executors are lost.

At first I thought "spark.task.maxFailures": "2" would help me, but it doesn't actually fail the application.
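For reference, a sketch (with placeholder values) of how that property would be passed at submit time against Kubernetes; note that per the Spark configuration docs, spark.task.maxFailures counts failures of a particular task, not lost executors, which may be why it doesn't fail the application here:

```shell
# Sketch with placeholder values; spark.task.maxFailures governs retries
# of a single task, not the number of executors the app may lose.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.task.maxFailures=2 \
  --conf spark.kubernetes.container.image=<image> \
  local:///opt/spark/jars/app.jar
```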

We were previously on Spark 2.4.4, and the application did not keep spawning new executors indefinitely like this. We only started seeing this behavior after upgrading to Spark 3.0.1.

Update: adding output showing how the pods fail and are recreated, from kubectl get pods -w:

NAME                                           READY   STATUS      RESTARTS   AGE
bats-4ec00395a68a46939381e78e5545198b-driver   0/1     Completed   0          7h51m
bats-4ec00395a68a46939381e78e5545198b-driver   0/1     Terminating   0          8h
bats-4ec00395a68a46939381e78e5545198b-driver   0/1     Terminating   0          8h
bats-4ec00395a68a46939381e78e5545198b-driver   0/1     Pending       0          0s
bats-4ec00395a68a46939381e78e5545198b-driver   0/1     Pending       0          0s
bats-4ec00395a68a46939381e78e5545198b-driver   0/1     ContainerCreating   0          0s
bats-4ec00395a68a46939381e78e5545198b-driver   1/1     Running             0          4s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-1   0/1     Pending             0          1s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-1   0/1     Pending             0          1s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-2   0/1     Pending             0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-2   0/1     Pending             0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-1   0/1     ContainerCreating   0          1s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-2   0/1     ContainerCreating   0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-2   1/1     Running             0          3s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-1   1/1     Running             0          5s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-2   0/1     OOMKilled           0          5m
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-2   0/1     Terminating         0          5m30s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-2   0/1     Terminating         0          5m30s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-3   0/1     Pending             0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-3   0/1     Pending             0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-3   0/1     ContainerCreating   0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-3   1/1     Running             0          3s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-1   0/1     OOMKilled           0          9m50s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-1   0/1     Terminating         0          10m
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-1   0/1     Terminating         0          10m
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-4   0/1     Pending             0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-4   0/1     Pending             0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-4   0/1     ContainerCreating   0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-4   1/1     Running             0          4s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-3   0/1     OOMKilled           0          5m13s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-3   0/1     Terminating         0          5m30s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-3   0/1     Terminating         0          5m30s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-5   0/1     Pending             0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-5   0/1     Pending             0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-5   0/1     ContainerCreating   0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-5   1/1     Running             0          4s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-4   0/1     OOMKilled           0          4m37s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-4   0/1     Terminating         0          5m
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-4   0/1     Terminating         0          5m
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-6   0/1     Pending             0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-6   0/1     Pending             0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-6   0/1     ContainerCreating   0          0s
4ec00395a68a46939381e78e5545198b-bba8d875fb41e88d-exec-6   1/1     Running             0          4s

This goes on until around 30-40 executors have been recreated, then the driver restarts and the same process unfolds again. I have set the Spark application's

restartPolicy:
  type: Never
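
For context, a minimal sketch of where that sits in the SparkApplication manifest (field names from the spark-on-k8s-operator v1beta2 API; the application name is illustrative):

```yaml
# Sketch: relevant fragment of the SparkApplication spec.
# restartPolicy applies to the application as a whole, not to
# individual executor pods, which the driver replaces on its own.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: bats
spec:
  restartPolicy:
    type: Never
```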
-- ak1984
apache-spark
kubernetes

0 Answers