Spark on Kubernetes: Is it possible to keep the crashed pods when a job fails?

6/4/2020

I have a strange problem: a Spark job run on Kubernetes fails with a lot of "Missing an output location for shuffle X" errors in stages where there is a lot of shuffling going on. Increasing executor memory does not help. The same job runs fine on just a single node of the Kubernetes cluster in local[*] mode, however, so I suspect it has something to do with Kubernetes or the underlying Docker. When an executor dies, its pod is deleted immediately, so I cannot track down why it failed. Is there an option that keeps failed pods around so I can view their logs?

-- rabejens
apache-spark
kubernetes

3 Answers

8/13/2021

There is a deleteOnTermination setting in the Spark application YAML. See the spark-on-kubernetes README.md.

deleteOnTermination - (Optional) DeleteOnTermination specifies whether executor pods should be deleted in case of failure or normal termination. Maps to spark.kubernetes.executor.deleteOnTermination, which is available since Spark 3.0.
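
As an illustration only, a minimal sketch of how that field might be set in a SparkApplication manifest for the Spark operator; the metadata name, image, class, and jar path are placeholders, not values from the question:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: shuffle-heavy-job                 # placeholder name
spec:
  type: Scala                             # placeholder; adjust to your job
  mode: cluster
  image: "my-registry/spark:3.0.1"        # placeholder image
  mainClass: com.example.ShuffleJob       # placeholder class
  mainApplicationFile: "local:///opt/spark/jars/job.jar"   # placeholder path
  driver:
    cores: 1
    memory: "2g"
  executor:
    instances: 4
    memory: "4g"
    deleteOnTermination: false            # keep executor pods after failure or termination

If you submit with plain spark-submit instead of the operator, the equivalent is passing --conf spark.kubernetes.executor.deleteOnTermination=false.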

-- Sparkles
Source: StackOverflow

6/4/2020

You can view the logs of the previous (terminated) instance of a pod's container like this:

kubectl logs -p <terminated pod name>

Also, you can use the spec.ttlSecondsAfterFinished field of a Job, as mentioned here.
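
For reference, a minimal Job manifest sketch using that field; the name, image, and command are placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: spark-submit-runner               # placeholder name
spec:
  ttlSecondsAfterFinished: 3600           # keep the finished Job and its pods for one hour
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: submit                      # placeholder container name
        image: my-registry/spark:3.0.1    # placeholder image
        command: ["spark-submit", "--version"]   # placeholder command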

-- Arghya Sadhu
Source: StackOverflow

6/7/2020

Executor pods are deleted by default on any failure, and you cannot change that unless you customize the Spark on K8s code or use some advanced K8s tooling. What you can do (and it is probably the easiest approach to start with) is configure an external log collector, e.g. Grafana Loki, which can be deployed with one click to any K8s cluster, or some ELK stack components. These will persist the logs even after the pods are deleted.
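
For instance, a minimal sketch of installing Loki with Helm; the release name and namespace are placeholders, and it assumes the Grafana Helm chart repository:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# loki-stack bundles Loki plus Promtail, which tails pod logs and ships them to Loki
helm install loki grafana/loki-stack --namespace logging --create-namespace \
  --set promtail.enabled=true

Once the logs are in Loki, they can be queried from Grafana even after the executor pods have been deleted.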

-- Aliaksandr Sasnouskikh
Source: StackOverflow