I have the strange problem that a Spark job run on Kubernetes fails with a lot of "Missing an output location for shuffle X" errors in jobs where there is a lot of shuffling going on. Increasing executor memory does not help. The same job run on just a single node of the Kubernetes cluster in local[*] mode runs fine, however, so I suspect it has to do with Kubernetes or the underlying Docker. When an executor dies, its pod is deleted immediately, so I cannot track down why it failed. Is there an option that keeps failed pods around so I can view their logs?
There is a deleteOnTermination setting in the Spark application YAML. See the spark-on-kubernetes README.md:

deleteOnTermination - (Optional) Specifies whether executor pods should be deleted in case of failure or normal termination. Maps to spark.kubernetes.executor.deleteOnTermination, which is available since Spark 3.0.
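For illustration, here is a minimal sketch of a SparkApplication manifest for the Spark operator with that flag set (the names, image, jar path, and resource values below are placeholders, and the exact field placement may vary between operator versions):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-shuffle-heavy-job            # placeholder name
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: my-registry/my-spark:3.1.1     # placeholder image
  mainClass: com.example.Main           # placeholder class
  mainApplicationFile: "local:///opt/spark/jars/my-job.jar"   # placeholder jar
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark
  executor:
    instances: 4
    cores: 2
    memory: "4g"
    deleteOnTermination: false          # keep executor pods around after failure so their logs remain viewable
```

If you submit with spark-submit rather than the operator, the equivalent should be --conf spark.kubernetes.executor.deleteOnTermination=false (the default is true).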
You can view the logs of the previous (terminated) instance of a pod's container like this:
kubectl logs -p <terminated pod name>
You can also use the spec.ttlSecondsAfterFinished field of a Job, as mentioned here.
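For reference, here is a minimal illustration of that field on a plain Kubernetes Job (name and image are placeholders); the TTL controller removes the finished Job and its pods only after the given number of seconds, which leaves a window to read the logs:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                  # placeholder name
spec:
  ttlSecondsAfterFinished: 600       # keep the finished Job and its pods for 10 minutes before cleanup
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox             # placeholder image
          command: ["sh", "-c", "echo done"]
```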
Executors are deleted by default on any failure, and you cannot do anything about that unless you customize the Spark on K8s code or use some advanced K8s tooling. What you can do (and what is probably the easiest approach to start with) is configure an external log collector, e.g. Grafana Loki, which can be deployed with one click to any K8s cluster, or some ELK stack components. These will persist your logs even after the pods are deleted.
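For example, Loki together with the Promtail log shipper can be installed from the Grafana Helm charts (the chart, release, and namespace names below are the commonly used ones; adjust them to your cluster):

```sh
# Add the Grafana chart repository and install the Loki stack (Loki + Promtail)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack --namespace logging --create-namespace
```

Promtail tails the container logs on every node and ships them to Loki, so executor logs remain queryable (e.g. from Grafana) even after the pods are gone.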