Spark on k8s - emptyDir not mounted to directory

2/26/2019

I kicked off a Spark job on Kubernetes with quite a large volume of data, and the job failed because there was not enough space in the /var/data/spark-xxx directory.

As the Spark documentation says at https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md:

Spark uses temporary scratch space to spill data to disk during shuffles and other operations. When using Kubernetes as the resource manager the pods will be created with an emptyDir volume mounted for each directory listed in SPARK_LOCAL_DIRS. If no directories are explicitly specified then a default directory is created and configured appropriately.

It seems that the /var/data/spark-xx directory is the default location for that emptyDir. So I tried to map the emptyDir to a volume with more space, one that is already mounted to the driver and executor pods.

I mapped it in the properties file, and I can see that it is mounted in the shell:

spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.options.claimName=sparkstorage
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.options.claimName=sparkstorage
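(For what it's worth, the mount from the properties above can be double-checked on a running pod; the pod name here is a placeholder for whatever name Spark generated:

kubectl exec <driver-pod-name> -- df -h /checkpoint

which should report the capacity of the sparkstorage claim rather than the node's local disk.)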

I am wondering whether it is possible to mount the emptyDir somehow on my persistent storage, so I can spill more data and avoid job failures?

-- Tomasz Krol
apache-spark
kubernetes

1 Answer

4/10/2020

I found that Spark 3.0 has addressed this problem and implemented the feature.

Spark supports using volumes to spill data during shuffles and other operations. To use a volume as local storage, the volume's name must start with spark-local-dir-, for example:

--conf spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.path=<mount path>
--conf spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.readOnly=false
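Applied to the setup from the question, a sketch might look like the following (this assumes the same sparkstorage claim and a hypothetical volume name spark-local-dir-spill; the same properties are repeated for executors, since spilling mostly happens there):

--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.path=/spill
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.readOnly=false
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-spill.options.claimName=sparkstorage
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.path=/spill
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.readOnly=false
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.options.claimName=sparkstorage

With this naming, Spark treats the mount path as local scratch space instead of creating the default emptyDir, so spills land on the persistent volume.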


-- Merlin Wang
Source: StackOverflow