When I launch the SparkPi
example on a self-hosted Kubernetes cluster, the executor pods are quickly created -> get an error status -> are deleted -> are replaced by new executor pods.
I tried the same command on Google Kubernetes Engine with success. I checked the RBAC role binding
to make sure that the service account has the right to create pods.
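For reference, this is roughly how I verified the permissions (the spark service account matches the one passed to spark-submit below, and the role binding follows the example from the Spark on Kubernetes documentation):
# Check that the spark service account is allowed to create pods
kubectl auth can-i create pods --as=system:serviceaccount:default:spark --namespace=default
# Role binding as suggested by the Spark on Kubernetes documentation
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default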
Guessing the name of the next executor pod, I can see with kubectl describe pod <predicted_executor_pod_with_number>
that the pod is actually created (a label-based watch that avoids the guessing is sketched after the events):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1s default-scheduler Successfully assigned default/examplepi-1563878435019-exec-145 to slave-node04
Normal Pulling 0s kubelet, slave-node04 Pulling image "myregistry:5000/imagery:c5b8e0e64cc98284fc4627e838950c34ccb22676.5"
Normal Pulled 0s kubelet, slave-node04 Successfully pulled image "myregistry:5000/imagery:c5b8e0e64cc98284fc4627e838950c34ccb22676.5"
Normal Created 0s kubelet, slave-node04 Created container executor
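To avoid guessing pod names, the executor pods can also be selected by the labels Spark on Kubernetes puts on them (spark-role=executor, as far as I can tell), for example:
# Watch executor pods as they are created and deleted (label assumed from Spark's executor pod spec)
kubectl get pods -l spark-role=executor --watch
# Try to capture executor output before the pods are removed
kubectl logs -f -l spark-role=executor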
This is my spark-submit call:
/opt/spark/bin/spark-submit \
--master k8s://https://mycustomk8scluster:6443 \
--name examplepi \
--deploy-mode cluster \
--driver-memory 2G \
--executor-memory 2G \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/work-dir/log4j.properties \
--conf spark.kubernetes.container.image=myregistry:5000/imagery:c5b8e0e64cc98284fc4627e838950c34ccb22676.5 \
--conf spark.kubernetes.executor.container.image=myregistry:5000/imagery:c5b8e0e64cc98284fc4627e838950c34ccb22676.5 \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.driver.pod.name=pi-driver \
--conf spark.driver.allowMultipleContexts=true \
--conf spark.kubernetes.local.dirs.tmpfs=true \
--class com.olameter.sdi.imagery.IngestFromGrpc \
--class org.apache.spark.examples.SparkPi \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar 100
I expect the required executors (2) to be created. If the driver cannot create them, I would at least expect some log output to diagnose the issue.
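So far the only logs I can think of checking are on the driver pod itself; assuming the pod name set via spark.kubernetes.driver.pod.name above, something like:
# Follow the driver log while executors are created and torn down
kubectl logs -f pi-driver
# Check driver pod events as well
kubectl describe pod pi-driver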
The issue was related to the Hadoop + Spark integration. I was using the "Hadoop free" Spark binary (spark-2.4.3-bin-without-hadoop.tgz)
together with Hadoop 3.1.2. Wiring the two together via environment variables seemed to be problematic for the Spark executors.
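For context, a Hadoop-free build is normally pointed at an external Hadoop installation via conf/spark-env.sh, roughly like this (a sketch based on Spark's "Hadoop Free Build" documentation, not my exact configuration):
# conf/spark-env.sh
# Make the hadoop-free Spark build pick up the Hadoop 3.1.2 classpath
export SPARK_DIST_CLASSPATH=$(hadoop classpath)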
I compiled Spark with Hadoop 3.1.2 to solve this issue. See: https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version-and-enabling-yarn.
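For reference, the rebuild looked roughly like this (the hadoop-3.1 profile and the exact flags are what I believe a 2.4.3 build with Kubernetes support needs; <my_tag> is a placeholder for the image tag):
# Build a Spark distribution against Hadoop 3.1.2 with Kubernetes support
./dev/make-distribution.sh --name hadoop-3.1.2 --tgz -Pkubernetes -Phadoop-3.1 -Dhadoop.version=3.1.2 -DskipTests
# Rebuild and push the container image used by the driver and executors
./bin/docker-image-tool.sh -r myregistry:5000 -t <my_tag> build
./bin/docker-image-tool.sh -r myregistry:5000 -t <my_tag> push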