I'm trying to set up a DAG that will create a Spark cluster in the first task, submit Spark applications to the cluster in the intermediate tasks, and finally tear down the Spark cluster in the last task.
The approach I'm attempting right now is to use KubernetesPodOperators to create Spark Master and Worker pods. The issue is that they run a Spark daemon which never exits. Because the command called on the pod never exits, those tasks get stuck in Airflow in a running state. So, is there a way to run the Spark daemon and then continue on to the next tasks in the DAG?
Apache Spark has native support for running on Kubernetes: when you submit a job with spark-submit against the Kubernetes API server, Spark creates a driver pod, and that driver launches the executor pods that run the job.
You don't need to create Master and Worker pods directly in Airflow.
Instead, build a Docker image containing Apache Spark with the Kubernetes backend (an example Dockerfile and a bin/docker-image-tool.sh build script are provided with the Spark distribution). Then use a KubernetesPodOperator to run a container based on this image and submit the job to the cluster with spark-submit. The following sample task is adapted from the Apache Spark documentation on submitting jobs directly to a Kubernetes cluster.
# Airflow 2.x with the cncf.kubernetes provider installed; on Airflow 1.10
# the import is airflow.contrib.operators.kubernetes_pod_operator instead.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

kubernetes_full_pod = KubernetesPodOperator(
    task_id='spark-job-task-ex',
    name='spark-job-task',
    namespace='default',
    image='<prebuilt-spark-image-name>',
    # Path to spark-submit inside the prebuilt image; adjust if your image differs.
    cmds=['bin/spark-submit'],
    # Each flag and its value must be a separate list element so that
    # spark-submit receives them as distinct arguments.
    arguments=[
        '--master', 'k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>',
        '--deploy-mode', 'cluster',
        '--name', 'spark-pi',
        '--class', 'org.apache.spark.examples.SparkPi',
        '--conf', 'spark.executor.instances=5',
        '--conf', 'spark.kubernetes.container.image=<prebuilt-spark-image-name>',
        'local:///path/to/examples.jar',
    ],
    # ...
)
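
With this approach there is no long-running master or worker daemon for Airflow to manage: each task starts its own driver and executor pods, and they go away when the job finishes, so the separate create-cluster and teardown tasks are no longer needed. As a minimal sketch of how several such jobs could be chained (the DAG id, schedule, spark_submit_task helper, and task names below are illustrative placeholders, not part of the example above):

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator


def spark_submit_task(task_id, job_class, dag):
    # Hypothetical helper to avoid repeating the spark-submit boilerplate per job.
    return KubernetesPodOperator(
        task_id=task_id,
        name=task_id,
        namespace='default',
        image='<prebuilt-spark-image-name>',
        cmds=['bin/spark-submit'],
        arguments=[
            '--master', 'k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>',
            '--deploy-mode', 'cluster',
            '--name', task_id,
            '--class', job_class,
            '--conf', 'spark.executor.instances=5',
            '--conf', 'spark.kubernetes.container.image=<prebuilt-spark-image-name>',
            'local:///path/to/examples.jar',
        ],
        get_logs=True,  # stream spark-submit output into the Airflow task log
        dag=dag,
    )


dag = DAG('spark_on_k8s_example',
          start_date=datetime(2021, 1, 1),
          schedule_interval=None)

first_job = spark_submit_task('spark-job-1', 'org.apache.spark.examples.SparkPi', dag)
second_job = spark_submit_task('spark-job-2', 'org.apache.spark.examples.SparkPi', dag)

# No cluster create/teardown tasks needed; each spark-submit is self-contained.
first_job >> second_job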