An article I read recently said that newer Kubernetes versions already include Spark capabilities, but used in a different way, such as using KubernetesPodOperator instead of using BashOperator / PythonOperator to do SparkSubmit.
Is the best practice for combining Airflow + Kubernetes to remove Spark and use KubernetesPodOperator to execute the task? Would that give better performance, since Kubernetes has autoscaling that Spark doesn't have?
I need someone who knows Kubernetes well to help explain this. I'm still a newbie with these Kubernetes, Spark, and Airflow things. :slight_smile:
Thank you.
One more solution that may help you is to run Apache Livy on Kubernetes (PR: https://github.com/apache/incubator-livy/pull/167) and submit jobs to it from Airflow with the HttpOperator.
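A minimal sketch of what that could look like, assuming Airflow 1.10, a hypothetical Airflow connection named livy_http pointing at your Livy server, and a placeholder jar/class in place of your real Spark application:

```python
import json

from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.utils.dates import days_ago

with DAG(dag_id="spark_via_livy", start_date=days_ago(1), schedule_interval=None) as dag:
    # POST to Livy's /batches REST endpoint to submit a Spark job.
    # "livy_http" is a hypothetical connection pointing at the Livy
    # server (e.g. http://livy.default.svc.cluster.local:8998).
    submit_spark_job = SimpleHttpOperator(
        task_id="submit_spark_job",
        http_conn_id="livy_http",
        endpoint="batches",
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json.dumps({
            # Placeholder jar and class; substitute your own application.
            "file": "local:///opt/spark/examples/jars/spark-examples.jar",
            "className": "org.apache.spark.examples.SparkPi",
            "args": ["100"],
        }),
    )
```

Airflow then only needs HTTP access to Livy; Spark itself lives behind the Livy server.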
> newer Kubernetes versions already include Spark capabilities
I think you got that backwards. New versions of Spark can run tasks in a Kubernetes cluster.
> using KubernetesPodOperator instead of using BashOperator / PythonOperator to do SparkSubmit
Using Kubernetes would allow you to run containers with whatever isolated dependencies you want. With the BashOperator / PythonOperator approach, by contrast, spark-submit must be available on all Airflow nodes.

> remove Spark and use KubernetesPodOperator to execute the task

There are still good reasons to run Spark with Airflow, but instead you would package a Spark driver container that executes spark-submit inside a container against the Kubernetes cluster. This way, you only need docker installed on the Airflow nodes, not Spark (and all its dependencies). A sketch of that pattern is below.
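A minimal sketch, assuming Airflow 1.10 with the Kubernetes extras installed and a hypothetical my-spark-driver:latest image that bundles Spark plus your application jar:

```python
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.utils.dates import days_ago

with DAG(dag_id="spark_on_k8s", start_date=days_ago(1), schedule_interval=None) as dag:
    # Airflow launches a pod from the driver image; the container itself
    # runs spark-submit against the Kubernetes API server, so the Airflow
    # workers never need Spark installed.
    run_spark = KubernetesPodOperator(
        task_id="spark_submit_in_pod",
        name="spark-driver",
        namespace="default",
        image="my-spark-driver:latest",  # hypothetical image name
        cmds=["spark-submit"],
        arguments=[
            "--master", "k8s://https://kubernetes.default.svc",
            "--deploy-mode", "cluster",
            "--class", "org.apache.spark.examples.SparkPi",
            "--conf", "spark.executor.instances=2",
            "--conf", "spark.kubernetes.container.image=my-spark-driver:latest",
            "local:///opt/spark/examples/jars/spark-examples.jar",
        ],
        get_logs=True,
    )
```

The same spark-submit invocation also shows the earlier point about Spark on Kubernetes: the --master k8s://... URL is how newer Spark versions schedule their executors as pods on the cluster.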
> Kubernetes has autoscaling that Spark doesn't have
Spark does have Dynamic Resource Allocation, which scales the number of executors up and down with the workload.
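For reference, a minimal sketch of enabling it from PySpark; the executor bounds are arbitrary example values:

```python
from pyspark.sql import SparkSession

# Enable Dynamic Resource Allocation so Spark adds and removes executors
# based on pending work. The min/max values are example placeholders.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    # Executors can only be released safely if shuffle data outlives them:
    # use the external shuffle service (or shuffle tracking on Spark 3+).
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```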