Airflow + Kubernetes VS Airflow + Spark

10/11/2018

An article I read earlier said that newer Kubernetes versions already include Spark capabilities, but that they are used in a different way, such as using KubernetesPodOperator instead of BashOperator / PythonOperator to do SparkSubmit.

Is the best practice for combining Airflow + Kubernetes to remove Spark and use KubernetesPodOperator to execute the task?

Which one has better performance, given that Kubernetes has AutoScaling that Spark doesn’t have?

I need a Kubernetes expert to help explain this. I’m still a newbie with Kubernetes, Spark, and Airflow. :)

Thank You.

-- Xnuxer
airflow
apache-spark
kubernetes

2 Answers

10/22/2019

Another solution that may help is to use Apache Livy on Kubernetes (PR: https://github.com/apache/incubator-livy/pull/167) with the Airflow HttpOperator.
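As a rough sketch of what that HTTP call could look like: Livy exposes a REST endpoint, POST /batches, that accepts a JSON body describing the Spark application. The helper below only builds the request and does not send it. The function name, the Livy host, and the jar path are hypothetical and chosen purely for illustration.

```python
import json
import urllib.request


def build_livy_batch_request(livy_url, app_file, class_name=None, args=None, conf=None):
    """Build (but do not send) a POST /batches request for Livy's REST API."""
    payload = {"file": app_file}  # "file" is the only required field
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and application path, for illustration only.
request = build_livy_batch_request(
    "http://livy.example.internal:8998",
    "local:///opt/spark/examples/jars/spark-examples.jar",
    class_name="org.apache.spark.examples.SparkPi",
    args=["100"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In an Airflow DAG, the same JSON body would presumably be passed as the data of the HTTP operator, with the Livy host configured as an Airflow HTTP connection.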

-- Aliaksandr Sasnouskikh
Source: StackOverflow

10/11/2018

"in new Kubernetes version, already include Spark capabilities"

I think you got that backwards. New versions of Spark (2.3 and later) can run tasks in a Kubernetes cluster.

"using KubernetesPodOperator instead of using BashOperator / PythonOperator to do SparkSubmit"

Using Kubernetes would allow you to run containers with whatever isolated dependencies you wanted.

Without containers, that means:

  1. With BashOperator, you must distribute the files to some shared filesystem or to all the nodes that run the Airflow tasks. For example, spark-submit must be available on all Airflow nodes.
  2. Similarly, with Python, you ship zip or egg files that include your pip/conda dependency environment.

"remove Spark and using KubernetesPodOperator to execute the task"

There are still good reasons to run Spark with Airflow, but instead you would package a Spark driver container that executes spark-submit inside the container against the Kubernetes cluster. This way, you only need Docker installed, not Spark (and all its dependencies).
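To make this concrete, here is a minimal sketch of the spark-submit command such a container might run. The flags (`--master k8s://...`, `--deploy-mode cluster`, `spark.kubernetes.container.image`, `spark.executor.instances`) come from Spark's Kubernetes documentation. The function name, job name, image name, API server URL, and jar path are assumptions made up for the example.

```python
import shlex


def build_spark_submit_command(k8s_api_url, image, app_resource, main_class,
                               executor_instances=2, app_args=()):
    """Return the argv list for a cluster-mode spark-submit against Kubernetes."""
    return [
        "spark-submit",
        "--master", "k8s://" + k8s_api_url,
        "--deploy-mode", "cluster",
        "--name", "example-job",  # hypothetical job name
        "--class", main_class,
        "--conf", f"spark.executor.instances={executor_instances}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_resource,
        *app_args,
    ]


# All values below are placeholders for illustration.
command = build_spark_submit_command(
    k8s_api_url="https://kubernetes.default.svc:443",
    image="registry.example.com/spark:latest",
    app_resource="local:///opt/spark/examples/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
    executor_instances=3,
    app_args=["100"],
)
print(shlex.join(command))
```

With KubernetesPodOperator, the first element could be passed as `cmds` and the rest as `arguments`, with `image` pointing at an image that contains the Spark client. The pod would also need permission to create driver and executor pods in the target namespace.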

"Kubernetes have AutoScaling that Spark doesn’t have"

Spark does have Dynamic Resource Allocation...
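For reference, dynamic allocation is switched on through Spark configuration properties. The sketch below renders them as `--conf` flags for spark-submit. The property names are real Spark settings; the helper name and the executor bounds are made up for the example. It also assumes Spark 3.0 or later, where `spark.dynamicAllocation.shuffleTracking.enabled` lets dynamic allocation work without an external shuffle service, which matters on Kubernetes.

```python
def dynamic_allocation_flags(min_executors, max_executors):
    """Render Spark dynamic-allocation settings as spark-submit --conf flags."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        # Assumption: Spark 3.0+, no external shuffle service available.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Example bounds chosen arbitrarily.
flags = dynamic_allocation_flags(min_executors=1, max_executors=10)
print(" ".join(flags))
```

Note that this scales the number of Spark executors within a job; it is separate from Kubernetes autoscaling of pods or nodes.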

-- OneCricketeer
Source: StackOverflow