I am evaluating Apache Airflow for production use in a data environment, and I would like to know whether Airflow can run operators in self-contained Docker containers on an auto-scaling Kubernetes cluster.
I found the following operator: KubernetesPodOperator
which seems to do the job, but the only examples I have found are on Google Cloud. I would like to run this on AWS, but I haven't found any examples of how that would be done. I believe AWS EKS or AWS Fargate might fit the bill, but I'm not sure.
Can anyone with Airflow experience please let me know whether this is possible? I have looked online and haven't found anything clear yet.
You can use Apache Airflow DAG operators in any cloud provider, not only GKE.
The articles Airflow-on-kubernetes-part-1-a-different-kind-of-operator and Airflow Kubernetes Operator provide basic examples of how to use DAGs.
The article Explore Airflow KubernetesExecutor on AWS and kops also provides a good explanation, with an example of how to use the airflow-dags and airflow-logs volumes on AWS.
Example:
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime

args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 10, 4),
}

dag = DAG(
    dag_id='test_kubernetes_executor',
    default_args=args,
    schedule_interval=None
)

def print_stuff():
    print("Hi Airflow")

# Create three chained tasks per iteration; with the KubernetesExecutor,
# each task instance runs in its own pod.
for i in range(2):
    one_task = PythonOperator(
        task_id="one_task" + str(i),
        python_callable=print_stuff,
        dag=dag
    )

    second_task = PythonOperator(
        task_id="two_task" + str(i),
        python_callable=print_stuff,
        dag=dag
    )

    third_task = PythonOperator(
        task_id="third_task" + str(i),
        python_callable=print_stuff,
        dag=dag
    )

    one_task >> second_task >> third_task
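Since the question asks specifically about KubernetesPodOperator, here is a minimal sketch of how a task could be launched in its own container on an EKS cluster, attached to the same dag object as above. The image name, namespace, and kubeconfig path are placeholders; nothing in the operator itself is GKE-specific:

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

pod_task = KubernetesPodOperator(
    task_id='pod_task',
    name='pod-task',                        # name of the pod created in the cluster
    namespace='default',
    image='python:3.7-slim',                # any image the cluster can pull, e.g. from ECR
    cmds=['python', '-c'],
    arguments=['print("Hi from a pod on EKS")'],
    in_cluster=True,                        # True when the scheduler/workers run inside the cluster
    # config_file='/path/to/kubeconfig',    # otherwise point to a kubeconfig, e.g. one generated by
                                            # `aws eks update-kubeconfig --name <cluster-name>`
    get_logs=True,
    is_delete_operator_pod=True,            # remove the pod once the task finishes
    dag=dag,
)

With the KubernetesExecutor any operator already gets its own pod; KubernetesPodOperator additionally lets you pick the image and command per task.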
We have been using Fargate with Airflow in production, and the experience so far has been good.
We use it for transient workloads, and it is turning out to be cheaper for us than running a dedicated Kubernetes cluster. There is also no cluster management overhead of any kind.
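For reference, one way to trigger Fargate tasks from a DAG is the ECSOperator. The sketch below is not necessarily the setup described above; the cluster name, task definition, container name, subnet, and security group are placeholders, and the exact parameter set can vary between Airflow versions:

from datetime import datetime
from airflow.models import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator

dag = DAG(
    dag_id='fargate_example',
    start_date=datetime(2018, 10, 4),
    schedule_interval=None,
)

run_job = ECSOperator(
    task_id='run_job',
    cluster='my-fargate-cluster',            # existing ECS cluster (placeholder)
    task_definition='my-task-definition',    # task definition registered for Fargate (placeholder)
    launch_type='FARGATE',
    overrides={
        'containerOverrides': [{
            'name': 'my-container',          # container name from the task definition (placeholder)
            'command': ['python', 'job.py'],
        }],
    },
    network_configuration={                  # Fargate tasks require awsvpc networking
        'awsvpcConfiguration': {
            'subnets': ['subnet-0123456789abcdef0'],
            'securityGroups': ['sg-0123456789abcdef0'],
            'assignPublicIp': 'ENABLED',
        },
    },
    region_name='us-east-1',
    dag=dag,
)

Each task run then starts a Fargate task, does its work, and shuts down, so you only pay for the compute while the task is running.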