Apache Airflow: run DAG operators using Kubernetes on AWS

2/3/2019

I am evaluating Apache Airflow for production use in a data environment, and I would like to know whether Airflow can run operators in self-contained Docker environments on an auto-scaling Kubernetes cluster.

I found the KubernetesPodOperator, which seems to do the job, but the only examples I have found run on Google Cloud. I would like to run this on AWS, but I haven't found any examples of how to do that. I believe AWS EKS or AWS Fargate might fit the bill, but I am not sure.

Can anyone with Airflow experience please let me know if this is possible? I have looked online and haven't found anything clear yet.

-- maldman
airflow
aws-eks
aws-fargate
kubernetes

2 Answers

2/4/2019

You can use Apache Airflow DAG operators on any cloud provider, not only GKE.
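To illustrate the point, here is a minimal sketch of a KubernetesPodOperator task. Nothing in it is specific to one cloud, because the operator only talks to the Kubernetes API. It assumes Airflow 1.10 (hence the `airflow.contrib` import path) and that Airflow itself runs inside the cluster (`in_cluster=True`). The helper `pod_task_kwargs`, the DAG id, the image and the namespace are hypothetical names chosen for this example.

```python
from datetime import datetime


def pod_task_kwargs(task_id, image, cmds, arguments, namespace="default"):
    """Build keyword arguments for one KubernetesPodOperator task.

    Hypothetical helper for this sketch; it is not part of Airflow.
    """
    return {
        "task_id": task_id,
        # Pod names must be valid DNS labels, so underscores are replaced.
        "name": task_id.replace("_", "-"),
        "namespace": namespace,
        "image": image,
        "cmds": cmds,
        "arguments": arguments,
        # Assumption: Airflow runs inside the same cluster. If it runs
        # elsewhere, point the operator at a kubeconfig file instead.
        "in_cluster": True,
        "get_logs": True,
    }


def build_dag():
    # Imported lazily so the helper above can be used without Airflow installed.
    from airflow.models import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import (
        KubernetesPodOperator,
    )

    dag = DAG(
        dag_id="example_pod_operator",  # hypothetical DAG id
        default_args={"owner": "airflow", "start_date": datetime(2019, 2, 1)},
        schedule_interval=None,
    )
    KubernetesPodOperator(
        dag=dag,
        **pod_task_kwargs(
            "hello_pod",
            "python:3.7-slim",
            ["python", "-c"],
            ["print('Hi Airflow')"],
        )
    )
    return dag


# In a real DAG file, expose the DAG at module level so the scheduler finds it:
# dag = build_dag()
```

On EKS the same DAG should work as long as the Airflow worker is allowed to create pods in the chosen namespace; the cluster and its auto scaling are set up outside Airflow.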

The articles "Airflow on Kubernetes (Part 1): A Different Kind of Operator" and "Airflow Kubernetes Operator" provide basic examples of how to use DAGs.

The article "Explore Airflow KubernetesExecutor on AWS and kops" also gives a good explanation, with an example of how to use the airflow-dags and airflow-logs volumes on AWS.

Example:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

args = {
    "owner": "airflow",
    "start_date": datetime(2018, 10, 4),
}

# schedule_interval=None means the DAG only runs when triggered manually.
dag = DAG(
    dag_id="test_kubernetes_executor",
    default_args=args,
    schedule_interval=None,
)


def print_stuff():
    print("Hi Airflow")


# Build two independent three-task chains (task id suffixes 0 and 1).
for i in range(2):
    one_task = PythonOperator(
        task_id="one_task" + str(i),
        python_callable=print_stuff,
        dag=dag,
    )

    second_task = PythonOperator(
        task_id="two_task" + str(i),
        python_callable=print_stuff,
        dag=dag,
    )

    third_task = PythonOperator(
        task_id="third_task" + str(i),
        python_callable=print_stuff,
        dag=dag,
    )

    # Run the three tasks of this chain in order.
    one_task >> second_task >> third_task
-- VKR
Source: StackOverflow

7/10/2019

We have been using Fargate and Airflow in production and the experience so far has been good.

We have been using it for transient workloads and it is turning out to be cheaper for us than having a dedicated Kubernetes cluster. Also, there is no management overhead of any kind.
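A setup like this could look roughly like the sketch below, which runs one container on Fargate through the ECSOperator. It assumes Airflow 1.10.3 or later (where the contrib ECSOperator accepts `launch_type` and `network_configuration`) and an ECS task definition that already exists. The helper `fargate_task_kwargs` and every cluster, task definition, container, subnet and region name are hypothetical placeholders, not our actual configuration.

```python
from datetime import datetime


def fargate_task_kwargs(task_id, cluster, task_definition, container_name,
                        command, subnets, region_name="us-east-1"):
    """Build keyword arguments for one ECSOperator task on Fargate.

    Hypothetical helper for this sketch; it is not part of Airflow.
    """
    return {
        "task_id": task_id,
        "cluster": cluster,
        "task_definition": task_definition,
        "launch_type": "FARGATE",
        "region_name": region_name,
        # Override the command of one container in the task definition.
        "overrides": {
            "containerOverrides": [
                {"name": container_name, "command": command},
            ],
        },
        # Fargate tasks use the awsvpc network mode, so subnets are required.
        "network_configuration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "assignPublicIp": "ENABLED",
            },
        },
    }


def build_dag():
    # Imported lazily so the helper above can be used without Airflow installed.
    from airflow.models import DAG
    from airflow.contrib.operators.ecs_operator import ECSOperator

    dag = DAG(
        dag_id="example_fargate_task",  # hypothetical DAG id
        default_args={"owner": "airflow", "start_date": datetime(2019, 7, 1)},
        schedule_interval=None,
    )
    ECSOperator(
        dag=dag,
        **fargate_task_kwargs(
            task_id="run_on_fargate",
            cluster="my-ecs-cluster",            # placeholder
            task_definition="my-task-def",       # placeholder
            container_name="my-container",       # placeholder
            command=["python", "-c", "print('Hi Airflow')"],
            subnets=["subnet-00000000"],         # placeholder
        )
    )
    return dag


# In a real DAG file, expose the DAG at module level so the scheduler finds it:
# dag = build_dag()
```

Because the container only exists while the task runs, this fits transient workloads: there is no idle cluster to pay for or maintain.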

GitHub: Airflow DAG with ECSOperatorConfig

-- GTO
Source: StackOverflow