I am writing an Airflow DAG to pull data from an API and store it in a database I own. Following best practices outlined in We're All Using Airflow Wrong, I'm writing the DAG as a sequence of KubernetesPodOperators that run pretty simple Python functions as the entry point to the Docker image.
The problem I'm trying to solve is that this DAG should only pull data for the execution_date.
If I were using a PythonOperator (doc), I could use the provide_context argument to make the execution date available to the function. But judging from the KubernetesPodOperator's documentation, it seems the Kubernetes operator has no argument that does what provide_context does.
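For reference, here's the PythonOperator pattern I mean (a sketch, assuming Airflow 1.x; the task name and DAG object are placeholders):

    from airflow.operators.python_operator import PythonOperator

    def pull_data(ds, **kwargs):
        # 'ds' is the execution date (YYYY-MM-DD), injected because
        # provide_context=True passes the template context as kwargs
        print('Pulling data for %s' % ds)

    pull_task = PythonOperator(
        task_id='pull_data',          # placeholder task_id
        python_callable=pull_data,
        provide_context=True,
        dag=dag,                      # assumes a DAG object named 'dag'
    )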
My best guess is that you could use the arguments parameter to pass in the date, and since it's templated, you can reference it like this:
    my_pod_operator = KubernetesPodOperator(
        # ... other args here
        arguments=['python', 'my_script.py', '{{ ds }}'],
    )
And then you'd read the execution date like you'd read any other argument provided to a Python script, using sys.argv.
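So on the container side, something like this (a minimal sketch; my_script.py is the script from the snippet above, and the parsing is an assumption):

    # my_script.py -- reads the execution date Airflow rendered into argv
    import sys
    from datetime import datetime

    if __name__ == '__main__':
        # Airflow renders '{{ ds }}' to a YYYY-MM-DD string before the pod starts
        execution_date = datetime.strptime(sys.argv[1], '%Y-%m-%d').date()
        print('Pulling data for %s' % execution_date)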
Is this the right way of doing it?
Thanks for the help.
Yes, that is the correct way of doing it.
Each operator has template_fields. All the parameters listed in template_fields can render Jinja2 templates and Airflow macros.
For KubernetesPodOperator, if you check the docs, you would find:

    template_fields = ['cmds', 'arguments', 'env_vars', 'config_file']
which means you can pass '{{ ds }}' to any of the four params listed above.
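For instance, here's a sketch that passes the date through env_vars instead of arguments (the import path is the Airflow 1.x contrib location; EXECUTION_DATE, the image name, and the other identifiers are placeholders):

    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    pull_task = KubernetesPodOperator(
        task_id='pull_data',                    # placeholder task_id
        name='pull-data',
        namespace='default',
        image='my-registry/my-image:latest',    # placeholder image
        cmds=['python', 'my_script.py'],
        # env_vars is in template_fields, so '{{ ds }}' is rendered here too;
        # EXECUTION_DATE is a hypothetical variable name
        env_vars={'EXECUTION_DATE': '{{ ds }}'},
    )

The script would then read the date from os.environ['EXECUTION_DATE'] instead of sys.argv.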