I have a long-running Cloud Composer Airflow task that kicks off a job using the KubernetesPodOperator. Sometimes it finishes successfully after about two hours, but more often it gets marked as failed with the following error in the Airflow worker log:
[2019-06-24 18:49:34,718] {jobs.py:2685} WARNING - The recorded hostname airflow-worker-xxxxxxxxxx-aaaaa does not match this instance's hostname airflow-worker-xxxxxxxxxx-bbbbb
Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  ...
  File "/usr/local/lib/airflow/airflow/jobs.py", line 2686, in heartbeat_callback
    raise AirflowException("Hostname of job runner does not match")
airflow.exceptions.AirflowException: Hostname of job runner does not match
After the task is marked as failed, the actual KubernetesPodOperator job still finishes successfully without any errors. Both of the workers referenced in the log, airflow-worker-xxxxxxxxxx-aaaaa and airflow-worker-xxxxxxxxxx-bbbbb, are still up and running.
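For reference, here is a simplified sketch of how the task is wired up (the DAG name, image, arguments, and other values are placeholders, not my real configuration):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 6, 1),
    'retries': 0,
}

with DAG('long_running_pod_dag',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    # Launches a pod that typically runs for about two hours before completing.
    long_job = KubernetesPodOperator(
        task_id='long_running_job',
        name='long-running-job',
        namespace='default',
        image='gcr.io/my-project/my-batch-image:latest',
        arguments=['--run-full-batch'],
        startup_timeout_seconds=300,
        is_delete_operator_pod=True,
    )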
This Airflow PR made it possible to override the hostname, but I can't tell if that's an appropriate solution in this case, since none of the workers appear to have died or changed during the task run. Is it normal for a running task to be reassigned to a different worker? And if so, why does the Airflow source fail the task in the event of a hostname mismatch?
I think the root cause may be a known Airflow issue that makes the Scheduler try to redeliver a task after some time. If the redelivered task is picked up by a different worker, the task's recorded hostname is updated to that new worker; when the original worker then completes the task, the hostnames no longer match and this error is raised. If the cluster is busy (which is likely, given that the task takes about two hours), your task may also be queued for a long time before being picked up by a worker.
Some ideas that may solve this (a configuration sketch follows the list):
- Increase visibility_timeout, so that a long-running task is not redelivered to another worker before it finishes
- Increase worker_concurrency, so a worker can process more tasks
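As a rough sketch of what those overrides could look like on Cloud Composer (the environment name, location, and values below are placeholders, not recommendations), they correspond to the [celery] and [celery_broker_transport_options] sections of the Airflow configuration and can be applied as configuration overrides:

gcloud composer environments update my-environment \
    --location us-central1 \
    --update-airflow-configs=celery-worker_concurrency=16,celery_broker_transport_options-visibility_timeout=21600

The idea would be to make visibility_timeout comfortably longer than your longest task (well over the roughly two hours in this case), so the broker doesn't hand the task to a second worker while the first one is still running it.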
Anyhow, it's a bit hard to troubleshoot this without checking the logs and the environment, so if this is still happening, feel free to contact GCP support.