Long-running Airflow task gets incorrectly marked as failed due to hostname mismatch

6/27/2019

I have a long-running Cloud Composer Airflow task that kicks off a job using the KubernetesPodOperator. Sometimes it finishes successfully after about two hours, but more often it gets marked as failed with the following error in the Airflow worker log:
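For reference, the task is defined roughly like the sketch below; the DAG id, schedule, and image are placeholders rather than the real values (the import is the Airflow 1.10 contrib path that Composer uses):

    # Simplified version of the DAG -- identifiers and image are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    with DAG(
        dag_id="long_running_job",        # placeholder
        start_date=datetime(2019, 6, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_job = KubernetesPodOperator(
            task_id="run_job",
            name="run-job",
            namespace="default",
            image="gcr.io/my-project/my-job:latest",  # placeholder
            # The underlying pod usually completes after roughly two hours.
            startup_timeout_seconds=300,
            get_logs=True,
        )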

[2019-06-24 18:49:34,718] {jobs.py:2685} WARNING - The recorded hostname airflow-worker-xxxxxxxxxx-aaaaa does not match this instance's hostname airflow-worker-xxxxxxxxxx-bbbbb
Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
...

  File "/usr/local/lib/airflow/airflow/jobs.py", line 2686, in heartbeat_callback
    raise AirflowException("Hostname of job runner does not match")
airflow.exceptions.AirflowException: Hostname of job runner does not match

After the task is marked as failed, the actual KubernetesPodOperator job still finishes successfully without any errors. Both of the workers referenced in the log, airflow-worker-xxxxxxxxxx-aaaaa and airflow-worker-xxxxxxxxxx-bbbbb, are still up and running.

This Airflow PR made it possible to override the hostname, but I can't tell whether that's an appropriate solution in this case, since none of the workers appear to have died or changed during the task run. Is it normal for a running task to be reassigned to a different worker? And if so, why does the Airflow source fail the task in the event of a hostname mismatch?
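For reference, my understanding is that the override the PR exposes is the [core] hostname_callable setting, which points at a Python callable; a minimal sketch (the module name here is made up) would be:

    # my_hostname.py -- hypothetical module that must be importable on every
    # Airflow worker. Wired up in airflow.cfg (Airflow 1.10 colon syntax):
    #
    #   [core]
    #   hostname_callable = my_hostname:get_hostname
    #
    import socket

    def get_hostname():
        # Airflow's default is socket.getfqdn(); the callable can return any
        # stable identifier to be recorded for the task instance instead.
        return socket.getfqdn()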

-- chmod_007
airflow
google-cloud-composer
google-kubernetes-engine

1 Answer

7/17/2019

I think the root cause may be a known Airflow issue that makes the scheduler redeliver a task after some time. If the redelivered task lands on a different worker, the recorded hostname for the task is updated to that new worker; when the original worker then completes the task, the hostnames no longer match and you get this error. If the cluster is busy (which is likely, considering the task takes about two hours), your task may be queued for a long time before being picked up by a worker.

Some ideas that may solve this (see the example configuration after the list):

  • Increase visibility_timeout (the Celery broker setting that controls when an unacknowledged task is redelivered)
  • Increase worker_concurrency, so a worker can process more tasks
  • Increase node count to have more workers
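As a rough illustration (the values are placeholders, and on Cloud Composer you would set these as Airflow configuration overrides rather than editing the file; the [celery_broker_transport_options] section assumes a reasonably recent Airflow 1.10.x), the first two knobs map to airflow.cfg like this:

    # airflow.cfg -- illustrative values only.
    [celery]
    # Number of task slots each worker runs in parallel.
    worker_concurrency = 24

    [celery_broker_transport_options]
    # Seconds Celery waits for a task to be acknowledged before redelivering
    # it to another worker; keep it well above the longest expected task
    # duration (two hours is about 7200 seconds).
    visibility_timeout = 21600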

Anyhow, it's a bit hard to troubleshoot this without checking the logs and the environment, so if this is still happening, feel free to contact GCP support.

-- IƱigo
Source: StackOverflow