Airflow Scheduler liveness probe crashing (version 2.0)


I have just upgraded Airflow from 1.10.13 to 2.0. I am running it on Kubernetes (Azure AKS) with the KubernetesExecutor. Unfortunately, the scheduler gets killed every 15-20 minutes because the liveness probe fails, so the pod keeps restarting.

I had no issues in 1.10.13.

This is my Liveness probe:

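import os

from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.db import create_session
from airflow.utils.net import get_hostname
import sys

with create_session() as session:
  # Fetch the most recent SchedulerJob row recorded for this host
  job = session.query(SchedulerJob).filter_by(hostname=get_hostname()).order_by(
    SchedulerJob.latest_heartbeat.desc()).limit(1).first()

# Exit code 0 tells Kubernetes the scheduler is healthy; 1 fails the probe
sys.exit(0 if job.is_alive() else 1)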
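
For anyone wondering what this probe actually tests: as far as I understand it, is_alive() comes down to "the job is in the running state and its last heartbeat is recent enough". The sketch below is a stdlib-only illustration of that idea, not Airflow's own code. The function name heartbeat_is_fresh and the constant are hypothetical, and the 30-second value assumes the default of the scheduler_health_check_threshold setting.

```python
from datetime import datetime, timedelta, timezone

# Assumption: mirrors the default of [scheduler] scheduler_health_check_threshold.
HEALTH_CHECK_THRESHOLD_SECONDS = 30


def heartbeat_is_fresh(state, latest_heartbeat, now=None,
                       threshold=HEALTH_CHECK_THRESHOLD_SECONDS):
    """Return True if the job is running and its heartbeat is newer than the threshold."""
    now = now or datetime.now(timezone.utc)
    age_seconds = (now - latest_heartbeat).total_seconds()
    return state == "running" and age_seconds < threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A heartbeat from 5 seconds ago passes; one from 5 minutes ago does not.
    print(heartbeat_is_fresh("running", now - timedelta(seconds=5), now=now))
    print(heartbeat_is_fresh("running", now - timedelta(minutes=5), now=now))
```

If that reading is right, a scheduler that stops heartbeating for longer than the threshold will fail the probe even though its process is still up.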

When I look in the scheduler logs I see the following:

[2021-02-16 12:18:21,883] {} DEBUG - Waiting for <ForkProcess name='DagFileProcessor489-Process' pid=12812 parent=9286 stopped exitcode=0>
[2021-02-16 12:18:22,228] {} DEBUG - No tasks to consider for execution.
[2021-02-16 12:18:22,232] {} DEBUG - 0 running task instances
[2021-02-16 12:18:22,232] {} DEBUG - 0 in queue
[2021-02-16 12:18:22,232] {} DEBUG - 32 open slots
[2021-02-16 12:18:22,232] {} DEBUG - Calling the <class 'airflow.executors.kubernetes_executor.KubernetesExecutor'> sync method
[2021-02-16 12:18:22,233] {} DEBUG - Syncing KubernetesExecutor
[2021-02-16 12:18:22,233] {} DEBUG - KubeJobWatcher alive, continuing
[2021-02-16 12:18:22,234] {} DEBUG - Received message of type DagParsingStat
[2021-02-16 12:18:22,234] {} DEBUG - Received message of type DagParsingStat
[2021-02-16 12:18:22,236] {} DEBUG - Received message of type DagParsingStat
[2021-02-16 12:18:22,246] {} DEBUG - Next timed event is in 0.143059
[2021-02-16 12:18:22,246] {} DEBUG - Ran scheduling loop in 0.05 seconds
[2021-02-16 12:18:22,422] {} DEBUG - No tasks to consider for execution.
[2021-02-16 12:18:22,426] {} DEBUG - 0 running task instances
[2021-02-16 12:18:22,426] {} DEBUG - 0 in queue
[2021-02-16 12:18:22,426] {} DEBUG - 32 open slots
[2021-02-16 12:18:22,427] {} DEBUG - Calling the <class 'airflow.executors.kubernetes_executor.KubernetesExecutor'> sync method
[2021-02-16 12:18:22,427] {} DEBUG - Syncing KubernetesExecutor
[2021-02-16 12:18:22,427] {} DEBUG - KubeJobWatcher alive, continuing
[2021-02-16 12:18:22,439] {} INFO - Resetting orphaned tasks for active dag runs
[2021-02-16 12:18:22,452] {} DEBUG - Disposing DB connection pool (PID 12819)
[2021-02-16 12:18:22,460] {} DEBUG - Waiting for <ForkProcess name='DagFileProcessor490-Process' pid=12819 parent=9286 stopped exitcode=0>
[2021-02-16 12:18:23,009] {} DEBUG - Disposing DB connection pool (PID 12826)
[2021-02-16 12:18:23,017] {} DEBUG - Waiting for <ForkProcess name='DagFileProcessor491-Process' pid=12826 parent=9286 stopped exitcode=0>
[2021-02-16 12:18:23,594] {} DEBUG - Disposing DB connection pool (PID 12833)

... Many of these Disposing DB connection pool entries here

[2021-02-16 12:20:08,212] {} DEBUG - Waiting for <ForkProcess name='DagFileProcessor675-Process' pid=14146 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:08,916] {} DEBUG - Disposing DB connection pool (PID 14153)
[2021-02-16 12:20:08,924] {} DEBUG - Waiting for <ForkProcess name='DagFileProcessor676-Process' pid=14153 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:09,475] {} DEBUG - Disposing DB connection pool (PID 14160)
[2021-02-16 12:20:09,484] {} DEBUG - Waiting for <ForkProcess name='DagFileProcessor677-Process' pid=14160 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:10,044] {} DEBUG - Disposing DB connection pool (PID 14167)
[2021-02-16 12:20:10,053] {} DEBUG - Waiting for <ForkProcess name='DagFileProcessor678-Process' pid=14167 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:10,610] {} DEBUG - Disposing DB connection pool (PID 14180)
[2021-02-16 12:23:42,287] {} INFO - Exiting gracefully upon receiving signal 15
[2021-02-16 12:23:43,290] {} INFO - Sending Signals.SIGTERM to GPID 9286
[2021-02-16 12:23:43,494] {} INFO - Waiting up to 5 seconds for processes to exit...
[2021-02-16 12:23:43,503] {} INFO - Process psutil.Process(pid=14180, status='terminated', started='12:20:09') (14180) terminated with exit code None
[2021-02-16 12:23:43,503] {} INFO - Process psutil.Process(pid=9286, status='terminated', exitcode=0, started='12:13:35') (9286) terminated with exit code 0
[2021-02-16 12:23:43,506] {} INFO - Sending Signals.SIGTERM to GPID 9286
[2021-02-16 12:23:43,506] {} INFO - Exited execute loop
[2021-02-16 12:23:43,523] {} DEBUG - Calling callbacks: []
[2021-02-16 12:23:43,525] {} DEBUG - Disposing DB connection pool (PID 7)
-- stoicky

2 Answers


In my case the problem was with the workers, which had DB connection issues. Fixing those solved the issue for the scheduler as well.

Note: Check the worker logs as well.

-- Tara Prasad Gurung
Source: StackOverflow


I managed to fix the restarts by setting the following configs (in the [kubernetes] section of the Airflow config):

delete_option_kwargs = {"grace_period_seconds": 10}
enable_tcp_keepalive = True
tcp_keep_idle = 30
tcp_keep_intvl = 30
tcp_keep_cnt = 30

I have another Airflow instance running on Kubernetes in AWS, and that one runs fine with any version. I realized the problem is with Azure Kubernetes, specifically the REST API calls to the API server.
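
For context on what the tcp_keep_* values mean: they look like the standard TCP keepalive knobs (idle seconds before the first probe, seconds between probes, number of unanswered probes before the connection is treated as dead). Below is a minimal stdlib sketch of applying the same values to a plain socket. It is an illustration only, assuming the Airflow options map onto these socket options. The helper name enable_tcp_keepalive is hypothetical, and the TCP_KEEP* constants are not available on every platform, so they are guarded.

```python
import socket


def enable_tcp_keepalive(sock, idle=30, interval=30, count=30):
    """Turn on TCP keepalive for sock, using the values from the config above."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants are platform dependent (all three exist on Linux),
    # so only set the ones this Python build exposes.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With keepalive on, an otherwise idle connection sends periodic probes, so a connection that has gone dead gets detected instead of hanging indefinitely.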

Just in case this helps someone else....

-- stoicky
Source: StackOverflow