Airflow with Kubernetes - Airflow spawning Kube pods but pods aren't doing anything

2/7/2020

This is my first time setting up Airflow and, for the most part, working with K8s, so I'm just trying to get it running locally and run a couple of simple tasks in a small DAG. Things ran fine with the other executors, but since I'd like to use K8s functionality once we are in production, I'm trying to get the Kubernetes setup working locally first.

The setup is pretty simple: a generic testing DAG that ran fine with the other executors, and a relatively untouched Airflow config file. The main things to note are that it uses the KubernetesExecutor, a postgresql+psycopg2 SQLAlchemy backend, and in_cluster set to False, since we aren't running Airflow itself in K8s. Everything else is standard.

Airflow launches the local webserver just fine, along with the scheduler, and starts scheduling tasks when I initiate a DAG run, but the tasks are thrown into a queued state and never move from it. I am guessing it has something to do with the pod statuses that I am seeing for the tasks:

NAME                                                        READY   STATUS             RESTARTS   AGE
testinglocalprintingdate-00b9b3a324b04913bf98d935ae076885   0/1     InvalidImageName   0          79s
testinglocalprintingdate-2d4a912ac30c4987af69d9ce62e36989   0/1     InvalidImageName   0          81s
testinglocalprintingdate-5a655060809647c69f4258fc32d9513d   0/1     InvalidImageName   0          77s
testinglocalprintingdate-9c3ccfebb34b4d0a84d6e8f43e144e69   0/1     InvalidImageName   0          75s
testinglocalprintingdate-d1b8d59260954638b0bc018b7743985b   0/1     InvalidImageName   0          73s
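
For context, InvalidImageName means the image string in the pod spec could not be parsed as an image reference, so the container never starts. Below is a minimal sketch of one thing worth checking. It assumes the executor composes the worker image as repository:tag from the worker_container_repository and worker_container_tag options in the [kubernetes] section of airflow.cfg (an assumption about Airflow 1.10.x, not something confirmed in this post). The config excerpt and the worker_image helper are hypothetical.

```python
import configparser

# Hypothetical excerpt of airflow.cfg; these values are illustrative,
# not copied from the config used in this post.
SAMPLE_CFG = """
[kubernetes]
worker_container_repository =
worker_container_tag =
in_cluster = False
"""


def worker_image(cfg_text):
    """Return (image, missing) for the [kubernetes] section of a config text.

    ASSUMPTION: the image is built as "<repository>:<tag>" from the two
    options below. `missing` lists the options that are empty or absent.
    """
    # Interpolation is disabled because airflow.cfg may contain raw '%'.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    repo = parser.get("kubernetes", "worker_container_repository", fallback="").strip()
    tag = parser.get("kubernetes", "worker_container_tag", fallback="").strip()
    missing = [
        name
        for name, value in (
            ("worker_container_repository", repo),
            ("worker_container_tag", tag),
        )
        if not value
    ]
    return "{}:{}".format(repo, tag), missing


if __name__ == "__main__":
    image, missing = worker_image(SAMPLE_CFG)
    print("image string:", repr(image))
    print("empty options:", missing)
```

If both options are empty, the composed string is just a colon, which is not a usable image name. Whether that is what happened here would need to be confirmed against the actual config and the output of kubectl describe pod for one of the pods above.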

In addition, I am seeing these errors every minute or so. They seem to be linked to the setting kube_client_request_args = {"_request_timeout" : [60,60] } in the Airflow config, although changing the values from 60,60 to anything else has no effect:

[2020-02-07 17:22:32,244] {kubernetes_executor.py:337} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
    self._update_chunk_length()
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 335, in run
    self.worker_uuid, self.kube_config)
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 359, in _run
    **kwargs):
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
    self._original_response.close()
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='192.168.64.2', port=8443): Read timed out.
Process KubernetesJobWatcher-3:
Traceback (most recent call last):
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/urllib3/response.py", line 752, in read_chunked
    self._update_chunk_length()
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/urllib3/response.py", line 682, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 335, in run
    self.worker_uuid, self.kube_config)
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 359, in _run
    **kwargs):
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
    self._original_response.close()
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/genericuser/.pyenv/versions/3.7.4/lib/python3.7/site-packages/urllib3/response.py", line 430, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='192.168.64.2', port=8443): Read timed out.
[2020-02-07 17:22:32,597] {kubernetes_executor.py:442} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2020-02-07 17:22:32,615] {kubernetes_executor.py:346} INFO - Event: and now my watch begins starting at resource_version: 0
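
The read timeout in the trace appears to line up with the kube_client_request_args value quoted above. Below is a minimal sketch of how that value is likely interpreted. It assumes the executor parses the option as JSON and forwards _request_timeout to the Kubernetes Python client, where a two-element value is documented as (connection, read) timeouts in seconds. parse_request_args is a hypothetical helper written for illustration, not Airflow's own code.

```python
import json


def parse_request_args(raw):
    """Parse a kube_client_request_args-style JSON string into a dict.

    ASSUMPTION: a JSON list under "_request_timeout" is turned into a
    (connect_timeout, read_timeout) tuple before reaching the client.
    """
    args = json.loads(raw) if raw else {}
    timeout = args.get("_request_timeout")
    if isinstance(timeout, list):
        # JSON has no tuple type, so the pair arrives as a list.
        args["_request_timeout"] = tuple(timeout)
    return args


if __name__ == "__main__":
    # Value as quoted from the Airflow config in this post.
    args = parse_request_args('{"_request_timeout" : [60,60] }')
    connect_timeout, read_timeout = args["_request_timeout"]
    print("connect timeout (s):", connect_timeout)
    print("read timeout (s):", read_timeout)
```

Under that assumption, a watch stream that receives no pod events for the length of the read timeout would end in a ReadTimeoutError, which would be consistent with the error recurring about once a minute. It does not explain why changing the values had no effect.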

I've been trying to debug this for a couple of days to no avail, so any help would be appreciated.

-- chevchelios
airflow
airflow-scheduler
kubernetes
python
