Airflow is receiving incorrect POD status from Kubernetes

4/13/2021

We are using Airflow to schedule Spark jobs on Kubernetes. Recently, I encountered a scenario where:

  • Airflow received an HTTP 404 error with the message "pods pod-name not found"
  • I manually checked that the POD was actually running fine at that time. In fact, I was able to collect its logs using kubectl logs -f -n namespace podname

As a result, Airflow created another POD to run the same job, which led to a race condition.

Airflow is using the Kubernetes Python client's read_namespaced_pod() API:

from requests.exceptions import BaseHTTPError
from airflow.exceptions import AirflowException

def read_pod(self, pod):
    """Read POD information"""
    try:
        return self._client.read_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
    except BaseHTTPError as e:
        raise AirflowException(
            'There was an error reading the kubernetes API: {}'.format(e)
        )
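To reduce the impact of transient errors, I am considering retrying the read a few times before giving up. Below is a minimal sketch of that idea (not Airflow's actual code): the ApiException class is a hypothetical stand-in for kubernetes.client.rest.ApiException, and FlakyClient simulates an API server that returns 404 once before succeeding.

```python
import time


class ApiException(Exception):
    """Hypothetical stand-in for kubernetes.client.rest.ApiException."""
    def __init__(self, status):
        super().__init__("({})".format(status))
        self.status = status


class FlakyClient:
    """Simulates an API server that returns 404 once, then succeeds."""
    def __init__(self):
        self.calls = 0

    def read_namespaced_pod(self, name, namespace):
        self.calls += 1
        if self.calls == 1:
            raise ApiException(status=404)
        return {"metadata": {"name": name, "namespace": namespace}}


def read_pod_with_retry(client, name, namespace, attempts=3, delay=0.0):
    """Retry transient read failures instead of failing on the first 404."""
    for attempt in range(1, attempts + 1):
        try:
            return client.read_namespaced_pod(name, namespace)
        except ApiException as e:
            # Only retry 404s, and only while attempts remain.
            if e.status == 404 and attempt < attempts:
                time.sleep(delay)  # back off before re-checking
                continue
            raise


client = FlakyClient()
pod = read_pod_with_retry(client, "pod-name", "namespace")
print(pod["metadata"]["name"])  # pod-name (succeeds on the second call)
```

This would mask a momentarily stale 404 from the API server, at the cost of delaying detection of a genuinely deleted pod.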

I believe read_namespaced_pod() calls the Kubernetes API. In order to investigate this further, I would like to check the logs of the Kubernetes API server.

Can you please share steps to check what is happening on the Kubernetes side?

Note: Kubernetes version is 1.18 and Airflow version is 1.10.10.

-- Harjindersingh Mistry
airflow
kubernetes

1 Answer

4/13/2021

Answering the question from the perspective of logs/troubleshooting:

I believe read_namespaced_pod() calls the Kubernetes API. In order to investigate this further, I would like to check the logs of the Kubernetes API server.

Yes, you are correct, this function calls the Kubernetes API. You can check the logs of the Kubernetes API server by running:

  • $ kubectl logs -n kube-system KUBERNETES_API_SERVER_POD_NAME

I would also consider checking the kube-controller-manager:

  • $ kubectl logs -n kube-system KUBERNETES_CONTROLLER_MANAGER_POD_NAME

Example output:

I0413 12:33:12.840270       1 event.go:291] "Event occurred" object="default/nginx-6799fc88d8" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: nginx-6799fc88d8-kchp7"

A side note!

The above commands will work assuming that your kube-apiserver and kube-controller-manager Pods are visible to you. On managed Kubernetes offerings the control plane is typically not exposed, so you may not be able to access these logs directly.


Can you please share steps to check what is happening on the Kubernetes side?

This question targets the basics of troubleshooting and log checking.

For that you can use the following commands (in addition to the ones mentioned earlier):

  • $ kubectl get RESOURCE RESOURCE_NAME:
    • example: $ kubectl get pod airflow-pod-name
    • you can also add -o yaml for more information

  • $ kubectl describe RESOURCE RESOURCE_NAME:
    • example: $ kubectl describe pod airflow-pod-name
  • $ kubectl logs POD_NAME:
    • example: $ kubectl logs airflow-pod-name
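If you prefer to inspect the same fields programmatically rather than scanning the -o yaml output by hand, a minimal sketch in Python is shown below. The summarize_pod_status helper is hypothetical; the pod dict mirrors the shape of the API's output (status.phase and the Ready condition), with sample data rather than output from a live cluster.

```python
def summarize_pod_status(pod):
    """Summarize the fields you would otherwise scan in `kubectl get pod -o yaml`."""
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")
    # A pod is considered Ready when its "Ready" condition has status "True".
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    return {"phase": phase, "ready": ready}


# Example pod dict shaped like the Kubernetes API's response (sample data).
pod = {
    "metadata": {"name": "airflow-pod-name"},
    "status": {
        "phase": "Running",
        "conditions": [{"type": "Ready", "status": "True"}],
    },
}
print(summarize_pod_status(pod))  # {'phase': 'Running', 'ready': True}
```

The same logic applies if you fetch the pod via the Kubernetes Python client and convert it with .to_dict().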


-- Dawid Kruk
Source: StackOverflow