Airflow on Kubernetes: worker pod crashes immediately

2/22/2022

Hi, I'm stuck on this one.

I set up Airflow on Kubernetes.

I use AWS EKS, with AWS EFS for the persistent volume.

Airflow: 2.2.3-python3.8
Kubernetes: 1.21

airflow uid: 50000, gid: 0

I followed this blog to deploy this infrastructure.

My Dockerfile

#  Licensed to the Apache Software Foundation (ASF) under one   *
#  or more contributor license agreements.  See the NOTICE file *
#  distributed with this work for additional information        *
#  regarding copyright ownership.  The ASF licenses this file   *
#  to you under the Apache License, Version 2.0 (the            *
#  "License"); you may not use this file except in compliance   *
#  with the License.  You may obtain a copy of the License at   *
#                                                               *
#    http://www.apache.org/licenses/LICENSE-2.0                 *
#                                                               *
#  Unless required by applicable law or agreed to in writing,   *
#  software distributed under the License is distributed on an  *
#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY       *
#  KIND, either express or implied.  See the License for the    *
#  specific language governing permissions and limitations      *
#  under the License.                                           *

FROM apache/airflow:2.2.3-python3.8
RUN usermod -g 0 airflow

# install deps
USER root
RUN apt-get update -y && apt-get install -y \
    libczmq-dev \
    libssl-dev \
    inetutils-telnet \
    python3-dev \
    build-essential \
    postgresql postgresql-contrib \
    bind9utils \
    gcc \
    git \
    && apt-get clean

# vim
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
         vim \
  && apt-get autoremove -yqq --purge \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

USER airflow
RUN pip install --upgrade pip
COPY requirement.txt /tmp/requirement.txt

RUN pip install -r /tmp/requirement.txt


COPY airflow-test-env-init.sh /tmp/airflow-test-env-init.sh

COPY bootstrap.sh /bootstrap.sh

ENTRYPOINT ["/bootstrap.sh"]
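Before pushing the image to ECR, it can help to sanity-check it locally — a sketch, where `my-repo:latest` is a placeholder tag:

```shell
# Build the image from this Dockerfile.
docker build -t my-repo:latest .

# The ENTRYPOINT is /bootstrap.sh, so override it to run airflow directly.
docker run --rm --entrypoint airflow my-repo:latest version

# Confirm the airflow user's uid/gid match the pod securityContext (50000 / 0).
docker run --rm --entrypoint id my-repo:latest airflow | grep -o 'uid=[0-9]*'
```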

My Kubernetes deployment manifests


# Note: The airflow image used in this example is obtained by   *
# building the image from the local docker subdirectory.        *
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow
  namespace: airflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: airflow
  name: airflow
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
  - apiGroups: [ "" ]
    resources: [ "pods/log" ]
    verbs: [ "get", "list" ]
  - apiGroups: [ "" ]
    resources: [ "pods/exec" ]
    verbs: [ "create", "get" ]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow
  namespace: airflow
subjects:
  - kind: ServiceAccount
    name: airflow # Name of the ServiceAccount
    namespace: airflow
roleRef:
  kind: Role # This must be Role or ClusterRole
  name: airflow # This must match the name of the Role
                #   or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow
  namespace: airflow
spec:
  replicas: 1
  selector:
    matchLabels:
      name: airflow
  template:
    metadata:
      labels:
        name: airflow
    spec:
      serviceAccountName: airflow
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: lifecycle
                operator: NotIn
                values:
                - Ec2Spot
      containers:
      - name: webserver
        image: {{AIRFLOW_IMAGE}}:{{AIRFLOW_TAG}}
        imagePullPolicy: Always
        ports:
        - name: webserver
          containerPort: 8080
        args: ["webserver"]
        env:
        - name: AIRFLOW_KUBE_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              name: airflow-secrets
              key: sql_alchemy_conn
        volumeMounts:
        - name: airflow-configmap
          mountPath: /opt/airflow/airflow.cfg
          subPath: airflow.cfg
        - name: {{POD_AIRFLOW_VOLUME_NAME}}
          mountPath: /opt/airflow/dags
        - name: {{POD_AIRFLOW_VOLUME_NAME}}
          mountPath: /opt/airflow/logs
      - name: scheduler
        image: {{AIRFLOW_IMAGE}}:{{AIRFLOW_TAG}}
        imagePullPolicy: Always
        args: ["scheduler"]
        env:
        - name: AIRFLOW_KUBE_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              name: airflow-secrets
              key: sql_alchemy_conn
        volumeMounts:
        - name: airflow-configmap
          mountPath: /opt/airflow/airflow.cfg
          subPath: airflow.cfg
        - name: {{POD_AIRFLOW_VOLUME_NAME}}
          mountPath: /opt/airflow/dags
        - name: {{POD_AIRFLOW_VOLUME_NAME}}
          mountPath: /opt/airflow/logs
      - name: git-sync
        image: k8s.gcr.io/git-sync/git-sync:v3.4.0
        imagePullPolicy: IfNotPresent
        envFrom:
          - configMapRef:
              name: airflow-gitsync
          - secretRef:
              name: airflow-secrets
        volumeMounts:
          - name: {{POD_AIRFLOW_VOLUME_NAME}}
            mountPath: /git
      volumes:
      - name: {{POD_AIRFLOW_VOLUME_NAME}}
        persistentVolumeClaim:
          claimName: airflow-efs-pvc
      - name: airflow-dags-fake
        emptyDir: {}
      - name: airflow-configmap
        configMap:
          name: airflow-configmap
      securityContext:
        runAsUser: 50000
        fsGroup: 0
---
apiVersion: v1
kind: Service
metadata:
  name: airflow
  namespace: airflow
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: {{AOK_SSL_ENDPOINT}}
spec:
  type: LoadBalancer
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
      nodePort: 30031
      name: http
    - protocol: TCP
      port: 443
      targetPort: 8080
      nodePort: 30032
      name: https
  selector:
    name: airflow
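One thing visible in the `describe` output below is that the worker pod mounts only the service-account token — no DAG or log volume — and a container whose last state is `Completed` with exit code 0 but which keeps getting restarted shows exactly this CrashLoopBackOff pattern. With the KubernetesExecutor, worker pods are shaped by a `pod_template_file` set under `[kubernetes]` in airflow.cfg; a minimal sketch, assuming the claim name from the deployment above (`airflow-efs` is a hypothetical volume name):

```yaml
# pod_template.yaml — referenced from airflow.cfg:
# [kubernetes]
# pod_template_file = /opt/airflow/pod_template.yaml
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-template
spec:
  restartPolicy: Never          # a completed task container must not be restarted
  serviceAccountName: airflow
  securityContext:
    runAsUser: 50000
    fsGroup: 0
  containers:
    - name: base                # the KubernetesExecutor expects this container name
      image: {{AIRFLOW_IMAGE}}:{{AIRFLOW_TAG}}
      volumeMounts:
        - name: airflow-efs
          mountPath: /opt/airflow/dags
        - name: airflow-efs
          mountPath: /opt/airflow/logs
  volumes:
    - name: airflow-efs         # hypothetical name; claim matches the manifest above
      persistentVolumeClaim:
        claimName: airflow-efs-pvc
```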

Get logs and describe pods

NAME                                                               READY   STATUS             RESTARTS   AGE
airflow-bfd79c998-d5gjf                                            3/3     Running            0          2m14s
examplebashoperatoralsorunthis.26319976af6747c5a6b09a0b99b44bfa    0/1     CrashLoopBackOff   1          15s
examplebashoperatorrunme0.9fd08bc8182a4bb7ad3d41cbb57942ff         0/1     CrashLoopBackOff   1          17s
examplebashoperatorrunme1.20e9bd925aaf4b4eb7645ad181267a8f         0/1     CrashLoopBackOff   1          17s
examplebashoperatorrunme2.58fb15f683184e83b4e714bd0e27ccb8         0/1     CrashLoopBackOff   1          16s
examplebashoperatorthiswillskip.71370cbbaa324a21915d73f4e07dc307   0/1     CrashLoopBackOff   1          13s

kubectl logs -n airflow -f examplebashoperatoralsorunthis.26319976af6747c5a6b09a0b99b44bfa --previous
unable to retrieve container logs for docker://b81e5ea6ffa99d21b62b46500a865fbc7bfb6560683f8d8bfba4786ea02f361a

kubectl describe pod examplebashoperatoralsorunthis.26319976af6747c5a6b09a0b99b44bfa -n airflow

Name:         examplebashoperatoralsorunthis.26319976af6747c5a6b09a0b99b44bfa
Namespace:    airflow
Priority:     0
Node:         ip-xxx.xxx.xxx.xxx.my-region.compute.internal/xxx.xxx.xxx.xxx
Start Time:   Tue, 22 Feb 2022 22:22:27 +0900
Labels:       airflow-worker=144
              airflow_version=2.2.3
              dag_id=example_bash_operator
              kubernetes_executor=True
              run_id=manual__2022-02-22T132224.6817590000-81c9256fb
              task_id=also_run_this
              try_number=1
Annotations:  dag_id: example_bash_operator
              kubernetes.io/psp: eks.privileged
              run_id: manual__2022-02-22T13:22:24.681759+00:00
              task_id: also_run_this
              try_number: 1
Status:       Running
IP:           xxx.xxx.xxx.xxx
IPs:
  IP:  xxx.xxx.xxx.xxx
Containers:
  base:
    Container ID:  docker://f2e0648c4a6a585b753529964d4bc26bc5c5c061e4c74a9c9e71aab00b1505e0
    Image:         xxxxxxxxxxxx.dkr.ecr.my-region.amazonaws.com/my-repo:latest
    Image ID:      docker-pullable://xxxxxxxxxxxx.dkr.ecr.my-region.amazonaws.com/repo@xxxxxxxxxxxx
    Port:          <none>
    Host Port:     <none>
    Args:
      airflow
      tasks
      run
      example_bash_operator
      also_run_this
      manual__2022-02-22T13:22:24.681759+00:00
      --local
      --subdir
      DAGS_FOLDER/example_bash_operator.py
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 22 Feb 2022 22:23:54 +0900
      Finished:     Tue, 22 Feb 2022 22:23:54 +0900
    Ready:          False
    Restart Count:  4
    Environment:
      AIRFLOW_IS_K8S_EXECUTOR_POD:  True
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bh4kp (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-bh4kp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  2m7s                default-scheduler  Successfully assigned airflow/examplebashoperatoralsorunthis.26319976af6747c5a6b09a0b99b44bfa to ip-xxx.xxx.xxx.xx.my-region.compute.internal
  Normal   Pulled     2m5s                kubelet            Successfully pulled image "xxxxxxxxxxxx.dkr.ecr.my-region.amazonaws.com/repo:latest" in 94.764374ms
  Normal   Pulled     2m4s                kubelet            Successfully pulled image "xxxxxxxxxxxx.dkr.ecr.my-region.amazonaws.com/repo:latest" in 93.874971ms
  Normal   Pulled     108s                kubelet            Successfully pulled image "xxxxxxxxxxxx.dkr.ecr.my-region.amazonaws.com/repo:latest" in 106.66327ms
  Normal   Created    81s (x4 over 2m5s)  kubelet            Created container base
  Normal   Started    81s (x4 over 2m5s)  kubelet            Started container base
  Normal   Pulled     81s                 kubelet            Successfully pulled image "xxxxxxxxxxxx.dkr.ecr.my-region.amazonaws.com/repo:latest" in 82.336875ms
  Warning  BackOff    54s (x7 over 2m3s)  kubelet            Back-off restarting failed container
  Normal   Pulling    40s (x5 over 2m5s)  kubelet            Pulling image "xxxxxxxxxxxx.dkr.ecr.my-region.amazonaws.com/repo:latest"
  Normal   Pulled     40s                 kubelet            Successfully pulled image "xxxxxxxxxxxx.dkr.ecr.my-region.amazonaws.com/repo:latest" in 91.959453ms

Everything seems to be set up correctly (I think), but when I trigger the DAG (manually or on a schedule), the worker pod crashes immediately and doesn't write a log file to the persistent volume I set up. The AWS console only shows the CrashLoopBackOff error.
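Since the executor cleans up finished worker pods (which is also why `--previous` can't find the container), a usual first step is to keep them around long enough to read their output. A sketch, assuming the scheduler's config can be overridden via environment variables — `AIRFLOW__KUBERNETES__DELETE_WORKER_PODS` maps to `[kubernetes] delete_worker_pods`:

```shell
# Tell the scheduler to stop deleting finished worker pods.
kubectl set env deployment/airflow -n airflow -c scheduler \
  AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=False

# After triggering the DAG again, grab the first worker pod and read its output.
pod=$(kubectl get pods -n airflow -o name | grep examplebashoperator | head -n1)
kubectl logs -n airflow "$pod" --previous
```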


I need help... can somebody please help me out of this?

-- jihwan Kim
airflow
amazon-eks
amazon-web-services
kubernetes

0 Answers