Airflow on EKS: connection timeout when triggering an AWS Lambda

4/30/2020

Issue

We want to use Airflow to trigger a Lambda that runs for more than 10 minutes (AWS Lambda functions can run for up to 15 minutes). When I run the Airflow setup in a local minikube cluster (VirtualBox-based), Airflow triggers a Lambda that runs for 12 minutes and receives the response:

    [2020-04-30 09:34:55,716] {taskinstance.py:900} INFO - Executing <Task(PythonOperator): long-running-lambda-12> on 2020-04-29T09:22:44.302263+00:00
    [2020-04-30 09:34:55,718] {standard_task_runner.py:53} INFO - Started process 211 to run task
    [2020-04-30 09:34:55,791] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: long_running_lambda.long-running-lambda-12 2020-04-29T09:22:44.302263+00:00 [running]> a1-airflow-worker-0.a1-airflow-headless.default.svc.cluster.local
    [2020-04-30 09:34:55,825] {logging_mixin.py:112} INFO - [2020-04-30 09:34:55,825] {awslambda_utils.py:19} INFO - Executing lambda long-running with payload 
     {"runtime_minutes": 12}
    [2020-04-30 09:46:56,773] {logging_mixin.py:112} INFO - [2020-04-30 09:46:56,773] {awslambda_utils.py:26} INFO - LAMBDA RESULTS: 200.
    [2020-04-30 09:46:56,773] {logging_mixin.py:112} INFO - [2020-04-30 09:46:56,773] {awslambda_utils.py:27} INFO - LAST LINES OF LAMBDA LOG: START RequestId: e9bb5b0c-5a15-457d-9ee0-4ef48bb15c71 Version: $LATEST
    I slept for 60 seconds
    I slept for 60 seconds
    I slept for 60 seconds
    I slept for 60 seconds
    I slept for 60 seconds
    I slept for 60 seconds
    I slept for 60 seconds
    I slept for 60 seconds
    I slept for 60 seconds
    I slept for 60 seconds
    I slept for 60 seconds
    I slept for 60 seconds
    completed successfully execution
    END RequestId: e9bb5b0c-5a15-457d-9ee0-4ef48bb15c71
    REPORT RequestId: e9bb5b0c-5a15-457d-9ee0-4ef48bb15c71  Duration: 720425.31 ms  Billed Duration: 720500 ms  Memory Size: 128 MB Max Memory Used: 48 MB

However, when I run the same setup on EKS Fargate, I receive the following error after an hour or so. Shorter Lambdas run without any issues. The read timeout is set to 900 seconds.

    ...
    socket.timeout: The read operation timed out
    ...
      File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 423, in _make_request
        self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
      File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 331, in _raise_timeout
        self, url, "Read timed out. (read timeout=%s)" % timeout_value
    urllib3.exceptions.ReadTimeoutError: AWSHTTPSConnectionPool(host='lambda.eu-west-1.amazonaws.com', port=443): Read timed out. (read timeout=900)
    ...
      File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/botocore/httpsession.py", line 289, in send
        raise ReadTimeoutError(endpoint_url=request.url, error=e)
    ...
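
The "hour or so" delay may come from automatic retries rather than from one long wait. The sketch below is only an illustration under stated assumptions: it assumes botocore's default (legacy) retry mode makes up to 5 attempts and retries read timeouts, in which case five 900-second timeouts add up to 75 minutes and each retry would invoke the Lambda again. The helper names `estimate_worst_case_wait` and `invoke_once` are hypothetical, the 10-second connect timeout is an arbitrary choice, and the region is taken from the endpoint in the traceback.

```python
import json

# 900 s matches the read timeout reported in the traceback.
READ_TIMEOUT_SECONDS = 900


def estimate_worst_case_wait(read_timeout, total_attempts):
    """Upper bound (seconds) on blocking time if every attempt hits the read timeout."""
    return read_timeout * total_attempts


def invoke_once(function_name, payload, region="eu-west-1"):
    """Invoke a Lambda synchronously with a single attempt (no automatic retries)."""
    # Imported lazily so the timing helper above works without boto3 installed.
    import boto3
    from botocore.config import Config

    config = Config(
        read_timeout=READ_TIMEOUT_SECONDS,
        connect_timeout=10,  # assumption: arbitrary illustrative value
        retries={"max_attempts": 0},  # no retries: fail after the first timeout
    )
    client = boto3.client("lambda", region_name=region, config=config)
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return response["StatusCode"]


if __name__ == "__main__":
    # Assumed default of 5 total attempts in botocore's legacy retry mode.
    print(estimate_worst_case_wait(READ_TIMEOUT_SECONDS, 5) / 60, "minutes")
```

With retries disabled, a call such as `invoke_once("long-running", {"runtime_minutes": 12})` should fail after roughly 15 minutes instead of an hour, which shortens each debugging iteration and avoids invoking the Lambda repeatedly.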

The same issue also occurred on the Astronomer platform, which runs on Kubernetes on Google Cloud Platform, when running with Celery executors.

Hypothesis.

Because this issue did not occur on my local MacBook, it is unlikely to reside in the software components running on top of Kubernetes (Airflow, Celery). It did occur on both EKS and the Astronomer platform on GCP, and the component that should be similar in GCP and AWS is Kubernetes.

Debug Todo.

  1. Try setting up Airflow on a private EC2 instance in the same subnet. This will tell me whether the issue is in the NAT or in the Kubernetes VPC.
  2. Check the flow logs.
  3. Search for Kubernetes idle-connection issues.
  4. Try Airflow with the Kubernetes executor.
  5. Try triggering the Lambda with a bash command.
  6. I need to install additional packages. DockerOperator?
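
Item 3 could be explored with a sketch like the following. It assumes, without confirmation from the logs above, that the connection is dropped by an idle timeout somewhere on the network path while the Lambda runs silently (AWS documents a 350-second idle timeout for NAT gateways, for example). The helper name `enable_keepalive` and the 60/30/5 values are hypothetical choices, and the fine-grained options are guarded because they are not defined on every platform. Newer botocore versions also expose a `tcp_keepalive` option on `Config`, which may be the simpler route if the installed version supports it.

```python
import socket


def enable_keepalive(sock, idle=60, interval=30, probes=5):
    """Turn on TCP keepalive so an otherwise idle connection still sends traffic."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are platform-specific, so set them only if present.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first keepalive probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between subsequent probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive enabled:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

If the timeout disappears once keepalive probes are sent more often than the suspected idle limit, that would point at the network path rather than at Airflow or Celery.
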
-- Andres Namm
airflow
amazon-web-services
aws-lambda
kubernetes
network-programming

0 Answers