Issue
We want to use Airflow to trigger a lambda that runs for more than 10 minutes (AWS lambdas can run for up to 15 minutes). If I run the Airflow setup in a local minikube (VirtualBox-based) cluster, Airflow is able to trigger a lambda that runs for 12 minutes and receives a response:
[2020-04-30 09:34:55,716] {taskinstance.py:900} INFO - Executing <Task(PythonOperator): long-running-lambda-12> on 2020-04-29T09:22:44.302263+00:00
[2020-04-30 09:34:55,718] {standard_task_runner.py:53} INFO - Started process 211 to run task
[2020-04-30 09:34:55,791] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: long_running_lambda.long-running-lambda-12 2020-04-29T09:22:44.302263+00:00 [running]> a1-airflow-worker-0.a1-airflow-headless.default.svc.cluster.local
[2020-04-30 09:34:55,825] {logging_mixin.py:112} INFO - [2020-04-30 09:34:55,825] {awslambda_utils.py:19} INFO - Executing lambda long-running with payload
{"runtime_minutes": 12}
[2020-04-30 09:46:56,773] {logging_mixin.py:112} INFO - [2020-04-30 09:46:56,773] {awslambda_utils.py:26} INFO - LAMBDA RESULTS: 200.
[2020-04-30 09:46:56,773] {logging_mixin.py:112} INFO - [2020-04-30 09:46:56,773] {awslambda_utils.py:27} INFO - LAST LINES OF LAMBDA LOG: START RequestId: e9bb5b0c-5a15-457d-9ee0-4ef48bb15c71 Version: $LATEST
I slept for 60 seconds
I slept for 60 seconds
I slept for 60 seconds
I slept for 60 seconds
I slept for 60 seconds
I slept for 60 seconds
I slept for 60 seconds
I slept for 60 seconds
I slept for 60 seconds
I slept for 60 seconds
I slept for 60 seconds
I slept for 60 seconds
completed successfully execution
END RequestId: e9bb5b0c-5a15-457d-9ee0-4ef48bb15c71
REPORT RequestId: e9bb5b0c-5a15-457d-9ee0-4ef48bb15c71 Duration: 720425.31 ms Billed Duration: 720500 ms Memory Size: 128 MB Max Memory Used: 48 MB
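For context, the task is a plain PythonOperator calling a small helper (the awslambda_utils.py in the logs above) that invokes the function synchronously via boto3. A minimal sketch of that helper, reconstructed from the log lines; the function and parameter names are assumptions:

import base64
import json
import logging

import boto3

log = logging.getLogger(__name__)

def invoke_lambda(function_name, payload):
    # Reconstructed from the awslambda_utils.py log lines above; details assumed.
    client = boto3.client("lambda")
    log.info("Executing lambda %s with payload\n%s", function_name, json.dumps(payload))
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # synchronous: blocks until the lambda finishes
        LogType="Tail",                    # return the last 4 KB of the lambda log, base64-encoded
        Payload=json.dumps(payload),
    )
    log.info("LAMBDA RESULTS: %s.", response["StatusCode"])
    log.info("LAST LINES OF LAMBDA LOG: %s",
             base64.b64decode(response["LogResult"]).decode())
    return response

In the DAG this is wired up roughly as (task id and payload taken from the logs above):

from airflow.operators.python_operator import PythonOperator

run_lambda = PythonOperator(
    task_id="long-running-lambda-12",
    python_callable=invoke_lambda,
    op_kwargs={"function_name": "long-running",
               "payload": {"runtime_minutes": 12}},
)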
If, however, I run the same setup on EKS Fargate, I receive the following timeout error after an hour or so. I can run shorter lambdas without any issues. The read timeout is set to 900 seconds.
...
socket.timeout: The read operation timed out
...
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 423, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 331, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: AWSHTTPSConnectionPool(host='lambda.eu-west-1.amazonaws.com', port=443): Read timed out. (read timeout=900)
...
File "/opt/bitnami/airflow/venv/lib/python3.6/site-packages/botocore/httpsession.py", line 289, in send
raise ReadTimeoutError(endpoint_url=request.url, error=e)
...
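The 900-second read timeout mentioned above is set on the boto3 Lambda client; a sketch of how the client in the helper above would be configured, assuming botocore's Config object is used (values other than read_timeout are assumptions):

import boto3
from botocore.config import Config

# read_timeout=900 matches the "read timeout=900" in the traceback and the
# 15-minute Lambda hard limit; retries are disabled so a timed-out call is
# not transparently re-invoked by botocore.
lambda_config = Config(
    connect_timeout=30,
    read_timeout=900,
    retries={"max_attempts": 0},
)
client = boto3.client("lambda", config=lambda_config)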
The same issue occurred on the Astronomer platform as well, which runs on Google Cloud Platform Kubernetes, when running with the Celery executor.
Hypothesis
As this issue did not occur on the local MacBook, it is unlikely that the issue resides in the software components running on top of Kubernetes (Airflow, Celery). Given that the same issue occurred on the Astronomer platform, which runs on GCP, the component that is common to the GCP and AWS setups is Kubernetes.
Debug Todo