Limited pods in Kubernetes (EKS) and Airflow

1/24/2020

Despite increasing the values of the variables that control Airflow's concurrency levels, I never get more than nine simultaneous pods.

I have an EKS cluster with two m4.large nodes, each with capacity for 20 pods. The whole system occupies 15 pods, so I have room for 25 more, but the count never exceeds nine. I created a scaling policy because the scheduler gets stressed when 500 DAGs are launched at the same time, but the autoscaler only adds an extra node, and all it does is redistribute the same nine pods. I have also tested with two m4.2xlarge nodes, with capacity for almost 120 pods, and the result is the same despite quadrupling the system's capacity and increasing the scheduler threads from 2 to 6.
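For context, EKS caps the number of pods per node based on the instance type's ENI limits. A sketch of the usual formula (the per-instance ENI and IP limits below are assumptions taken from AWS instance documentation, not from this thread):

```python
# EKS max-pods formula: ENIs * (IPv4 addresses per ENI - 1) + 2
def max_pods(enis, ips_per_eni):
    return enis * (ips_per_eni - 1) + 2

# m4.large: 2 ENIs, 10 IPs each -> 20 pods per node
print(max_pods(2, 10))
# m4.2xlarge: 4 ENIs, 15 IPs each -> 58 pods per node
# (so two nodes give ~116, matching "almost 120" above)
print(max_pods(4, 15))
```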

These are the environment variable values I am using.

AIRFLOW__CORE__PARALLELISM = 1000
AIRFLOW__CORE__NON_POOLED_TASK_SLOT_COUNT = 1000
AIRFLOW__CORE__DAG_CONCURRENCY = 1000
AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE = 0
AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW = -1

What could be happening?

-- Siro
airflow
concurrency
eks
kubernetes

2 Answers

2/6/2020

OK, I've found where the problem is. Kubernetes does not schedule pods well when they have no resource requests or limits. I have added requests and limits, and now the nodes fill completely, with 20 pods each.
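In Airflow 1.10's KubernetesExecutor, per-task requests and limits can be set through an operator's `executor_config`. A hedged sketch (the resource values and the task names are illustrative assumptions, not the ones used in the question):

```python
# Illustrative executor_config for Airflow 1.10's KubernetesExecutor.
# The resource values here are assumptions for the sketch.
resources = {
    "KubernetesExecutor": {
        "request_cpu": "100m",
        "request_memory": "128Mi",
        "limit_cpu": "200m",
        "limit_memory": "256Mi",
    }
}

# It would then be attached to a task, e.g. (hypothetical task):
# task = PythonOperator(task_id="hello", python_callable=say_hello,
#                       executor_config=resources, dag=dag)
print(resources["KubernetesExecutor"]["request_cpu"])
```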

Now I have another problem. The pods don't seem to disappear when they finish. Each pod only prints "Hello world", yet in dag_run there are runs that take anywhere from 49 seconds to 22 minutes. So even though more pods now fit on each node, the whole system still takes more than 20 minutes to complete, just as before.
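One setting that may be relevant here (an assumption on my part, not confirmed in the thread): the KubernetesExecutor only cleans up finished worker pods when `delete_worker_pods` is enabled in the `[kubernetes]` section, e.g. in the same env-var style as above:

```
# Assumption: completed worker pods linger unless this option is set.
ENV AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=True
```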

-- Siro
Source: StackOverflow

2/14/2020

Something is wrong. If I have two nodes that can host 100 pods, and every pod takes a minute to finish, then running five hundred pods simultaneously should complete all the work in about five minutes. But it always takes between 16 and 20 minutes. The nodes are never filled to full capacity, and the pods finish their work but take some time to be deleted. What makes it so slow?

I use Airflow 1.10.9 with this configuration:

ENV AIRFLOW__CORE__PARALLELISM=100
ENV AIRFLOW__CORE__NON_POOLED_TASK_SLOT_COUNT=100
ENV AIRFLOW__CORE__DAG_CONCURRENCY=100
ENV AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=100

ENV AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE=0
ENV AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW=-1

ENV AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE=10
ENV AIRFLOW__SCHEDULER__MAX_THREADS=6
-- Siro
Source: StackOverflow