Cloud Composer cluster schedules worker pods on the same node

9/2/2019

Environment

I am running a Cloud Composer cluster (composer-1.6.0-airflow-1.10.1) with 3 nodes, using the default GKE YAML files provided when the Composer environment is created. We have 3 Celery worker pods running 4 worker threads each (celery.dag_concurrency).
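
For anyone reproducing this setup, Airflow config overrides on Composer can be applied with gcloud; a minimal sketch is below. The environment name and location are placeholders, and the key mirrors the celery.dag_concurrency setting mentioned above in gcloud's section-key format.

    # Sketch: override an Airflow config value on an existing Composer environment.
    # "my-composer-env" and "us-central1" are placeholders; gcloud expects the key
    # as SECTION-KEY, so celery.dag_concurrency becomes celery-dag_concurrency.
    gcloud composer environments update my-composer-env \
        --location us-central1 \
        --update-airflow-configs=celery-dag_concurrency=4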

The problem

I have noticed that two Celery worker pods are scheduled on the same cluster node (let's say node A), while the third pod is on node B. Node C has some supporting pods running, but its CPU and memory utilisation is marginal.

Previously, we used 10 worker threads per worker, which led to all three worker pods being scheduled on the same node, causing pods to be evicted every few minutes because the node ran out of memory.

I would expect each pod to be scheduled on a separate node for the best resource utilisation.

GKE Master version - 1.11.10-gke.5

Total size - 3 nodes
Node spec:
 Image type - Container-Optimised OS (cos)
 Machine type - n1-standard-1
 Boot disk type - Standard persistent disk
 Boot disk size (per node) - 100 GB
 Pre-emptible nodes - Disabled

Workaround

By default, Cloud Composer doesn't specify a memory request for worker pods. Setting a memory request high enough that two worker pods cannot fit on the same node more or less fixes the problem. In my case I set the requested memory to 1.5Gi.
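
A minimal sketch of applying such a request to the worker Deployment with kubectl is below. The namespace and Deployment name are assumptions (Composer creates its own namespace for the Airflow workloads), and Composer may overwrite manual changes on environment updates.

    # Sketch: set a memory request on the Airflow worker Deployment so that two
    # worker pods no longer fit on one n1-standard-1 node (~3.75 GB memory).
    # <composer-namespace> is a placeholder; list namespaces with:
    #   kubectl get namespaces
    kubectl -n <composer-namespace> set resources deployment airflow-worker \
        --requests=memory=1.5Gi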

-- Pawel
google-cloud-composer
google-kubernetes-engine
kubernetes

1 Answer

9/2/2019

Cloud Composer's worker pods use pod anti-affinity to avoid being co-scheduled, but it is not always effective. For example, it is still possible for multiple pods to be scheduled on the same node when other nodes are not yet available (such as when the cluster is coming back online after a GKE version upgrade, an Airflow upgrade, etc.).
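
If you want to confirm what the scheduler is working with, you can inspect the worker Deployment for that anti-affinity stanza; a sketch is below. The namespace and Deployment name are assumptions based on a typical Composer environment.

    # Sketch: look for the podAntiAffinity rule on the Airflow worker Deployment.
    # <composer-namespace> is a placeholder for the namespace Composer created.
    kubectl -n <composer-namespace> get deployment airflow-worker -o yaml \
        | grep -A 12 podAntiAffinity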

In these cases, the solution is to delete the Airflow workers using the GKE workloads interface, which leads to them being re-created and eventually rebalanced. Similarly, the evictions you've observed are somewhat disruptive, but they also serve to eventually balance the workers across the nodes.
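
A sketch of doing the same from the command line instead of the GKE workloads UI is below; the label selector is an assumption, so check the actual pod labels first.

    # Sketch: delete the worker pods so the Deployment re-creates them and the
    # scheduler gets another chance to spread them across nodes.
    # Verify the real labels before relying on the selector below (it is a guess):
    #   kubectl -n <composer-namespace> get pods --show-labels
    kubectl -n <composer-namespace> delete pods -l run=airflow-worker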

This is admittedly somewhat inconvenient, so it is being tracked as a feature request in the public issue tracker under issue #136548942. I would recommend following along there.

-- hexacyanide
Source: StackOverflow