Pods stay in Pending state too long

3/19/2020

I have a cluster where jobs are created in response to what my users do. Sometimes there are 0 jobs running in parallel, and sometimes 20 to 100. I have set the following requests and limits for each container:

cpu limit: 512m
memory limit: 512Mi
cpu request: 256m
memory request: 128Mi
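
For reference, this is roughly how those values sit in the container spec. It is only a sketch: the container name and image are placeholders I made up for illustration, and the dict is printed as JSON just to show the shape of the `resources` block.

```python
import json

# Hypothetical container entry: "job-worker" and the image reference
# are placeholders, not the real names used in the cluster.
container = {
    "name": "job-worker",
    "image": "example.invalid/job-worker:1.0.0",
    "resources": {
        # Requests are what the scheduler reserves on a node.
        "requests": {"cpu": "256m", "memory": "128Mi"},
        # Limits are the ceiling enforced at runtime.
        "limits": {"cpu": "512m", "memory": "512Mi"},
    },
}

print(json.dumps(container, indent=2))
```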

By default I have 2 nodes, and each one has:

7.91 CPU allocatable
10.16 GB allocatable

The node pool can scale to 5 nodes max.

But once the cluster has 8 or more jobs running in parallel, new jobs start to sit in Pending, waiting for other jobs to finish. A job that is scheduled immediately completes in 6 to 7 seconds. But when the cluster starts to struggle, at around 8 to 10 jobs, each job takes approximately 20 seconds to complete, because it is stuck in the Pending or ContainerCreating state.

imagePullPolicy is set to IfNotPresent, and each image is tagged with a version.

Given my allocatable resources, I expected the cluster to start struggling at around 28 jobs, then create a new node, and so on. Why am I wrong? Is it possible to force each container to start without going through the Pending state? I have found an alternative scheduler, poseidon-firmament-alternate-scheduler, but I am not sure whether it can help me.
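
To make the capacity estimate concrete, here is a rough back-of-the-envelope calculation. It is a sketch under assumptions: it treats the 10.16 GB figure as decimal gigabytes, ignores whatever system pods already reserve on each node, and `pods_per_node` is a name made up for illustration. The scheduler places pods based on requests rather than limits, so both are computed.

```python
NODE_CPU = 7.91            # allocatable CPU cores per node
NODE_MEM = 10.16 * 10**9   # allocatable bytes per node (assumes decimal GB)
MI = 2**20                 # bytes in one Mi

def pods_per_node(cpu_cores, mem_bytes):
    """Pods that fit on one empty node, bounded by the scarcer resource."""
    return min(int(NODE_CPU // cpu_cores), int(NODE_MEM // mem_bytes))

by_requests = pods_per_node(0.256, 128 * MI)  # what the scheduler counts
by_limits = pods_per_node(0.512, 512 * MI)    # worst case if every pod hits its limit

for nodes in (2, 5):
    print(f"{nodes} nodes: {nodes * by_requests} pods by requests, "
          f"{nodes * by_limits} pods by limits")
```

Under these assumptions the limit-based figure for two nodes comes out at 30, in the same range as the 28 estimated above, while the request-based figure is 60. Either way it is well above the 8 to 10 jobs at which pods start to queue.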

-- Limmy
cluster-computing
google-cloud-platform
kubernetes

0 Answers