Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

1/12/2022

Thank you for reading this SO question. It may seem long, but I'll try to include as much information as possible to help get an answer.

Summary

We are currently experiencing a scheduling issue with our Flink cluster.

The symptoms are that some, most, or all of our tasks (it varies from one occurrence to the next) are shown as SCHEDULED but fail after a timeout. The jobs are then shown as RUNNING.

The failing exception is the following one:

Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout

After analysis, we assume that the failure is due to a deadlock or race condition that happens when several jobs are submitted to the Flink cluster at the same time, even though we have enough slots available in the cluster. We cannot prove it, as there are not many logs for that part of the code.

We actually hit the error with 52 task slots available and 12 jobs not scheduled.
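To make these numbers concrete, here is a rough back-of-the-envelope sketch. It is plain Java arithmetic, not Flink API; the class and method names are made up for illustration. Per-job parallelism is not stated in this question, so the sketch only computes the largest per-job slot demand that the free slots could still cover for all pending jobs at once, and the total assumes every task manager listed below is registered.

```java
/** Back-of-the-envelope slot arithmetic; plain Java, no Flink API involved. */
public final class SlotCapacityCheck {

    /** Total slots in the cluster, assuming every task manager is registered. */
    static int totalSlots(int taskManagers, int slotsPerTaskManager) {
        return Math.multiplyExact(taskManagers, slotsPerTaskManager);
    }

    /** Largest per-job slot demand that freeSlots can cover for all pending jobs at once. */
    static int maxSlotsPerJob(int freeSlots, int pendingJobs) {
        if (pendingJobs <= 0) {
            throw new IllegalArgumentException("pendingJobs must be positive");
        }
        return freeSlots / pendingJobs; // integer division rounds down
    }

    public static void main(String[] args) {
        int total = totalSlots(50, 2); // 50 task managers with 2 slots each
        int free = 52;                 // slots reported as available when the error occurs
        int pending = 12;              // jobs that are not scheduled

        System.out.println("total slots: " + total);
        System.out.println("max slots per pending job: " + maxSlotsPerJob(free, pending));
    }
}
```

With the figures from this question, that is 100 slots in total and room for up to 4 slots per pending job, which is why a pure capacity shortage looks unlikely as long as the pending jobs need no more than that.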

Additional information

  • Flink version: 1.13.1 commit a7f3192
  • Flink cluster in session mode
  • 2 job managers using k8s HA mode (resource requests: 2 CPUs, 4 GB RAM; memory limit set to 4 GB)
  • 50 task managers with 2 slots each (resource requests: 2 CPUs, 2 GB RAM; no limits set)
  • Our Flink cluster is shut down every night and restarted every morning. The error seems to occur when a lot of jobs need to be scheduled. The jobs are configured to restore their state, and we do not see any issues with jobs that are scheduled and run correctly, so it really seems to be a scheduling issue.

Questions

  • Could the issue described in FLINK-23409 actually be the same one, occurring only when there is a race condition while scheduling several jobs?
  • Is there any way to increase logging in the scheduler to debug this issue?
  • Is this a known issue? If so, is there a workaround or solution?

P.S.: A while ago, I asked more or less the same question on the ML, but dropped it. I'm sorry if this is considered cross-posting; it's not intended. We are just opening a new thread because we have more information and the issue has re-occurred.

-- Gil De Grove
apache-flink
kubernetes

0 Answers