Overloaded AKS API server because Flink on Ververica does not respect the provided high-availability configuration

2/14/2022

I am using Flink 1.14.2 on Ververica in their Community Edition on Azure Kubernetes Service.

To achieve the best possible availability, about 40 pods are set to high-availability mode; more run in normal mode.

About every 1-5 days, jobs restart because they can't find the leader of the job. After some investigation, the reason for this seems to be an overload of the AKS API server, so that the pods can't retrieve/update the ConfigMaps that are used to determine the leader. The cluster is already set to the paid tier with guaranteed availability.
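
One way to get a feel for how much the high-availability pods hit the API server is to watch the leader-election ConfigMaps. The namespace and ConfigMap name below are taken from the HaConfig log further down; adjust them to your own jobs:

# List the HA ConfigMaps that the jobs create
kubectl get configmaps -n default | grep flink-ha

# Watch a single leader-election ConfigMap; every renewal prints the object
# again with a new resourceVersion, which shows the write rate per job
kubectl get configmap job-xxx-flink-ha-jobmanager-leader-election -n default -o yaml --watch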

To reduce the load that the high-availability pods put on the API server, I wanted to configure the jobs with less aggressive leader-election update intervals, using the following settings:

high-availability.kubernetes.leader-election.lease-duration: 33s 
high-availability.kubernetes.leader-election.renew-deadline: 25s 
high-availability.kubernetes.leader-election.retry-period: 8s 

(These options are taken from the Flink documentation.)
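
A minimal sketch of how such keys can be set on a Ververica Deployment, assuming they belong under spec.template.spec.flinkConfiguration like any other Flink configuration option (only the relevant fragment is shown):

# Fragment of a Ververica Platform Deployment resource; the field path
# spec.template.spec.flinkConfiguration is assumed to accept these keys
# just like any other Flink configuration option
spec:
  template:
    spec:
      flinkConfiguration:
        high-availability.kubernetes.leader-election.lease-duration: 33s
        high-availability.kubernetes.leader-election.renew-deadline: 25s
        high-availability.kubernetes.leader-election.retry-period: 8s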

This didn't solve my problem. The logs show that the defaults are still in use:

com.ververica.platform.flink.ha.kubernetes.HaServicesFactory [] - HaConfig(
  namespace=default, 
  configMapName=job-xxx-flink-ha-jobmanager, 
  leaderElectionConfigMapName=job-xxx-flink-ha-jobmanager-leader-election, 
  leaderElectionLeaseDuration=PT15S, 
  leaderElectionRenewDeadline=PT10S, 
  leaderElectionRetryPeriod=PT2S, 
  checkpointGcCleanUpAfter=PT15M, 
  httpClientConnectTimeout=Optional.empty, 
  httpClientReadTimeout=Optional.empty, 
  httpClientWriteTimeout=Optional.empty
)

The properties are picked up by Flink; however, when the high-availability service is instantiated, the new values are not used (the log above still shows the default 15s/10s/2s intervals).

I would assume that the property names might differ from the Flink names, as the latest version of Ververica's documentation uses a different prefix: high-availability.vvp-kubernetes.xxx. However, trying vvp-kubernetes instead of kubernetes didn't work either.
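
For completeness, these are the vvp-kubernetes variants I tried; the key names are simply derived from the documented prefix, so they may not be exactly what Ververica expects:

high-availability.vvp-kubernetes.leader-election.lease-duration: 33s
high-availability.vvp-kubernetes.leader-election.renew-deadline: 25s
high-availability.vvp-kubernetes.leader-election.retry-period: 8s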

Since the Kubernetes cluster is managed by Azure, I can't scale up the API server.

How can I make sure that Flink uses the high-availability configuration I passed, so that the load on the API server is reduced?

Thank you in advance for your help!

-- Nico R
apache-flink
azure-aks
flink-streaming
high-availability
kubernetes

0 Answers