I am running Flink 1.14.2 on Ververica Platform Community Edition on Azure Kubernetes Service (AKS).
For the best possible availability, about 40 pods are set to high-availability mode. Additional pods run in normal mode.
About every 1-5 days, jobs restart because they can't find the leader of the job. After some investigation, the cause seems to be an overload of the AKS cluster, so that the pods can't retrieve or update the ConfigMaps that are used to define the leader. The cluster is already on the paid tier with guaranteed availability.
To reduce the load that the high-availability pods put on the API server, I wanted to configure the jobs with less strict update requirements for the leader election, using the following configs:
high-availability.kubernetes.leader-election.lease-duration: 33s
high-availability.kubernetes.leader-election.renew-deadline: 25s
high-availability.kubernetes.leader-election.retry-period: 8s
These settings didn't solve my problem, as the logs confirm:
com.ververica.platform.flink.ha.kubernetes.HaServicesFactory [] - HaConfig(
namespace=default,
configMapName=job-xxx-flink-ha-jobmanager,
leaderElectionConfigMapName=job-xxx-flink-ha-jobmanager-leader-election,
leaderElectionLeaseDuration=PT15S,
leaderElectionRenewDeadline=PT10S,
leaderElectionRetryPeriod=PT2S,
checkpointGcCleanUpAfter=PT15M,
httpClientConnectTimeout=Optional.empty,
httpClientReadTimeout=Optional.empty,
httpClientWriteTimeout=Optional.empty
)
The properties are picked up by Flink; however, when the high-availability service is instantiated, the new values are not used.
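To make the mismatch explicit, here is a minimal Python sketch that compares the values I configured with the ones reported in the HaConfig log entry. The helper names (`effective_seconds`, `mismatches`) are made up for illustration, and the parser assumes the durations are logged as whole seconds in the form `PT<n>S`, as in the excerpt above.

```python
import re

# Excerpt of the HaConfig log entry quoted above.
LOG = """HaConfig(
leaderElectionLeaseDuration=PT15S,
leaderElectionRenewDeadline=PT10S,
leaderElectionRetryPeriod=PT2S,
)"""

# Values passed in the job configuration, in seconds.
CONFIGURED = {
    "leaderElectionLeaseDuration": 33,
    "leaderElectionRenewDeadline": 25,
    "leaderElectionRetryPeriod": 8,
}


def effective_seconds(log_text: str) -> dict:
    """Extract leaderElection* durations (whole seconds only) from the log text."""
    pattern = re.compile(r"(leaderElection\w+)=PT(\d+)S")
    return {name: int(secs) for name, secs in pattern.findall(log_text)}


def mismatches(configured: dict, effective: dict) -> dict:
    """Map each differing setting to a (configured, effective) pair."""
    return {
        key: (value, effective.get(key))
        for key, value in configured.items()
        if effective.get(key) != value
    }


for key, (want, got) in mismatches(CONFIGURED, effective_seconds(LOG)).items():
    print(f"{key}: configured {want}s, effective {got}s")
```

All three leader-election settings show up as mismatches, so none of the configured values reached the HA service.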
I assume that the property names might differ from Flink's, since the latest version of Ververica's documentation uses a different prefix: high-availability.vvp-kubernetes.xxx
However, trying vvp-kubernetes instead of kubernetes didn't work either.
Since the Kubernetes cluster is managed by Azure, I can't scale up the API server.
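To put the intended reduction in numbers, here is a rough back-of-the-envelope estimate in Python. It rests on a simplifying assumption that I have not verified against the HA implementation: each high-availability pod sends one leader-election request to the API server per retry period. The function name is made up for illustration.

```python
def requests_per_second(ha_pods: int, retry_period_s: float) -> float:
    """Estimated leader-election requests per second across all HA pods.

    Assumption: one request per pod per retry period.
    """
    return ha_pods / retry_period_s


HA_PODS = 40  # approximate number of pods in high-availability mode

effective = requests_per_second(HA_PODS, 2)  # retry period reported in the log (PT2S)
intended = requests_per_second(HA_PODS, 8)   # retry period I configured (8s)

print(f"effective: {effective:.1f} req/s, intended: {intended:.1f} req/s")
# prints "effective: 20.0 req/s, intended: 5.0 req/s"
```

Under that assumption, the configured retry period would cut the leader-election traffic to a quarter of what it is now.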
How can I make sure that Flink uses the high-availability configuration I passed, so that the load on the API server is reduced?
Thank you for your help in advance!