Flink HA JobManager cluster cannot elect a leader

9/8/2018

I'm trying to deploy Apache Flink 1.6 on Kubernetes, following the tutorial on the JobManager high availability page. I already have a working Zookeeper 3.10 cluster; from its logs I can see that it is healthy and is not configured for Kerberos or SASL. All ACL rules let every client read and write znodes. When I start the cluster everything works as expected: every JobManager and TaskManager pod reaches the Running state, and I can see the connected TaskManager instances in the master JobManager's web UI. But when I delete the master JobManager's pod, the other JobManager pods cannot elect a leader, and every JobManager UI in the cluster shows the following error message:

{
  "errors": [
    "Service temporarily unavailable due to an ongoing leader election. Please refresh."
  ]
}

Refreshing the page changes nothing; it stays stuck on this error message. My suspicion is that the problem is related to the high-availability.storageDir option. I already have a working Minio S3 deployment (tested with CloudExplorer) in my k8s cluster, but Flink cannot write anything to the S3 server. Every config is available in a GitHub gist.

-- rfum
apache-flink
kubernetes

1 Answer

9/9/2018

According to the logs, it looks as if the TaskManager cannot connect to the new leader. I assume the same is true for the web UI. The logs say that it tries to connect to flink-job-manager-0.flink-job-svc.flink.svc.cluster.local/10.244.3.166:44013. I cannot tell from the logs whether flink-job-manager-1 binds to this IP, but my suspicion is that the headless service might return multiple IPs and Flink picks the wrong or old one. Could you log into the flink-job-manager-1 pod and check what its IP address is?
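One way to test the multiple-IP suspicion is to list every address a name resolves to from inside a pod. The following is only a minimal sketch, assuming Python is available in the pod; `resolve_all` is a hypothetical helper name, and the `resolver` parameter exists only so the lookup can be swapped out when testing.

```python
import socket


def resolve_all(hostname, port=None, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses a hostname resolves to."""
    # Restricting to TCP avoids one entry per socket type for the same address.
    infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_all("flink-job-svc.flink.svc.cluster.local")` from a pod (the service name is inferred from the hostname in the logs) and getting more than one address back would be consistent with the headless service returning several IPs. Comparing the result for flink-job-manager-0.flink-job-svc.flink.svc.cluster.local against the actual pod IP would show whether a stale address is being returned.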

I think you should be able to resolve this problem by defining a dedicated service for each JobManager, or by using the pod hostname instead.
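For the pod-hostname option, a small sketch of how the per-pod DNS names are formed may help. It assumes the JobManagers run as a StatefulSet behind the headless service (the pod names in the logs suggest this, but it is an assumption) and that the cluster uses the default `cluster.local` domain; `pod_fqdn` is a hypothetical helper, not part of Flink or Kubernetes.

```python
def pod_fqdn(statefulset, ordinal, service, namespace, cluster_domain="cluster.local"):
    """Build the stable DNS name of one StatefulSet pod behind a headless service."""
    # StatefulSet pods are named <statefulset>-<ordinal> and are reachable at
    # <pod>.<service>.<namespace>.svc.<cluster-domain>.
    pod_name = f"{statefulset}-{ordinal}"
    return f"{pod_name}.{service}.{namespace}.svc.{cluster_domain}"


# Names for a two-replica JobManager StatefulSet, using the names from the logs.
jobmanager_hosts = [
    pod_fqdn("flink-job-manager", i, "flink-job-svc", "flink") for i in range(2)
]
```

Each JobManager would then advertise its own entry from `jobmanager_hosts` rather than a name that can resolve to several pods.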

-- Till Rohrmann
Source: StackOverflow