Kubernetes Kafka to Zookeeper - "ZooKeeperClientTimeoutException" error


For context, I am bringing up Kafka and Zookeeper locally on an Ubuntu machine using Kubernetes, through Helm:

  - name: kafka
    version: 12.7.3
    repository: https://charts.bitnami.com/bitnami

I've looked at existing questions about this error, but none seem to match my issue exactly. In those questions, the problem seems to involve Docker networks or connectivity. On my local setup, however, I can see that Kafka can reach Zookeeper and complete a TCP connection. I captured the following tshark logs, where .83 is Kafka and .80 is Zookeeper:

   57 118.532604170 TCP 74 44978 → 2181 [SYN] Seq=0 Win=64800 Len=0 MSS=1440 SACK_PERM=1 TSval=3500466016 TSecr=0 WS=128
   58 118.532617080 TCP 74 2181 → 44978 [SYN, ACK] Seq=0 Ack=1 Win=64260 Len=0 MSS=1440 SACK_PERM=1 TSval=1996498322 TSecr=3500466016 WS=128
   59 118.532633329 TCP 66 44978 → 2181 [ACK] Seq=1 Ack=1 Win=64896 Len=0 TSval=3500466016 TSecr=1996498322
   60 118.535617526 TCP 115 44978 → 2181 [PSH, ACK] Seq=1 Ack=1 Win=64896 Len=49 TSval=3500466019 TSecr=1996498322
   61 118.535644624 TCP 66 2181 → 44978 [ACK] Seq=1 Ack=50 Win=64256 Len=0 TSval=1996498325 TSecr=3500466019
   62 118.537006985 TCP 107 2181 → 44978 [PSH, ACK] Seq=1 Ack=50 Win=64256 Len=41 TSval=1996498326 TSecr=3500466019
   63 118.537047974 TCP 66 44978 → 2181 [ACK] Seq=50 Ack=42 Win=64896 Len=0 TSval=3500466020 TSecr=1996498326
   64 118.540259005 TCP 78 44978 → 2181 [PSH, ACK] Seq=50 Ack=42 Win=64896 Len=12 TSval=3500466024 TSecr=1996498326
   65 118.540263332 TCP 66 2181 → 44978 [ACK] Seq=42 Ack=62 Win=64256 Len=0 TSval=1996498330 TSecr=3500466024
   66 118.541564514 SMPP 86 Bind_receiver[Malformed Packet]
   67 118.541607278 TCP 66 44978 → 2181 [ACK] Seq=62 Ack=62 Win=64896 Len=0 TSval=3500466025 TSecr=1996498331
   68 118.541999795 TCP 66 2181 → 44978 [FIN, ACK] Seq=62 Ack=62 Win=64256 Len=0 TSval=1996498331 TSecr=3500466025
   69 118.542214437 TCP 66 44978 → 2181 [FIN, ACK] Seq=62 Ack=63 Win=64896 Len=0 TSval=3500466026 TSecr=1996498331

Despite this, I still see the following errors in the Kafka logs:

[2021-01-29 19:17:49,922] INFO Session: 0x1000031e07c0011 closed (org.apache.zookeeper.ZooKeeper)
[2021-01-29 19:17:49,922] INFO EventThread shut down for session: 0x1000031e07c0011 (org.apache.zookeeper.ClientCnxn)
[2021-01-29 19:17:49,925] INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient)
[2021-01-29 19:17:49,928] ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
	at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:262)
	at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
	at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:119)
	at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1881)
	at kafka.server.KafkaServer.createZkClient$1(KafkaServer.scala:441)
	at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:466)
	at kafka.server.KafkaServer.startup(KafkaServer.scala:233)
	at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44)
	at kafka.Kafka$.main(Kafka.scala:82)
	at kafka.Kafka.main(Kafka.scala)
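
The exception is raised when the client does not reach the connected state before the connection timeout (the Kafka broker setting `zookeeper.connection.timeout.ms`). A completed TCP handshake does not by itself show that a ZooKeeper session can be established, so it can help to test the session handshake directly. The following is a minimal sketch in Python, assuming the standard ZooKeeper wire format for a connect request; the function names are hypothetical and this is not how Kafka itself connects. Its 49-byte request, and the 41 bytes a reply with the same layout would occupy, happen to match the payload lengths of frames 60 and 62 in the capture, which suggests (but does not prove) that those frames are a session handshake.

```python
import socket
import struct


def build_connect_request(timeout_ms=6000):
    """Build a length-prefixed ZooKeeper ConnectRequest for a brand-new session."""
    # Fields: protocolVersion (int), lastZxidSeen (long), timeOut (int), sessionId (long)
    body = struct.pack(">iqiq", 0, 0, timeout_ms, 0)
    # passwd: length-prefixed buffer, 16 zero bytes for a new session
    body += struct.pack(">i", 16) + bytes(16)
    # readOnly flag sent by newer clients
    body += struct.pack(">?", False)
    return struct.pack(">i", len(body)) + body


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf += chunk
    return buf


def zk_handshake(host, port=2181, timeout_ms=6000, socket_timeout=5.0):
    """Open a TCP connection, send a ConnectRequest, and parse the reply.

    The session is not closed explicitly; the server expires it after the
    negotiated timeout.
    """
    with socket.create_connection((host, port), timeout=socket_timeout) as sock:
        sock.sendall(build_connect_request(timeout_ms))
        (length,) = struct.unpack(">i", recv_exact(sock, 4))
        body = recv_exact(sock, length)
    # Reply starts with protocolVersion (int), timeOut (int), sessionId (long)
    _version, negotiated_timeout, session_id = struct.unpack(">iiq", body[:16])
    return {"negotiated_timeout_ms": negotiated_timeout, "session_id": session_id}
```

Calling `zk_handshake("zookeeper", 2181)` from inside the Kafka pod (where `"zookeeper"` is a placeholder for the actual service name) either returns the negotiated timeout and session id or raises on a timeout or early close, which helps separate a reachability problem from a session-level one.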

I've tried a few things:

1. As mentioned above, I saw that IP/TCP traffic between Kafka and Zookeeper did seem to be working successfully, so I don't believe it's an underlying routing issue.
2. This is sort of implied by (1), but I looked at the iptables rules in the nat table, and the rules seem to be correct. The zookeeper service is correctly NAT'd to the zookeeper pod IP.
3. I've manually tried running debugging commands from within the Kafka pod to confirm once again if it could make an end to end connection to Zookeeper. The following seemed to work: echo mntr | nc 2181.
4. I don't have any firewalls running to my knowledge. It is entirely possible there is something within iptables that is preventing another layer from working, but this is what I hope to get some clarity on.
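
The `mntr` check in step 3 can also be scripted, which makes it easier to repeat from different pods. This is a minimal sketch in Python using only the standard library; the function names are hypothetical, and it assumes the four-letter-word commands are enabled on the server (they evidently were here, since `mntr` answered).

```python
import socket


def four_letter_word(host, port=2181, command="mntr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection once the reply is complete
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn tab-separated `mntr` output into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats
```

For example, `parse_mntr(four_letter_word("zookeeper"))` returns the server's metrics as a dict, where `"zookeeper"` is a placeholder for the service name or pod IP. A successful reply confirms that the server answers on the port, but not that a full client session can be established.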

-- jackson4123

1 Answer


I have this working now. The cause appears to be that I repeatedly brought the cluster down and up without properly clearing the networking state, which probably led to traffic being black-holed somewhere.

It may be overkill, but what I ended up doing was simply flushing the iptables rules and restarting all relevant services, such as Docker, that require special iptables rules. Now that the cluster is working, I don't expect to re-create it repeatedly.
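
The exact commands are not listed above. As a rough sketch of what such a reset might look like, the Python helper below builds a list of `iptables` flush commands followed by service restarts, and only prints them unless `dry_run` is turned off. The choice of tables, the `systemctl` service names, and the helper names are all assumptions, not the commands that were actually run. Flushing removes every rule in those tables, so this only makes sense on a local machine where Docker and Kubernetes will re-create their own rules after a restart.

```python
import subprocess


def reset_commands(services=("docker",), tables=("filter", "nat", "mangle")):
    """Return the commands for a full iptables flush plus service restarts."""
    cmds = []
    for table in tables:
        cmds.append(["iptables", "-t", table, "-F"])  # flush all rules in the table
        cmds.append(["iptables", "-t", table, "-X"])  # delete now-empty user-defined chains
    for service in services:
        cmds.append(["systemctl", "restart", service])
    return cmds


def run(cmds, dry_run=True):
    """Print each command; execute it (root required) only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Running `run(reset_commands())` prints the commands for review; `run(reset_commands(), dry_run=False)` would execute them and needs root.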

-- jackson4123
Source: StackOverflow