For context, I am bringing up Kafka and Zookeeper locally on an Ubuntu machine using Kubernetes, through Helm:
- name: kafka
  version: 12.7.3
  repository: https://charts.bitnami.com/bitnami
I've looked at existing questions about this error, but none of them matches my issue exactly. In those questions, the cause seems to involve Docker networks or connectivity. On my local setup, however, I can see that Kafka can reach Zookeeper and complete a TCP connection. I captured the following tshark output, where .83 is Kafka and .80 is Zookeeper:
57 118.532604170 192.168.83.83 → 192.168.83.80 TCP 74 44978 → 2181 [SYN] Seq=0 Win=64800 Len=0 MSS=1440 SACK_PERM=1 TSval=3500466016 TSecr=0 WS=128
58 118.532617080 192.168.83.80 → 192.168.83.83 TCP 74 2181 → 44978 [SYN, ACK] Seq=0 Ack=1 Win=64260 Len=0 MSS=1440 SACK_PERM=1 TSval=1996498322 TSecr=3500466016 WS=128
59 118.532633329 192.168.83.83 → 192.168.83.80 TCP 66 44978 → 2181 [ACK] Seq=1 Ack=1 Win=64896 Len=0 TSval=3500466016 TSecr=1996498322
60 118.535617526 192.168.83.83 → 192.168.83.80 TCP 115 44978 → 2181 [PSH, ACK] Seq=1 Ack=1 Win=64896 Len=49 TSval=3500466019 TSecr=1996498322
61 118.535644624 192.168.83.80 → 192.168.83.83 TCP 66 2181 → 44978 [ACK] Seq=1 Ack=50 Win=64256 Len=0 TSval=1996498325 TSecr=3500466019
62 118.537006985 192.168.83.80 → 192.168.83.83 TCP 107 2181 → 44978 [PSH, ACK] Seq=1 Ack=50 Win=64256 Len=41 TSval=1996498326 TSecr=3500466019
63 118.537047974 192.168.83.83 → 192.168.83.80 TCP 66 44978 → 2181 [ACK] Seq=50 Ack=42 Win=64896 Len=0 TSval=3500466020 TSecr=1996498326
64 118.540259005 192.168.83.83 → 192.168.83.80 TCP 78 44978 → 2181 [PSH, ACK] Seq=50 Ack=42 Win=64896 Len=12 TSval=3500466024 TSecr=1996498326
65 118.540263332 192.168.83.80 → 192.168.83.83 TCP 66 2181 → 44978 [ACK] Seq=42 Ack=62 Win=64256 Len=0 TSval=1996498330 TSecr=3500466024
66 118.541564514 192.168.83.80 → 192.168.83.83 SMPP 86 Bind_receiver[Malformed Packet]
67 118.541607278 192.168.83.83 → 192.168.83.80 TCP 66 44978 → 2181 [ACK] Seq=62 Ack=62 Win=64896 Len=0 TSval=3500466025 TSecr=1996498331
68 118.541999795 192.168.83.80 → 192.168.83.83 TCP 66 2181 → 44978 [FIN, ACK] Seq=62 Ack=62 Win=64256 Len=0 TSval=1996498331 TSecr=3500466025
69 118.542214437 192.168.83.83 → 192.168.83.80 TCP 66 44978 → 2181 [FIN, ACK] Seq=62 Ack=63 Win=64896 Len=0 TSval=3500466026 TSecr=1996498331
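To check a capture like this without reading it frame by frame, here is a minimal Python sketch. It assumes the one-line-per-frame tshark layout shown above and that the text starts at the first frame of a single connection; the names `tcp_events` and `handshake_completed` are my own, not part of any tool.

```python
import re

# Matches the default one-line tshark summary:
# frame, time, source, "→", destination, protocol, length, info.
LINE_RE = re.compile(
    r"^\s*(?P<frame>\d+)\s+(?P<time>[\d.]+)\s+(?P<src>\S+)\s+→\s+"
    r"(?P<dst>\S+)\s+(?P<proto>\S+)\s+(?P<length>\d+)\s+(?P<info>.*)$"
)
# TCP flags appear in the info column as e.g. "[SYN, ACK]".
FLAGS_RE = re.compile(r"\[([A-Z, ]+)\]")


def tcp_events(capture):
    """Return (src, dst, flags) for every frame that tshark labelled TCP."""
    events = []
    for line in capture.splitlines():
        m = LINE_RE.match(line)
        if not m or m["proto"] != "TCP":
            continue  # skip blank lines and frames dissected as something else
        flags = FLAGS_RE.search(m["info"])
        if flags:
            events.append((m["src"], m["dst"], tuple(flags.group(1).split(", "))))
    return events


def handshake_completed(events):
    """True if the first three TCP frames are SYN, SYN/ACK, ACK between one pair."""
    if len(events) < 3:
        return False
    (c1, s1, f1), (s2, c2, f2), (c3, s3, f3) = events[:3]
    return (
        f1 == ("SYN",)
        and f2 == ("SYN", "ACK")
        and f3 == ("ACK",)
        and (s2, c2) == (s1, c1)  # reply comes from the server to the client
        and (c3, s3) == (c1, s1)  # final ACK goes from the client to the server
    )
```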
Despite this, I still see the following errors in the Kafka logs:
[2021-01-29 19:17:49,922] INFO Session: 0x1000031e07c0011 closed (org.apache.zookeeper.ZooKeeper)
[2021-01-29 19:17:49,922] INFO EventThread shut down for session: 0x1000031e07c0011 (org.apache.zookeeper.ClientCnxn)
[2021-01-29 19:17:49,925] INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient)
[2021-01-29 19:17:49,928] ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:262)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:119)
at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1881)
at kafka.server.KafkaServer.createZkClient$1(KafkaServer.scala:441)
at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:466)
at kafka.server.KafkaServer.startup(KafkaServer.scala:233)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44)
at kafka.Kafka$.main(Kafka.scala:82)
at kafka.Kafka.main(Kafka.scala)
I've tried a few things:
1. As mentioned above, TCP/IP traffic between Kafka and Zookeeper appears to work, so I don't believe it's an underlying routing issue.
2. This is sort of implied by (1), but I looked at the iptables rules in the nat table, and the rules seem to be correct. The zookeeper service is correctly NAT'd to the zookeeper pod IP.
3. I manually ran debugging commands from within the Kafka pod to confirm once more that it could make an end-to-end connection to Zookeeper. The following worked: echo mntr | nc 10.96.85.98 2181
4. To my knowledge, I don't have any firewalls running. It is entirely possible that something within iptables is preventing another layer from working, but that is what I hope to get some clarity on.
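The nc check from step 3 can also be scripted. Below is a minimal Python sketch, assuming the Zookeeper server accepts the four-letter command and closes the connection after replying; the function name `zk_four_letter` and the 5-second timeout are illustrative choices of mine.

```python
import socket


def zk_four_letter(host, port=2181, cmd="mntr", timeout=5.0):
    """Send a Zookeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            # Read until the server closes the connection (recv returns b"").
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling `print(zk_four_letter("10.96.85.98"))` from inside the Kafka pod would be the equivalent of the nc command above. A `socket.timeout` or `ConnectionRefusedError` here would point at the network path rather than at Kafka itself.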
I have this working now. The cause appears to be that I repeatedly brought the cluster down and up without properly clearing the networking state, which probably led to traffic being black-holed somewhere.
It may be overkill, but what I ended up doing was simply flushing the iptables rules and restarting all relevant services that require their own iptables rules, such as docker. Now that the cluster is working, I don't expect to re-create it repeatedly.
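For reference, here is a rough Python sketch of that kind of reset. The exact command list is an assumption: I have only listed the filter and nat tables and the docker service, so adjust it to whatever applies on your machine. It defaults to a dry run that only prints the commands, because flushing iptables is destructive and needs root.

```python
import subprocess

# Assumed command list, not an exact record of what was run.
RESET_COMMANDS = [
    ["iptables", "-F"],                  # flush every chain in the filter table
    ["iptables", "-t", "nat", "-F"],     # flush every chain in the nat table
    ["systemctl", "restart", "docker"],  # docker re-adds its own rules on start
]


def reset_networking(dry_run=True):
    """Print the reset commands, or run them when dry_run is False."""
    rendered = [" ".join(cmd) for cmd in RESET_COMMANDS]
    for cmd, text in zip(RESET_COMMANDS, rendered):
        if dry_run:
            print(text)
        else:
            subprocess.run(cmd, check=True)  # raises if a command fails
    return rendered
```

Running `reset_networking()` prints the three commands, and `reset_networking(dry_run=False)` executes them in order, stopping at the first failure.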