I run Kafka inside Kubernetes cluster on VMWare with a ControlPlane and one worker node. From the ControlPlane node my client can communicate with Kafka, but from my worker node this ends up in this error
%3|1638529687.405|FAIL|apollo-prototype-765f4d8bcf-bjpf4#producer-2| [thrd:sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap]: sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap:9092': Temporary failure in name resolution (after 20016ms in state CONNECT, 2 identical error(s) suppressed)
%3|1638529687.406|ERROR|apollo-prototype-765f4d8bcf-bjpf4#producer-2| [thrd:app]: apollo-prototype-765f4d8bcf-bjpf4#producer-2: sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap:9092': Temporary failure in name resolution (after 20016ms in state CONNECT, 2 identical error(s) suppressed)
This is my Kafka cluster manifest (using Strimzi)
listeners:
- name: plain
port: 9092
type: internal
tls: false
authentication:
type: scram-sha-512
- name: external
port: 9094
type: ingress
tls: true
authentication:
type: scram-sha-512
configuration:
class: nginx
bootstrap:
host: localb.kafka.xxx.com
brokers:
- broker: 0
host: local.kafka.xxx.com
To be mentioned that exactly the same config, when i run inside in a cloud works flawleslly.
Telnet and nslookup (from both nodes) throw up an error. CoreDNS logs do not even mention this error. Also firewall is disable on both nodes.
Could you please help me out? Thanks!
UPDATE: SOLUTION Calico Pod (from the worker node) was complaining that bird: Netlink: Network is down, even it was not crashing
2021-12-03 09:39:58.051 [INFO][90] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{}}
2021-12-03 09:39:58.051 [INFO][90] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{}}
2021-12-03 09:39:58.052 [INFO][90] felix/ipsets.go 130: Queueing IP set for creation family="inet" setID="this-host" setType="hash:ip"
2021-12-03 09:39:58.057 [INFO][90] felix/ipsets.go 785: Doing full IP set rewrite family="inet" numMembersInPendingReplace=3 setID="this-host"
2021-12-03 09:39:58.059 [INFO][90] felix/int_dataplane.go 1036: Linux interface state changed. ifIndex=13 ifaceName="tunl0" state="down"
2021-12-03 09:39:58.082 [INFO][90] felix/int_dataplane.go 1521: Received interface update msg=&intdataplane.ifaceUpdate{Name:"tunl0", State:"down", Index:13}
bird: Netlink: Network is down
Here is what I have done and it worked like a charm!
The fault is caused by the different ipvs modules loaded by the node. I configured the ipip module for the new node, but the old node did not load the ipip module, which caused the calico exception. Delete the ipip module to return to normal.
[root@k8s-node236-232 ~]# lsmod | grep ipip ipip 16384 0 tunnel4 16384 1 ipip ip_tunnel 24576 1 ipip [root@k8s-node236-232 ~]# modprobe -r ipip [root@k8s-node236-232 ~]# lsmod | grep ipip
Calico Pod (from the worker node) was complaining that bird: Netlink: Network is down, even it was not crashing
2021-12-03 09:39:58.051 [INFO][90] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{}}
2021-12-03 09:39:58.051 [INFO][90] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{}}
2021-12-03 09:39:58.052 [INFO][90] felix/ipsets.go 130: Queueing IP set for creation family="inet" setID="this-host" setType="hash:ip"
2021-12-03 09:39:58.057 [INFO][90] felix/ipsets.go 785: Doing full IP set rewrite family="inet" numMembersInPendingReplace=3 setID="this-host"
2021-12-03 09:39:58.059 [INFO][90] felix/int_dataplane.go 1036: Linux interface state changed. ifIndex=13 ifaceName="tunl0" state="down"
2021-12-03 09:39:58.082 [INFO][90] felix/int_dataplane.go 1521: Received interface update msg=&intdataplane.ifaceUpdate{Name:"tunl0", State:"down", Index:13}
bird: Netlink: Network is down
Here is what I have done and it worked like a charm!
The fault is caused by the different ipvs modules loaded by the node. I configured the ipip module for the new node, but the old node did not load the ipip module, which caused the calico exception. Delete the ipip module to return to normal.
[root@k8s-node236-232 ~]# lsmod | grep ipip ipip 16384 0 tunnel4 16384 1 ipip ip_tunnel 24576 1 ipip [root@k8s-node236-232 ~]# modprobe -r ipip [root@k8s-node236-232 ~]# lsmod | grep ipip