I know the information below is probably not enough to trace the issue, but I would still like some guidance.
We have an Amazon EKS cluster.
Currently, we are facing a reachability issue with the Kafka pod.
Environment:

Working:

    telnet 10.0.1.45 19092

It works as expected. IP 10.0.1.45 is the load balancer's IP.

    telnet 10.0.1.69 31899

It works as expected. IP 10.0.1.69 is an actual node's IP and 31899 is the NodePort.

Problem:

    telnet 10.0.1.45 19092

It works sometimes, and sometimes it gives an error like: telnet: Unable to connect to remote host: Connection timed out
The issue seems to be related to kube-proxy, and we need help to resolve it.
Can anyone guide me? Can I restart kube-proxy? Will it affect other pods/deployments?
I believe this problem is caused by the TCP-only nature of AWS's NLB (as mentioned in the comments).
In a nutshell, your pod-to-pod communication fails when hairpinning is needed.
To confirm this is the root cause, verify that when the telnet works, the Kafka pod and the client pod are not on the same EC2 node, and that when they are on the same node, the telnet fails (a quick check is sketched below).
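A quick way to check this is to look at which node each pod has been scheduled on. The name fragments below are placeholders; adjust them to your actual broker and client pod names:

    # Show the node each pod runs on; "kafka" and "client" are assumed
    # name fragments, so adjust them to your real pod names.
    kubectl get pods -o wide | grep -Ei 'kafka|client'

If the telnet fails exactly when both pods show up on the same NODE, NLB hairpinning is very likely the culprit.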
There are (at least) two approaches to tackle this issue:
Every K8s service has its own DNS FQDN for internal usage (meaning the traffic stays on the k8s network only, without going out to the LoadBalancer and coming back into k8s again). You can simply telnet this FQDN instead of the NodePort via the LB.
I.e. let's assume your Kafka service is named kafka
and lives in the default namespace. Then you can just telnet kafka.default.svc.cluster.local
(the general form is <service>.<namespace>.svc.cluster.local), on the port exposed by the kafka service.
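As a rough sketch (the default namespace and port 9092 here are assumptions; check your actual service first):

    # Confirm the service name and the in-cluster port it exposes
    # (service name "kafka" and namespace "default" are assumptions).
    kubectl get svc kafka -n default

    # Connect via the cluster-internal DNS name instead of the NLB;
    # replace 9092 with the port your service actually exposes.
    telnet kafka.default.svc.cluster.local 9092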
Oh, and as indicated in this answer, you might need to make that service headless.
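If you do go the headless route, a minimal sketch could look like this (service name, namespace, label, and port are all assumptions here; also note that clusterIP is immutable, so the existing service usually has to be deleted and recreated rather than patched in place):

    # Recreate the service as headless. Review names, labels, and ports
    # against your real manifests before running anything like this.
    kubectl -n default delete svc kafka

    cat <<'EOF' | kubectl -n default apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: kafka
    spec:
      clusterIP: None        # headless: DNS resolves directly to the pod IPs
      selector:
        app: kafka           # assumed pod label
      ports:
        - name: broker
          port: 9092         # assumed in-cluster broker port
          targetPort: 9092
    EOF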