Error sending fetch request (sessionId=1175648978, epoch=189) to node 53: org.apache.kafka.common.errors.DisconnectException

5/11/2020

We have a topic with 100 partitions and the load is millions of records per hour.

We ran into the problem whenever we deploy a new version of stream-processor using state-store with stateful-set in Kubernetes.

Normally, we need 4 pods to handle the workload of 100 partitions.

Before deploying a new version, 4 instances are up to date with the data from the topics.

Three out of 4 times when we deploy a new version, only 2 or 3 instances are processing the data within a minutes, the others throw exception:

 Error sending fetch request (sessionId=1175648978, epoch=189) to node 53: org.apache.kafka.common.errors.DisconnectException

So all the data in the partitions that assigned to instance # 4 are build up and the lag is increasing...

If we scale the number of instances to 6 or 8, then 5 or 6 instances are processing the data, other 3 or 2 instances throw this exception:

 Error sending fetch request (sessionId=1175648978, epoch=189) to node 53: org.apache.kafka.common.errors.DisconnectException

If we let all the instances run like that, eventually ( some between 4 to 36 hours later ) all the instances will be fine and no more exception from any pod.

Any suggestions to fix this problem be appreciated.

Thanks,

Austin

-- AustinTX
apache-kafka
apache-kafka-streams
java
kubernetes

0 Answers