We have a topic with 100 partitions and a load of millions of records per hour.
We run into a problem whenever we deploy a new version of our stream processor, which uses a state store and runs as a StatefulSet in Kubernetes.
Normally, we need 4 pods to handle the workload of 100 partitions.
Before deploying a new version, all 4 instances are up to date with the data from the topics.
About three out of four times when we deploy a new version, only 2 or 3 instances are processing data within a few minutes; the others throw this exception:
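For reference, our Streams configuration looks roughly like the sketch below (the application id, bootstrap servers, state directory, and thread count are placeholders, not our exact values):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamProcessorConfig {
    public static Properties buildConfig() {
        Properties props = new Properties();
        // Placeholder identifiers; the real values come from our Kubernetes environment.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // State store directory on the StatefulSet's persistent volume.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/data/state-store");
        // Each pod handles its share of the 100 partitions with a few stream threads.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
        return props;
    }
}
```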
Error sending fetch request (sessionId=1175648978, epoch=189) to node 53: org.apache.kafka.common.errors.DisconnectException
So the data in the partitions assigned to instance #4 builds up and the consumer lag keeps increasing.
If we scale the number of instances to 6 or 8, then 5 or 6 instances process the data while the other 3 or 2 throw this exception:
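This is roughly how we watch the per-partition lag, using the Kafka AdminClient (the bootstrap servers and group id below are placeholders; the group id is whatever application.id the Streams app uses):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the Streams consumer group (group id = application.id).
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("stream-processor") // placeholder group id
                         .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(request).all().get();

            // Lag per partition = log-end offset minus committed offset.
            committed.forEach((tp, meta) -> {
                long lag = endOffsets.get(tp).offset() - meta.offset();
                System.out.println(tp + " lag=" + lag);
            });
        }
    }
}
```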
Error sending fetch request (sessionId=1175648978, epoch=189) to node 53: org.apache.kafka.common.errors.DisconnectException
(the same exception as above)
If we just let all the instances keep running, eventually (somewhere between 4 and 36 hours later) every instance recovers and no pod throws the exception any more.
Any suggestions to fix this problem would be appreciated.
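To see when an instance becomes healthy again, we log the Streams state transitions with a state listener; a minimal sketch (the topology itself is omitted and the config comes from the placeholder class above):

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

public class ProcessorMain {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // ... the actual topology with the state store is built here ...
        Topology topology = builder.build();

        Properties props = StreamProcessorConfig.buildConfig(); // config sketch from above

        KafkaStreams streams = new KafkaStreams(topology, props);
        // Log every state transition so we can tell when a pod leaves REBALANCING
        // and settles back into RUNNING (which eventually happens 4 to 36 hours later).
        streams.setStateListener((newState, oldState) ->
                System.out.println("Streams state: " + oldState + " -> " + newState));
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```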
Thanks,
Austin