Kafka is marking the coordinator dead when autoscaling is on

11/28/2018

We run a Kubernetes cluster with Kafka 0.10.2. In the cluster we have a replica set of 10 replicas running one of our services, which consumes from one topic as a single consumer group.

Recently we turned on autoscaling for this replica set, so it can increase or decrease the number of pods based on its CPU consumption.

Soon after this feature was enabled we started to see lag in our Kafka queue. I looked at the logs and saw that the consumer marks the coordinator dead very often (almost every 5 minutes) and then reconnects to the same coordinator a few seconds later.

I also frequently saw this in the logs:

org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. 
This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. 
You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.

It normally takes a few seconds to process a message, and we never had this kind of issue before. I assume the problem is related to a bad partition assignment, but I can't pinpoint it.

If we kill a pod that got "stuck", Kafka reassigns the partition to another pod and that pod gets stuck as well. But if we scale the replica set down to 0 and then scale it back up, the messages are consumed quickly!

Relevant consumer configurations:

heartbeat.interval.ms = 3000
max.poll.interval.ms = 300000
max.poll.records = 500
session.timeout.ms = 10000
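
A rough sanity check on these values can be sketched in Java. It compares the worst-case time to process one full batch against max.poll.interval.ms, which is the condition the CommitFailedException text describes. The per-message time of 2000 ms is an assumption standing in for "a few seconds", and the class and method names (PollBudget, worstCaseBatchMillis, maxSafeRecords) are hypothetical, not part of any Kafka API. This is only arithmetic on the posted settings, not a diagnosis.

```java
import java.util.Properties;

class PollBudget {
    // Worst-case time to process one full batch, in milliseconds.
    static long worstCaseBatchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    // Largest batch size whose worst-case processing time fits in the poll interval.
    static int maxSafeRecords(long maxPollIntervalMillis, long perRecordMillis) {
        return (int) (maxPollIntervalMillis / perRecordMillis);
    }

    public static void main(String[] args) {
        // The consumer settings from the question, as plain properties.
        Properties props = new Properties();
        props.setProperty("heartbeat.interval.ms", "3000");
        props.setProperty("max.poll.interval.ms", "300000");
        props.setProperty("max.poll.records", "500");
        props.setProperty("session.timeout.ms", "10000");

        // ASSUMPTION: about 2 seconds per message; measure the real value.
        long perRecordMillis = 2000L;

        long interval = Long.parseLong(props.getProperty("max.poll.interval.ms"));
        int records = Integer.parseInt(props.getProperty("max.poll.records"));

        long worst = worstCaseBatchMillis(records, perRecordMillis);
        System.out.println("worst-case batch time: " + worst + " ms");
        System.out.println("max.poll.interval.ms:  " + interval + " ms");
        System.out.println("exceeds interval:      " + (worst > interval));
        System.out.println("largest batch within the interval: "
                + maxSafeRecords(interval, perRecordMillis));
    }
}
```

Under that assumption, a full batch of 500 records would take about 1,000,000 ms, well over the 300,000 ms interval, while a batch of up to 150 records would fit. Whether full batches are actually returned depends on how much lag has built up, so the real per-message time and batch sizes should be measured before changing max.poll.records.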

Any suggestions?

-- Yuval
apache-kafka
kubernetes
spring-cloud-stream
spring-kafka

1 Answer

11/28/2018

I am not saying this is the problem, but spring-kafka 1.1.x had a very complicated threading model (required by the 0.9 clients). For long-running listeners we had to pause/resume the consumer thread; I saw some issues with early Kafka versions where the resume didn't always work.

KIP-62 allowed us to greatly simplify the threading model.

This was incorporated into the 1.3.x line.

I would suggest upgrading to at least Spring Cloud Stream Ditmars, which uses spring-kafka 1.3.x. The current 1.3.x version is 1.3.8.

-- Gary Russell
Source: StackOverflow