Kafka incremental sticky rebalancing

6/21/2021

I am running Kafka on Kubernetes using the Strimzi Kafka operator. I am using the incremental cooperative rebalancing strategy by configuring my consumers as follows:

    props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
            org.apache.kafka.clients.consumer.CooperativeStickyAssignor.class.getName());

Each time I scale the consumers in my consumer group, all existing consumers in the group throw the following exception:

Exception in thread "main" org.apache.kafka.common.errors.RebalanceInProgressException: Offset commit cannot be completed since the consumer is undergoing a rebalance for auto partition assignment. You can try completing the rebalance by calling poll() and then retry the operation

Any idea on what caused this exception and/or how to resolve it?

Thank you.

-- Mazen Ezzeddine
apache-kafka
kafka-consumer-api
kubernetes
strimzi

1 Answer

6/21/2021

The consumer rebalance happens whenever there is a change in the metadata information of a consumer group.

Adding more consumers (scaling, in your words) to a group is one such change and triggers a rebalance. During this change, each consumer is re-assigned partitions and therefore does not know which offsets to commit until the re-assignment is complete. The CooperativeStickyAssignor does try to preserve the previous assignment as much as possible, but the rebalance is still triggered, and an even distribution of partitions takes precedence over retaining the previous assignment. (Reference - Kafka Documentation)
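To see what the cooperative protocol actually moves during such a change, here is a minimal sketch that registers a ConsumerRebalanceListener (the bootstrap address, group id and topic name are placeholders). With the CooperativeStickyAssignor, onPartitionsRevoked is invoked only for the partitions that are actually migrating to another consumer; the rest of the assignment stays in place:

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class RebalanceObserver {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                    CooperativeStickyAssignor.class.getName());

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Under the cooperative protocol this is only the subset of
                    // partitions moving to another member, not the full assignment.
                    System.out.println("Revoked (migrating away): " + partitions);
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    System.out.println("Newly assigned: " + partitions);
                }
            });
            // The listener callbacks fire inside subsequent poll() calls.
        }
    }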

Beyond that, the exception message is self-explanatory: while a rebalance is in progress, some operations, such as committing offsets, are prohibited.
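As the message itself suggests, one common pattern is to treat the failed commit as retriable: let the next poll() complete the rebalance and commit again on the following pass. A minimal sketch (the consumer is assumed to be already subscribed, e.g. as configured in the question; the record handling is application-specific):

    import java.time.Duration;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.errors.RebalanceInProgressException;

    public class PollCommitLoop {
        static void run(KafkaConsumer<String, String> consumer) {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // ... process the record (application-specific) ...
                }
                try {
                    consumer.commitSync();
                } catch (RebalanceInProgressException e) {
                    // A rebalance raced with this commit. The next poll() completes
                    // the rebalance, and the commit is retried on the next iteration.
                    System.err.println("Commit deferred, rebalance in progress");
                }
            }
        }
    }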

How to avoid such situations?

This is a tricky one because Kafka needs rebalancing to work effectively. There are a few practices you could use to reduce the unnecessary impact (a combined configuration sketch follows the list):

  1. Increase the allowed polling interval - max.poll.interval.ms - so that slow processing is less likely to trigger a rebalance and these exceptions become less frequent.
  2. Decrease the amount of data fetched per poll - max.poll.records or max.partition.fetch.bytes - so each processing cycle finishes sooner.
  3. Try to use the latest version(s) of Kafka (or upgrade if you're using an old one), as many of the recent releases have made improvements to the rebalance protocol.
  4. Use the static membership protocol (group.instance.id) to reduce rebalances on restarts.
  5. Consider configuring group.initial.rebalance.delay.ms for empty consumer groups (either for a first-time deployment or when destroying everything and redeploying again).
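Taken together, the consumer-side settings from points 1, 2 and 4 might look like the sketch below; the values are illustrative, not recommendations. Point 5 is a broker property, so in Strimzi it belongs under spec.kafka.config rather than in the consumer:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class TunedConsumerConfig {
        static Properties tunedProps() {
            Properties props = new Properties();
            // 1. Allow more time between polls before the consumer is considered failed.
            props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");      // illustrative
            // 2. Fetch less data per poll so each processing cycle finishes sooner.
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");             // illustrative
            props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "262144"); // illustrative
            // 4. Static membership: a stable group.instance.id (e.g. the pod name)
            // lets a restarted consumer rejoin without triggering a rebalance.
            props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "consumer-pod-0"); // illustrative
            // 5. group.initial.rebalance.delay.ms is a *broker* property
            // (in Strimzi: spec.kafka.config), not a consumer property.
            return props;
        }
    }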

These techniques can only help you reduce the unnecessary rebalances and the resulting exceptions, but they will NOT prevent rebalancing completely.

-- Lalit
Source: StackOverflow