We run a Kafka cluster in Kubernetes based on the gcr.io/google_containers/kubernetes-kafka:1.0-10.2.1 Docker image, with a ZooKeeper backend using gcr.io/google_containers/kubernetes-zookeeper:1.0-3.4.10, running three instances each of Kafka and ZooKeeper.
We have a few different consumer groups that both consume and produce data on three different topics.
Behaviour: Sometimes a consumer group will set its offset for a partition of a topic to -1 and from then on stop consuming from that topic altogether. If we restart our consumers, we may see them set their offset to the latest offset, which means the consumers may have missed messages between the offset going to -1 and the restart.
I'm having trouble finding out why a consumer group would ever set its offset to -1, and why it would do so "randomly" after days of uptime. Is there any logical explanation for why Kafka would set this offset for a certain consumer? We cannot see anything in our consumers that indicates they are doing this explicitly.
We currently have consumers running both in Go and in Node.js, and all of them are hitting this issue, so our current assumption is that the problem lies not with our consumers but with our Kafka setup.
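To see what the brokers actually have stored for a group, we inspect the committed offsets directly. A minimal sketch of that check, using the sarama Go client (assumes a reasonably recent sarama version; the broker address, group, and topic names below are placeholders):

```go
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V0_10_2_0 // matches our broker version

	admin, err := sarama.NewClusterAdmin([]string{"kafka-0.broker:9092"}, cfg)
	if err != nil {
		log.Fatalf("creating cluster admin: %v", err)
	}
	defer admin.Close()

	// Ask the group coordinator which offsets it has stored for this group.
	resp, err := admin.ListConsumerGroupOffsets("our-consumer-group", map[string][]int32{
		"our-topic": {0, 1, 2},
	})
	if err != nil {
		log.Fatalf("fetching committed offsets: %v", err)
	}

	for partition, block := range resp.Blocks["our-topic"] {
		// An offset of -1 means the broker has no committed offset for this
		// group/partition (either never committed, or removed by retention).
		fmt.Printf("partition %d: committed offset %d\n", partition, block.Offset)
	}
}
```

For the affected group/partition combinations this reports -1, i.e. the broker no longer has a committed offset for them.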
The default of the offset retention setting offsets.retention.minutes used to be 1 day (1440 minutes), and in older Kafka versions the offsets could get wiped out after that period even for active consumers. This was fixed with KIP-211.
We originally discovered this with Kafka 0.10.2.1: a few topics lost their consumer group offsets (i.e., they turned to -1) because no messages had arrived on those topics for a couple of days, so the offset retention policy kicked in and wiped out the offsets even though the consumers were still active.
We were able to work around this by increasing the retention setting to 7 days, which is also what Kafka itself eventually made the default; see KIP-186.
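For reference, the broker-side override looks roughly like this (10080 is simply 7 days expressed in minutes; where exactly you set it depends on how your image wires up the broker configuration):

```
# server.properties (broker configuration)
# Keep committed consumer offsets for 7 days instead of the old 1-day default.
offsets.retention.minutes=10080
```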