Kafka broker takes a long time in index recovery, and ends up shutting down

10/10/2019

I have a 3-broker, no-replica setup of Kafka on Azure AKS, using the cp-kafka 5.0.1 Helm chart (which uses the 5.0.1 image).

At some point (which, unfortunately, I do not have the logs for), one of the Kafka brokers crashed, and when it restarted it went into an endless, painful restart loop. It seems to be trying to recover certain corrupt log segments, takes a very long time, and is then killed with SIGTERM. To make things worse, I can no longer consume from or produce to the affected topics at all. Logs are attached below, along with a monitoring screen capture that shows Kafka slowly working through the log files, populating the disk cache.

Now, I have log.retention.bytes set to 180 GiB, and I would like to keep it that way without Kafka ending up in this endless loop. Suspecting this might be an old-version issue, I searched the Kafka JIRA for relevant keywords ("still starting up", "SIGTERM", "corrupted index file"), but found nothing.

So I cannot count on a newer version to solve this, and I don't want to rely on a small retention size either, since the same problem could recur with a larger number of corrupt log segments.

So my question is - is there a way to do any/all of the following:

  • Prevent the SIGTERM from happening, thus allowing Kafka to fully recover?
  • Allow consumption/production to resume on unaffected partitions (it seems only 4 of 30 partitions have a corrupt entry)?
  • Otherwise stop this madness from happening?

(If not, I will resort to: (a) upgrading Kafka; (b) shrinking log.retention.bytes by an order of magnitude; (c) turning on replicas in the hope that this helps; (d) improving logging to find out what is causing the crash in the first place.)
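For reference, options (b) and (c) would look roughly like the following as Helm values for the cp-kafka chart. This is only a sketch: the brokers and configurationOverrides keys are what I recall from cp-helm-charts, and the numbers are placeholders, so both should be verified against the chart version actually deployed.

# Sketch of a values.yaml override for the cp-kafka chart (key names assumed, verify before use)
brokers: 3
configurationOverrides:
  # (b) shrink retention by an order of magnitude: 180 GiB -> 18 GiB per partition
  "log.retention.bytes": "19327352832"
  # (c) replicate each partition so one broker's recovery does not block the whole topic;
  # this only affects newly created topics, existing ones need a partition reassignment
  "default.replication.factor": "2"
  # more recovery threads per data directory can shorten the index-rebuild phase on startup
  "num.recovery.threads.per.data.dir": "4"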


Logs

Log where loading the logs completes, but cleanup + flushing is interrupted:

[2019-10-10 00:05:36,562] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,598] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 00:05:37,802] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949}, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 00:42:27,037] INFO Logs loading complete in 2210438 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,052] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,054] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,057] INFO Starting the log cleaner (kafka.log.LogCleaner)
[2019-10-10 00:42:27,738] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 00:42:27,763] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)  

Log where loading is interrupted:

[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,504] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,549] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 01:55:27,123] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949}, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 02:17:01,249] INFO [ProducerStateManager partition=my-topic-12] Loading producer state from snapshot file '/opt/kafka/data-0/logs/my-topic-12/00000000000000004443.snapshot' (kafka.log.ProducerStateManager)
[2019-10-10 02:17:07,090] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 02:17:07,093] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Closing BaseMetricsReporter (io.confluent.support.metrics.BaseMetricsReporter)
[2019-10-10 02:17:07,093] INFO Waiting for metrics thread to exit (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Shutting down KafkaServer (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,097] INFO [KafkaServer id=2] shutting down (kafka.server.KafkaServer)
[2019-10-10 02:17:07,105] ERROR [KafkaServer id=2] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
    at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
    at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)
[2019-10-10 02:17:07,110] ERROR Caught exception when trying to shut down KafkaServer. Exiting forcefully. (io.confluent.support.metrics.SupportedServerStartable)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
    at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
    at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)

Monitoring

Memory consumption of Kafka throughout the crashes (monitoring screen capture)

-- Yuval
apache-kafka
azure-aks
confluent-platform
kubernetes-helm

1 Answer

12/29/2019

I found your question while searching for a solution to a similar problem.
I wonder if you ever solved it?
In the meantime: who is sending the SIGTERM? It is probably Kubernetes (or another orchestrator), so you can modify the liveness probe to allow more failed attempts before it kills the container.
Also make sure your -Xmx setting is lower than the memory assigned to the pod/container. Otherwise the pod will be OOM-killed (if Kubernetes is indeed what is running it).
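As a rough illustration of both points, assuming the broker runs as a StatefulSet with a TCP liveness probe on the broker port (whether your chart defines such a probe, and what numbers make sense, needs to be checked, e.g. with kubectl get statefulset <kafka-statefulset> -o yaml):

# Illustrative container settings for the Kafka broker pod (all values are placeholders).
# A generous liveness probe gives the broker time to finish index recovery before
# Kubernetes sends SIGTERM and restarts the container.
livenessProbe:
  tcpSocket:
    port: 9092
  initialDelaySeconds: 300   # wait before the first check; recovery can take a long time
  periodSeconds: 30
  failureThreshold: 60       # roughly 30 more minutes of failed checks before a restart
resources:
  requests:
    memory: 4Gi
  limits:
    memory: 4Gi              # keep the JVM -Xmx well below this (e.g. -Xmx2G),
                             # otherwise the container gets OOM-killed

Running kubectl describe pod on one of the broker pods will also show whether the previous restarts were caused by liveness-probe failures or by an OOM kill.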

-- Igal
Source: StackOverflow