Kafka brokers down with fully written storages

4/28/2020

I tried to produce as many messages as the brokers could handle. Once the storage (8 GB) was completely full, all the brokers stopped and they cannot come up again with this error.

Logs of the brokers trying to restart:

[2020-04-28 04:34:05,774] INFO [ThrottledChannelReaper-Request]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2020-04-28 04:34:05,774] INFO [ThrottledChannelReaper-Request]: Shutdown completed (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2020-04-28 04:34:05,842] INFO [KafkaServer id=1] shut down completed (kafka.server.KafkaServer)
[2020-04-28 04:34:05,844] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
[2020-04-28 04:33:58,847] INFO [ThrottledChannelReaper-Produce]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2020-04-28 04:34:05,844] INFO Closing BaseMetricsReporter (io.confluent.support.metrics.BaseMetricsReporter)
[2020-04-28 04:34:05,844] INFO Waiting for metrics thread to exit (io.confluent.support.metrics.SupportedServerStartable)
[2020-04-28 04:34:05,844] INFO Shutting down KafkaServer (io.confluent.support.metrics.SupportedServerStartable)
[2020-04-28 04:34:05,845] INFO [KafkaServer id=1] shutting down (kafka.server.KafkaServer)
[2020-04-28 04:33:58,847] INFO [ThrottledChannelReaper-Request]: Shutting down (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2020-04-28 04:33:59,847] INFO [ThrottledChannelReaper-Request]: Shutdown completed (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2020-04-28 04:33:59,847] INFO [ThrottledChannelReaper-Request]: Stopped (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2020-04-28 04:33:59,854] INFO [KafkaServer id=0] shut down completed (kafka.server.KafkaServer)
[2020-04-28 04:33:59,942] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
[2020-04-28 04:33:59,942] INFO Closing BaseMetricsReporter (io.confluent.support.metrics.BaseMetricsReporter)
[2020-04-28 04:33:59,942] INFO Waiting for metrics thread to exit (io.confluent.support.metrics.SupportedServerStartable)
[2020-04-28 04:33:59,942] INFO Shutting down KafkaServer (io.confluent.support.metrics.SupportedServerStartable)
[2020-04-28 04:33:59,942] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[2020-04-28 04:37:18,938] INFO [ReplicaAlterLogDirsManager on broker 2] Removed fetcher for partitions Set(__consumer_offsets-22, kafka-connect-offset-15, kafka-connect-offset-16, kafka-connect-offset-2, __consumer_offsets-30, kafka-connect-offset-7, kafka-connect-offset-13, __consumer_offsets-8, __consumer_offsets-21, kafka-connect-offset-8, kafka-connect-status-2, kafka-connect-offset-5, __consumer_offsets-4, __consumer_offsets-27, __consumer_offsets-7, __consumer_offsets-9, __consumer_offsets-46, kafka-connect-offset-19, __consumer_offsets-25, __consumer_offsets-35, __consumer_offsets-41, __consumer_offsets-33, __consumer_offsets-23, __consumer_offsets-49, kafka-connect-offset-20, kafka-connect-offset-3, __consumer_offsets-47, __consumer_offsets-16, __consumer_offsets-28, kafka-connect-config-0, kafka-connect-offset-9, kafka-connect-offset-17, __consumer_offsets-31, __consumer_offsets-36, kafka-connect-status-1, __consumer_offsets-42, __consumer_offsets-3, __consumer_offsets-18, __consumer_offsets-37, __consumer_offsets-15, __consumer_offsets-24, kafka-connect-offset-10, kafka-connect-offset-24, kafka-connect-status-4, __consumer_offsets-38, __consumer_offsets-17, __consumer_offsets-48, kafka-connect-offset-23, kafka-connect-offset-21, kafka-connect-offset-0, __consumer_offsets-19, __consumer_offsets-11, kafka-connect-status-0, __consumer_offsets-13, kafka-connect-offset-18, __consumer_offsets-2, __consumer_offsets-43, __consumer_offsets-6, __consumer_offsets-14, kafka-connect-offset-14, kafka-connect-offset-22, kafka-connect-offset-6, perf-test4-0, kafka-connect-status-3, kafka-connect-offset-11, kafka-connect-offset-12, __consumer_offsets-20, __consumer_offsets-0, kafka-connect-offset-4, __consumer_offsets-44, __consumer_offsets-39, kafka-connect-offset-1, __consumer_offsets-12, __consumer_offsets-45, __consumer_offsets-1, __consumer_offsets-5, __consumer_offsets-26, __consumer_offsets-29, __consumer_offsets-34, __consumer_offsets-10, __consumer_offsets-32, __consumer_offsets-40) (kafka.server.ReplicaAlterLogDirsManager)
[2020-04-28 04:37:18,990] INFO [ReplicaManager broker=2] Broker 2 stopped fetcher for partitions __consumer_offsets-22,kafka-connect-offset-15,kafka-connect-offset-16,kafka-connect-offset-2,__consumer_offsets-30,kafka-connect-offset-7,kafka-connect-offset-13,__consumer_offsets-8,__consumer_offsets-21,kafka-connect-offset-8,kafka-connect-status-2,kafka-connect-offset-5,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,kafka-connect-offset-19,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,kafka-connect-offset-20,kafka-connect-offset-3,__consumer_offsets-47,__consumer_offsets-16,__consumer_offsets-28,kafka-connect-config-0,kafka-connect-offset-9,kafka-connect-offset-17,__consumer_offsets-31,__consumer_offsets-36,kafka-connect-status-1,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,kafka-connect-offset-10,kafka-connect-offset-24,kafka-connect-status-4,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,kafka-connect-offset-23,kafka-connect-offset-21,kafka-connect-offset-0,__consumer_offsets-19,__consumer_offsets-11,kafka-connect-status-0,__consumer_offsets-13,kafka-connect-offset-18,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,kafka-connect-offset-14,kafka-connect-offset-22,kafka-connect-offset-6,perf-test4-0,kafka-connect-status-3,kafka-connect-offset-11,kafka-connect-offset-12,__consumer_offsets-20,__consumer_offsets-0,kafka-connect-offset-4,__consumer_offsets-44,__consumer_offsets-39,kafka-connect-offset-1,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40 and stopped moving logs for partitions  because they are in the failed log directory /opt/kafka/data/logs. (kafka.server.ReplicaManager)
[2020-04-28 04:37:18,990] INFO Stopping serving logs in dir /opt/kafka/data/logs (kafka.log.LogManager)
[2020-04-28 04:37:18,992] WARN [Producer clientId=producer-1] 1 partitions have leader brokers without a matching listener, including [__confluent.support.metrics-0] (org.apache.kafka.clients.NetworkClient)
[2020-04-28 04:37:18,996] ERROR Shutdown broker because all log dirs in /opt/kafka/data/logs have failed (kafka.log.LogManager)

Prometheus monitoring snapshot

I would like to prevent this situation before the brokers go down, for example by removing older messages to free up enough space, or something similar. Is there a recommended best practice for this?

-- Anton 재호프
apache-kafka
confluent
kafka-producer-api
kubernetes
prometheus

1 Answer

4/28/2020

You need to configure two parameters on your Kafka brokers: log.retention.hours and log.retention.bytes. They control how long data is kept and how much data is kept before old log segments are deleted. Reduce those values if you want old data removed sooner, or raise them only if you want to store more data and actually have the disk space for it.
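
As an illustration only (these settings are not from the original answer, and the values merely assume the 8 GB data volume from the question), the broker-side settings in server.properties could look roughly like this. Keep in mind that log.retention.bytes is enforced per partition, not per broker:

    # Illustrative retention settings in server.properties (adjust to your own sizing)
    log.retention.hours=24
    # Per-partition size cap (~512 MB here); total usage is roughly this times the number of partitions on the disk.
    log.retention.bytes=536870912
    # Smaller segments (128 MB) let old data become eligible for deletion sooner.
    log.segment.bytes=134217728
    # How often the cleaner checks for deletable segments (5 minutes, the default).
    log.retention.check.interval.ms=300000

The same limits can also be overridden per topic with the retention.ms and retention.bytes topic configs, which is useful when only a few topics grow quickly.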

Another thing to check, if the problem is storage, is the application logs themselves. The logs Kafka generates can use a lot of space, so you need some log rotation in place as well. Take a look at the size of your logs folder.
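
For example (again only a sketch, assuming the stock config/log4j.properties layout that ships with Kafka), you could switch the broker's server.log appender to size-based rotation so it never grows unbounded:

    # Sketch: size-based rotation for server.log in config/log4j.properties
    log4j.appender.kafkaAppender=org.apache.log4j.RollingFileAppender
    log4j.appender.kafkaAppender.File=${kafka.logs.dir}/server.log
    # Keep at most 10 files of 100 MB each (about 1 GB of broker logs in total).
    log4j.appender.kafkaAppender.MaxFileSize=100MB
    log4j.appender.kafkaAppender.MaxBackupIndex=10
    log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
    log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n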

-- William Prigol Lopes
Source: StackOverflow