Problem
I have a Kafka setup with three brokers in Kubernetes, set up according to the guide at https://github.com/Yolean/kubernetes-kafka. The following error message appears when producing messages from a Java client.
2018-06-06 11:15:44.103 ERROR 1 --- [ad | producer-1] o.s.k.support.LoggingProducerListener : Exception thrown when sending a message with key='null' and payload='[...redacted...]':
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for topicname-0: 30001 ms has passed since last append
Detailed setup
The listeners are set up to allow SSL producers/consumers from the outside world:
advertised.host.name = null
advertised.listeners = OUTSIDE://kafka-0.mydomain.com:32400,PLAINTEXT://:9092
advertised.port = null
listener.security.protocol.map = PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,OUTSIDE:SSL
listeners = OUTSIDE://:9094,PLAINTEXT://:9092
inter.broker.listener.name = PLAINTEXT
host.name =
port = 9092
The OUTSIDE listeners listen on kafka-0.mydomain.com, kafka-1.mydomain.com, etc., while the PLAINTEXT listeners listen on any IP, since they are cluster-local to Kubernetes.
The producer settings:
kafka:
  bootstrap-servers: kafka.mydomain.com:9092
  properties:
    security.protocol: SSL
  producer:
    batch-size: 16384
    buffer-memory: 1048576 # 1MB
    retries: 1
  ssl:
    key-password: redacted
    keystore-location: file:/var/private/ssl/kafka.client.keystore.jks
    keystore-password: redacted
    truststore-location: file:/var/private/ssl/kafka.client.truststore.jks
    truststore-password: redacted
In addition I set linger.ms to 100 in code, so that batches are sent after at most 100 ms. The linger time is set intentionally low, because the use case requires minimal delays.
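Roughly like this (a simplified sketch, not the exact code; it assumes spring-kafka's DefaultKafkaProducerFactory built on top of the Spring Boot properties above, with String keys and values):

import java.util.Map;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.springframework.boot.autoconfigure.kafka.KafkaProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;

@Configuration
public class ProducerConfiguration {

    @Bean
    public ProducerFactory<String, String> producerFactory(KafkaProperties kafkaProperties) {
        // Start from the properties Spring Boot loaded from the YAML above,
        // then override linger.ms programmatically.
        Map<String, Object> props = kafkaProperties.buildProducerProperties();
        props.put(ProducerConfig.LINGER_MS_CONFIG, 100);
        return new DefaultKafkaProducerFactory<>(props);
    }

    @Bean
    public KafkaTemplate<String, String> kafkaTemplate(ProducerFactory<String, String> producerFactory) {
        return new KafkaTemplate<>(producerFactory);
    }
}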
Analysis
What could it be?
This problem normally occurs when the producer is faster than the brokers can keep up with. In your setup the likely reason is that SSL needs extra CPU, which may slow down the brokers. In any case, check the following:
To reduce the pressure that causes this, increase buffer-memory to at least 32 MB; keep in mind that 32 MB is the default and you are setting it lower (1 MB). The smaller the buffer, the more easily it fills up; when it is full, send() blocks for at most max.block.ms, and a request times out after request.timeout.ms.
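As a rough sketch with the plain Java client property names (illustrative values, not exact recommendations; the SSL settings from your YAML are omitted):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducerFactory {

    public static Producer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.mydomain.com:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        // buffer.memory: 32 MB is the default; the 1 MB in the question fills up quickly.
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432L);
        // max.block.ms: how long send() blocks when the buffer is full (default 60000 ms).
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60000L);
        // request.timeout.ms: how long a request may stay outstanding before it times out
        // (default 30000 ms, which likely explains the "30001 ms has passed" in the error).
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000);

        return new KafkaProducer<>(props);
    }
}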
Another parameter you should increase is batch-size; note that it is specified in bytes, not in a number of messages. linger.ms should also be tuned: if the messages are created during user request time, do not set it very high; a good choice is 1-4 ms.
Messages are sent when the batch reaches batch.size, or when linger.ms elapses before enough data has accumulated to fill it. Big batches increase throughput in normal cases, but if the linger time is too low it does not help, because the batch is sent before there is enough data to reach batch.size.
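A small sketch of how batch.size and linger.ms interact (example numbers only, not tuned recommendations):

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchingSettings {

    public static Properties batchingProps() {
        Properties props = new Properties();
        // batch.size is in bytes (16384 is the default). With ~1 KB messages a batch
        // holds roughly 16 of them.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);
        // linger.ms is how long the producer waits for the batch to fill before sending.
        // If fewer than ~16 messages of that size arrive within 4 ms, the partially
        // filled batch is sent anyway when the linger expires.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 4);
        return props;
    }
}

These entries would be merged into the producer properties shown earlier.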
Also recheck in the producer logs that the properties are loaded correctly; the Kafka client normally prints its effective configuration ("ProducerConfig values: ...") at INFO level when the producer is created.