Kafka pod fails to come up after pod deletion with NFS

2/26/2018

We were trying to run a Kafka cluster on Kubernetes using the NFS provisioner. The cluster came up fine. However, when we killed one of the Kafka pods, the replacement pod failed to come up.
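For reference, the claim and the dynamically provisioned volume can be inspected like this (the PV name is taken from the mount output below; these are just standard kubectl commands, nothing specific to our chart):

# List the claims and look at the NFS volume backing the broker's data dir
kubectl get pvc
kubectl describe pv pvc-ce1461b3-1b38-11e8-a88e-005056073f99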

Persistent volume before pod deletion:

# mount
10.102.32.184:/export/pvc-ce1461b3-1b38-11e8-a88e-005056073f99 on /opt/kafka/data type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.133.40.245,local_lock=none,addr=10.102.32.184)

# ls -al /opt/kafka/data/logs
total 4
drwxr-sr-x 2 99 99 152 Feb 26 21:07 .
drwxrwsrwx 3 99 99  18 Feb 26 21:07 ..
-rw-r--r-- 1 99 99   0 Feb 26 21:07 .lock
-rw-r--r-- 1 99 99   0 Feb 26 21:07 cleaner-offset-checkpoint
-rw-r--r-- 1 99 99  57 Feb 26 21:07 meta.properties
-rw-r--r-- 1 99 99   0 Feb 26 21:07 recovery-point-offset-checkpoint
-rw-r--r-- 1 99 99   0 Feb 26 21:07 replication-offset-checkpoint

# cat /opt/kafka/data/logs/meta.properties
#
#Mon Feb 26 21:07:08 UTC 2018
version=0
broker.id=1003

Deleting the pod:

kubectl delete pod kafka-iced-unicorn-1
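
The pod is recreated automatically (the brokers presumably run as a StatefulSet, so the ordinal and the claim are reused). The replacement can be watched and its logs pulled roughly like this; the label selector is a guess, not taken from our chart:

# Watch the replacement pod being scheduled, then fetch the broker log
kubectl get pods -w -l app=kafka
kubectl logs kafka-iced-unicorn-1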

The reattached persistent volume in the newly created pod:

# ls -al /opt/kafka/data/logs
total 4
drwxr-sr-x 2 99 99 180 Feb 26 21:10 .
drwxrwsrwx 3 99 99  18 Feb 26 21:07 ..
-rw-r--r-- 1 99 99   0 Feb 26 21:10 .kafka_cleanshutdown
-rw-r--r-- 1 99 99   0 Feb 26 21:07 .lock
-rw-r--r-- 1 99 99   0 Feb 26 21:07 cleaner-offset-checkpoint
-rw-r--r-- 1 99 99  57 Feb 26 21:07 meta.properties
-rw-r--r-- 1 99 99   0 Feb 26 21:07 recovery-point-offset-checkpoint
-rw-r--r-- 1 99 99   0 Feb 26 21:07 replication-offset-checkpoint

# cat /opt/kafka/data/logs/meta.properties
#
#Mon Feb 26 21:07:08 UTC 2018
version=0
broker.id=1003

We see the following error in the Kafka logs:

[2018-02-26 21:26:40,606] INFO [ThrottledRequestReaper-Produce], Starting (kafka.server.ClientQuotaManager$ThrottledRequestReaper)
[2018-02-26 21:26:40,711] FATAL [Kafka Server 1002], Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.io.IOException: Invalid argument
    at java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.io.File.createNewFile(File.java:1012)
    at kafka.utils.FileLock.<init>(FileLock.scala:28)
    at kafka.log.LogManager$$anonfun$lockLogDirs$1.apply(LogManager.scala:104)
    at kafka.log.LogManager$$anonfun$lockLogDirs$1.apply(LogManager.scala:103)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at kafka.log.LogManager.lockLogDirs(LogManager.scala:103)
    at kafka.log.LogManager.<init>(LogManager.scala:65)
    at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:648)
    at kafka.server.KafkaServer.startup(KafkaServer.scala:208)
    at io.confluent.support.metrics.SupportedServerStartable.startup(SupportedServerStartable.java:102)
    at io.confluent.support.metrics.SupportedKafka.main(SupportedKafka.java:49)
[2018-02-26 21:26:40,713] INFO [Kafka Server 1002], shutting down (kafka.server.KafkaServer)
[2018-02-26 21:26:40,715] INFO Terminate ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)

The only way around this seems to be to delete the persistent volume claim and then force delete the pod again. Alternatively, use a storage provider other than NFS (Rook works fine in this scenario).
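
Concretely, the workaround amounts to something like this (a sketch; the claim name follows the usual <volumeClaimTemplate>-<pod> convention and is a guess, not copied from the cluster):

# Drop the claim so a fresh volume is provisioned, then force the pod to be recreated
kubectl delete pvc datadir-kafka-iced-unicorn-1
kubectl delete pod kafka-iced-unicorn-1 --grace-period=0 --force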

Has anyone come across this issue with the NFS provisioner?

-- js_3135843443153
apache-kafka
kubernetes
nfs

0 Answers