Cassandra timeout during write query at consistency QUORUM

4/4/2018

I'm running a Lagom 1.4 application with a Cassandra 3.11.2 backend, deployed to AWS via Kubernetes. There are two Cassandra replicas. Everything runs on i3.xlarge instances, and each Cassandra replica has a 4 GB EBS volume attached for storage (in ReadWriteOnce mode).
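
For context, the storage is claimed per replica roughly like this (a sketch of the StatefulSet volume claim; the claim name and storage class are illustrative, not copied from the actual manifests):

    # Sketch only - claim name and storage class are illustrative
    volumeClaimTemplates:
      - metadata:
          name: cassandra-data
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp2      # assumed default AWS EBS class
          resources:
            requests:
              storage: 4Gi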

I'm getting occasional exceptions for "Cassandra timeout during write query at consistency QUORUM":

2018-04-04T18:42:19.642Z [error] akka.stream.scaladsl.RestartWithBackoffSource [sourceThread=application-akka.actor.default-dispatcher-3, akkaSource=RestartWithBackoffSource(akka://application), sourceActorSystem=application, akkaTimestamp=18:42:19.637UTC] - Restarting graph due to failure
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency QUORUM (1 replica were required but only 0 acknowledged the write)
    at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:503)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:462)
    at akka.persistence.cassandra.package$ListenableFutureConverter$anon$2$anonfun$run$2.apply(package.scala:25)
    at scala.util.Try$.apply(Try.scala:191)
    at akka.persistence.cassandra.package$ListenableFutureConverter$anon$2.run(package.scala:25)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency QUORUM (1 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:100)
    at com.datastax.driver.core.Responses$Error.asException(Responses.java:134)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:507)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1075)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:998)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1414)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:945)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:146)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency QUORUM (1 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:60)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:38)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:289)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:269)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
    ... 22 common frames omitted

I have configured a generous write timeout and a periodic commit log synchronization strategy:

      - name: WRITE_REQUEST_TIMEOUT_IN_MS
        value: "20000"
      - name: COMMITLOG_SYNC
        value: "periodic"
      - name: COMMITLOG_SYNC_PERIOD_IN_MS
        value: "10000"

Cassandra feels sluggish in general. In the Cassandra logs, I see a lot of entries like this:

INFO  18:38:49 G1 Young Generation GC in 405ms.  G1 Eden Space: 120586240 -> 0; G1 Old Gen: 393197456 -> 394246032; G1 Survivor Space: 5242880 -> 11534336; 
INFO  18:38:51 G1 Young Generation GC in 301ms.  G1 Eden Space: 41943040 -> 0; G1 Old Gen: 394246048 -> 405698464; G1 Survivor Space: 11534336 -> 4194304; 
INFO  18:39:06 G1 Young Generation GC in 292ms.  G1 Eden Space: 46137344 -> 0; G1 Old Gen: 307863800 -> 282697960; G1 Survivor Space: 7340032 -> 6291456; 
INFO  18:39:08 G1 Young Generation GC in 304ms.  G1 Eden Space: 47185920 -> 0; G1 Old Gen: 282697960 -> 259629304; G1 Survivor Space: 6291456 -> 7340032; 
INFO  18:39:10 G1 Young Generation GC in 204ms.  G1 Eden Space: 46137344 -> 0; G1 Old Gen: 260677912 -> 259105040; G1 Survivor Space: 7340032 -> 6291456; 
INFO  18:39:35 G1 Young Generation GC in 203ms.  G1 Eden Space: 156237824 -> 0; G1 Old Gen: 299370920 -> 300419496; G1 Survivor Space: 9437184 -> 14680064; 
INFO  18:39:43 G1 Young Generation GC in 301ms.  G1 Eden Space: 117440512 -> 0; G1 Old Gen: 306153944 -> 317825504; G1 Survivor Space: 15728640 -> 9437184; 
INFO  18:39:57 G1 Young Generation GC in 290ms.  G1 Eden Space: 158334976 -> 0; G1 Old Gen: 325588488 -> 330408464; G1 Survivor Space: 13631488 -> 14680064; 
INFO  18:40:00 G1 Young Generation GC in 208ms.  G1 Eden Space: 101711872 -> 0; G1 Old Gen: 331457072 -> 341942832; G1 Survivor Space: 14680064 -> 9437184; 
INFO  18:40:10 G1 Young Generation GC in 211ms.  G1 Eden Space: 328204288 -> 0; G1 Old Gen: 345088624 -> 346137200; G1 Survivor Space: 9437184 -> 19922944; 
WARN  18:40:14 Out of 31 commit log syncs over the past 296.80s with average duration of 336.74ms, 31 have exceeded the configured commit interval by an average of 111.32ms
INFO  18:40:14 G1 Young Generation GC in 208ms.  G1 Eden Space: 135266304 -> 0; G1 Old Gen: 346137200 -> 368059496; G1 Survivor Space: 19922944 -> 7340032; 
INFO  18:40:31 G1 Young Generation GC in 493ms.  G1 Eden Space: 208666624 -> 0; G1 Old Gen: 367535232 -> 374973064; G1 Survivor Space: 16777216 -> 12582912; 
INFO  18:40:57 G1 Young Generation GC in 300ms.  G1 Eden Space: 68157440 -> 0; G1 Old Gen: 389128912 -> 402026704; G1 Survivor Space: 15728640 -> 5242880; 
INFO  18:41:08 G1 Young Generation GC in 203ms.  G1 Eden Space: 39845888 -> 0; G1 Old Gen: 352218840 -> 308988128; G1 Survivor Space: 13631488 -> 4194304; 
INFO  18:41:10 G1 Young Generation GC in 204ms.  G1 Eden Space: 49283072 -> 0; G1 Old Gen: 308988128 -> 277979360; G1 Survivor Space: 4194304 -> 5242880; 
INFO  18:41:12 G1 Young Generation GC in 304ms.  G1 Eden Space: 48234496 -> 0; G1 Old Gen: 279027968 -> 249667840; G1 Survivor Space: 5242880 -> 7340032; 
INFO  18:41:14 G1 Young Generation GC in 298ms.  G1 Eden Space: 46137344 -> 0; G1 Old Gen: 249667840 -> 226074872; G1 Survivor Space: 7340032 -> 5242880; 
INFO  18:41:16 G1 Young Generation GC in 202ms.  G1 Eden Space: 48234496 -> 0; G1 Old Gen: 226074872 -> 224502016; 
INFO  18:41:43 G1 Young Generation GC in 205ms.  G1 Eden Space: 225443840 -> 0; G1 Old Gen: 258939328 -> 260512184; G1 Survivor Space: 10485760 -> 15728640; 
INFO  18:42:07 G1 Young Generation GC in 202ms.  G1 Eden Space: 222298112 -> 0; G1 Old Gen: 289514048 -> 291086904; G1 Survivor Space: 13631488 -> 17825792; 
INFO  18:42:17 G1 Young Generation GC in 296ms.  G1 Eden Space: 177209344 -> 0; G1 Old Gen: 292135512 -> 303354464; G1 Survivor Space: 23068672 -> 16777216; 
INFO  18:42:20 G1 Young Generation GC in 205ms.  G1 Eden Space: 106954752 -> 0; G1 Old Gen: 304403072 -> 316252808; G1 Survivor Space: 16777216 -> 9437184; 
INFO  18:42:31 G1 Young Generation GC in 393ms.  G1 Eden Space: 196083712 -> 0; G1 Old Gen: 316777608 -> 317301904; G1 Survivor Space: 13631488 -> 17825792; 
INFO  18:42:33 G1 Young Generation GC in 206ms.  G1 Eden Space: 35651584 -> 0; G1 Old Gen: 317301904 -> 333834384; G1 Survivor Space: 17825792 -> 3145728; 

And even this:

WARN  18:40:53 Out of 0 commit log syncs over the past 0.00s with average duration of Infinityms, 1 have exceeded the configured commit interval by an average of 171.42ms

Worst of all, eventual consistency is broken: there are now entities in the registry that have never been propagated to the read side by the event processors. This must be where the timeouts occurred.
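
For reference, Lagom 1.4 persists events through akka-persistence-cassandra, which (if I understand the defaults correctly) writes at QUORUM; the relevant application.conf keys would be along these lines (a sketch, not my actual configuration):

    # Sketch of the akka-persistence-cassandra journal settings (not my actual config)
    cassandra-journal {
      write-consistency = "QUORUM"   # presumably where the QUORUM in the exception comes from
      read-consistency = "QUORUM"
      replication-factor = 1         # illustrative; only used when the plugin creates the keyspace
    }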

Why does garbage collection happen so often and take so long? Do you think this is the main cause of the performance problems? What else might I be missing?

-- silverberry
amazon-web-services
cassandra
database-performance
kubernetes
lagom
