Neo4j Causal Cluster replication fails with "Connection reset by peer"

4/9/2019

We are running a three-node Neo4j causal cluster (deployed on Kubernetes), and the leader seems to have trouble replicating transactions to its followers. We are seeing the following error/warning in debug.log:

2019-04-09 16:21:52.008+0000 WARN [o.n.c.c.t.TxPullRequestHandler] Streamed transactions [868842--868908] to /10.0.31.11:38968 Connection reset by peer
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:51)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
        at java.lang.Thread.run(Thread.java:748)
        at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:110)

In our application, this surfaces as the following error:

Database not up to the requested version: 868969. Latest database version is 868967

The errors occur when we apply WRITE loads to the cluster using an asynchronous worker process that reads chunks of data from a queue and pushes them into the database.
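Roughly, the worker does the following. This is a minimal sketch rather than our exact code: the driver version (1.7), the bolt+routing URI, the credentials, the Item label and the Cypher statement are placeholders for illustration.

import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;

import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;

import static org.neo4j.driver.v1.Values.parameters;

public class ChunkWriter implements Runnable {

    // Placeholder routing URI and credentials; bolt+routing routes write transactions to the current leader.
    private final Driver driver = GraphDatabase.driver(
            "bolt+routing://neo4j.default.svc.cluster.local:7687",
            AuthTokens.basic("neo4j", "secret"));

    private final BlockingQueue<List<Map<String, Object>>> queue;

    public ChunkWriter(BlockingQueue<List<Map<String, Object>>> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        try (Session session = driver.session()) {
            while (!Thread.currentThread().isInterrupted()) {
                // Each queue element is one chunk of rows read by the asynchronous worker.
                List<Map<String, Object>> chunk = queue.take();
                // writeTransaction runs the work in a managed transaction and retries
                // transient failures (e.g. leader switches) automatically.
                session.writeTransaction(tx -> tx.run(
                        "UNWIND $rows AS row MERGE (n:Item {id: row.id}) SET n += row",
                        parameters("rows", chunk)));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}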

We have looked into obvious culprits:

  • Network bandwidth limits are not being reached
  • No obvious CPU or memory peaks
  • No other Neo4j exceptions (specifically, no OOMs)
  • We have unbound/rebound the cluster and performed a validity check on the database(s) (they're all fine)
  • We tweaked causal_clustering.pull_interval to 30s (see the neo4j.conf line after this list), which seems to improve performance but does not alleviate this issue
  • We have removed resource constraints on the database pods to rule out Kubernetes bugs that can throttle CPU even when the actual CPU limits are not reached; this also did nothing to alleviate the issue
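For reference, the pull interval tweak mentioned above is this single setting in neo4j.conf:

causal_clustering.pull_interval=30s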
-- Matthijs van der Kroon
kubernetes
neo4j

0 Answers