Neo4j Causal Cluster replication fails with "Connection reset by peer"

4/9/2019

We are running a three-node Neo4j causal cluster (deployed on Kubernetes), and the leader seems to have trouble replicating transactions to its followers. We are seeing the following error/warning in debug.log:

2019-04-09 16:21:52.008+0000 WARN [o.n.c.c.t.TxPullRequestHandler] Streamed transactions [868842--868908] to /10.0.31.11:38968 Connection reset by peer
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:51)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
        at java.lang.Thread.run(Thread.java:748)
        at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:110)

In our application, this surfaces as the following error:

Database not up to the requested version: 868969. Latest database version is 868967

The errors occur when we apply WRITE loads to the cluster using an asynchronous worker process that reads chunks of data from a queue and pushes them into the database.
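Roughly, the worker does the following. This is a minimal sketch rather than our exact code: the driver version (1.7), the bolt+routing URI, the credentials, the Item label and the Cypher statement are placeholders for illustration.

import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;

import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;

import static org.neo4j.driver.v1.Values.parameters;

public class ChunkWriter implements Runnable {

    // Placeholder routing URI and credentials; bolt+routing routes write transactions to the current leader.
    private final Driver driver = GraphDatabase.driver(
            "bolt+routing://neo4j.default.svc.cluster.local:7687",
            AuthTokens.basic("neo4j", "secret"));

    private final BlockingQueue<List<Map<String, Object>>> queue;

    public ChunkWriter(BlockingQueue<List<Map<String, Object>>> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        try (Session session = driver.session()) {
            while (!Thread.currentThread().isInterrupted()) {
                // Each queue element is one chunk of rows read by the asynchronous worker.
                List<Map<String, Object>> chunk = queue.take();
                // writeTransaction runs the work in a managed transaction and retries
                // transient failures (e.g. leader switches) automatically.
                session.writeTransaction(tx -> tx.run(
                        "UNWIND $rows AS row MERGE (n:Item {id: row.id}) SET n += row",
                        parameters("rows", chunk)));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}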

We have looked into obvious culprits:

  • Network bandwidth limits are not being reached
  • No obvious CPU or memory peaks
  • No other Neo4j exceptions (specifically, no OOMs)
  • We have unbound/rebound the cluster and performed a validity check on the database(s) (they're all fine)
  • We tweaked causal_clustering.pull_interval to 30s (see the neo4j.conf line after this list), which seems to improve performance but does not alleviate this issue
  • We have removed resource constraints on the database pods to rule out Kubernetes bugs that can throttle CPU even when the actual CPU limits are not reached; this also did nothing to alleviate the issue
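For reference, the pull interval tweak mentioned above is this single setting in neo4j.conf:

causal_clustering.pull_interval=30s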
-- Matthijs van der Kroon
kubernetes
neo4j

0 Answers