Split-brain condition is not merged in Kubernetes environment with Hazelcast 4.2.1 although nodes seem to agree

7/20/2021

We are using Hazelcast 4.2.1 in a Kubernetes environment with openjdk:14-jdk-slim images. In our dev environment, where we only have two nodes, these two nodes sometimes (roughly after every fifth deployment) end up in a split-brain condition and do not merge, although they find each other and agree on what to do:

The joiner of the first node says the second node should join it, and the joiner of the second node says it should join the first node. But nothing happens: the log repeats every couple of minutes and the clusters are never merged.
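
To make the "agreement" concrete, here is a minimal, self-contained sketch of the decision both nodes appear to reach in the INFO lines of the logs below. It is not Hazelcast code: the class `MergeDecisionSketch`, the enum `Decision` and the method `decide` are hypothetical names, and the tie-break rule (with equal data member counts, the node whose address string sorts higher merges into the other) is an assumption that merely matches what the logs show.

```java
// Hypothetical sketch, not the Hazelcast implementation.
class MergeDecisionSketch {

    enum Decision { LOCAL_SHOULD_MERGE, REMOTE_SHOULD_MERGE }

    // Decide which side should merge, as seen from the local node.
    static Decision decide(String localAddress, int localDataMembers,
                           String remoteAddress, int remoteDataMembers) {
        if (localAddress.equals(remoteAddress)) {
            throw new IllegalArgumentException("Addresses must differ: " + localAddress);
        }
        // The cluster with fewer data members merges into the larger one.
        if (localDataMembers != remoteDataMembers) {
            return localDataMembers < remoteDataMembers
                    ? Decision.LOCAL_SHOULD_MERGE
                    : Decision.REMOTE_SHOULD_MERGE;
        }
        // Assumption: equal counts are resolved by comparing the address strings,
        // so both sides reach complementary decisions without coordinating.
        return localAddress.compareTo(remoteAddress) > 0
                ? Decision.LOCAL_SHOULD_MERGE
                : Decision.REMOTE_SHOULD_MERGE;
    }

    public static void main(String[] args) {
        String first = "[10.41.31.101]:5701";
        String second = "[10.41.31.102]:5701";
        // Both clusters have one data member, as in the logs.
        System.out.println("first node decides:  " + decide(first, 1, second, 1));
        System.out.println("second node decides: " + decide(second, 1, first, 1));
    }
}
```

Under these assumptions the first node concludes that the remote side should merge and the second node concludes that it should merge itself, which is the same pair of conclusions the two nodes log. The sketch only covers the decision; it says nothing about why the merge is not carried out afterwards.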

It does not matter whether we configure a merge policy or not. More often than not, a deployment works without any problems.

Log of first node:

2021-07-20 09:14:08.306 DEBUG 142 --- [hz.hazelcast-instance.cached.thread-4] c.h.i.cluster.impl.MembershipManager     : [10.41.31.101]:5701 [light-cluster] [4.2.1] Sending member list to the non-master nodes:

Members {size:1, ver:5} [
        Member [10.41.31.101]:5701 - 7263bccd-f330-4b96-8b52-f22db7c7a90e this
]

2021-07-20 09:14:08.446 DEBUG 142 --- [hz.hazelcast-instance.cached.thread-5] c.h.i.cluster.impl.DiscoveryJoiner       : [10.41.31.101]:5701 [light-cluster] [4.2.1] Sending SplitBrainJoinMessage to [10.41.31.102]:5701
2021-07-20 09:14:08.448 DEBUG 142 --- [hz.hazelcast-instance.cached.thread-5] c.h.i.cluster.impl.ClusterJoinManager    : [10.41.31.101]:5701 [light-cluster] [4.2.1] Checking if we should merge to: SplitBrainJoinMessage{packetVersion=4, buildNumber=20210630, memberVersion=4.2.1, clusterVersion=4.2, address=[10.41.31.102]:5701, uuid='9cdd64b4-62c8-4f19-bf29-d3cef4e8e2f6', liteMember=false, memberCount=1, dataMemberCount=1, memberListVersion=1}
2021-07-20 09:14:08.449  INFO 142 --- [hz.hazelcast-instance.cached.thread-5] c.h.i.cluster.impl.ClusterJoinManager    : [10.41.31.101]:5701 [light-cluster] [4.2.1] [10.41.31.102]:5701 should merge to us, both have the same data member count: 1
2021-07-20 09:14:23.277 DEBUG 142 --- [hz.hazelcast-instance.cached.thread-4] c.h.i.p.InternalPartitionService         : [10.41.31.101]:5701 [light-cluster] [4.2.1] Checking partition state, stamp: -5900145379368197006

Log of second node:

2021-07-20 09:14:24.149 DEBUG 141 --- [hz.hazelcast-instance.cached.thread-4] c.h.i.p.InternalPartitionService         : [10.41.31.102]:5701 [light-cluster] [4.2.1] Checking partition state, stamp: -8661523421455686299
2021-07-20 09:14:24.175 DEBUG 141 --- [hz.hazelcast-instance.cached.thread-4] c.h.s.d.integration.DiscoveryService     : [10.41.31.102]:5701 [light-cluster] [4.2.1] Using service name to discover nodes.
2021-07-20 09:14:24.176 DEBUG 141 --- [hz.hazelcast-instance.cached.thread-6] c.h.i.cluster.impl.MembershipManager     : [10.41.31.102]:5701 [light-cluster] [4.2.1] Sending member list to the non-master nodes:

Members {size:1, ver:1} [
        Member [10.41.31.102]:5701 - 9cdd64b4-62c8-4f19-bf29-d3cef4e8e2f6 this
]

2021-07-20 09:14:39.149 DEBUG 141 --- [hz.hazelcast-instance.cached.thread-4] c.h.i.p.InternalPartitionService         : [10.41.31.102]:5701 [light-cluster] [4.2.1] Checking partition state, stamp: -8661523421455686299
2021-07-20 09:14:54.148 DEBUG 141 --- [hz.hazelcast-instance.cached.thread-6] c.h.i.p.InternalPartitionService         : [10.41.31.102]:5701 [light-cluster] [4.2.1] Checking partition state, stamp: -8661523421455686299
2021-07-20 09:15:08.423 DEBUG 141 --- [hz.hazelcast-instance.priority-generic-operation.thread-0] c.h.i.cluster.impl.ClusterJoinManager    : [10.41.31.102]:5701 [light-cluster] [4.2.1] Checking if we should merge to: SplitBrainJoinMessage{packetVersion=4, buildNumber=20210630, memberVersion=4.2.1, clusterVersion=4.2, address=[10.41.31.101]:5701, uuid='7263bccd-f330-4b96-8b52-f22db7c7a90e', liteMember=false, memberCount=1, dataMemberCount=1, memberListVersion=5}
2021-07-20 09:15:08.423  INFO 141 --- [hz.hazelcast-instance.priority-generic-operation.thread-0] c.h.i.cluster.impl.ClusterJoinManager    : [10.41.31.102]:5701 [light-cluster] [4.2.1] We should merge to [10.41.31.101]:5701, both have the same data member count: 1
2021-07-20 09:15:08.424 DEBUG 141 --- [hz.hazelcast-instance.priority-generic-operation.thread-0] c.h.i.c.i.o.SplitBrainMergeValidationOp  : [10.41.31.102]:5701 [light-cluster] [4.2.1] Returning SplitBrainJoinMessage{packetVersion=4, buildNumber=20210630, memberVersion=4.2.1, clusterVersion=4.2, address=[10.41.31.102]:5701, uuid='9cdd64b4-62c8-4f19-bf29-d3cef4e8e2f6', liteMember=false, memberCount=1, dataMemberCount=1, memberListVersion=1} to [10.41.31.101]:5701
2021-07-20 09:15:09.148 DEBUG 141 --- [hz.hazelcast-instance.cached.thread-6] c.h.i.p.InternalPartitionService         : [10.41.31.102]:5701 [light-cluster] [4.2.1] Checking partition state, stamp: -8661523421455686299
-- abelmannu
hazelcast
java
kubernetes
splitbrain

0 Answers