I'm using Hazelcast (3.7.4) on OpenShift. Each application starts a HazelcastInstance.
The network discovery is done via hazelcast-kubernetes (1.1.0).
Sometimes, when I deploy the whole application, the cluster gets stuck in a split-brain state forever. It never heals and never merges back into a single cluster.
I have to restart pods to get a single cluster again.
Can someone help me prevent the split-brain, or at least make the cluster recover from it afterwards?
Thanks
Use a StatefulSet instead of a Deployment (or ReplicationController). The pods then start one by one, which prevents the split-brain issue. You can have a look at the official OpenShift Code Sample for Hazelcast, or specifically at the OpenShift template for Hazelcast.
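As a rough illustration of the StatefulSet suggestion, here is a minimal sketch of such a manifest. It is not taken from the official template: the resource name, labels, image, headless Service name and probe timings are all placeholders I made up, and `apps/v1` assumes a reasonably recent Kubernetes/OpenShift (older clusters used a beta API version for StatefulSet). The TCP readiness probe on Hazelcast's default port 5701 is also just one possible choice.

```yaml
# Sketch only: every name, label and image below is a placeholder.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app                          # hypothetical name
spec:
  serviceName: my-app-hazelcast         # headless Service, assumed to be defined separately
  replicas: 3
  podManagementPolicy: OrderedReady     # the default: pods are created one at a time, in order
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: example/my-app:latest  # placeholder image
          ports:
            - name: hazelcast
              containerPort: 5701       # Hazelcast's default member port
          readinessProbe:               # next pod starts only once this one is Ready
            tcpSocket:
              port: 5701
            initialDelaySeconds: 10     # assumed value, tune for your startup time
            periodSeconds: 5
```

With `OrderedReady`, the controller waits for each pod to be Running and Ready before creating the next one, so a later member should find the already-running members during discovery instead of starting its own cluster in parallel.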
What's more, try to use the latest Hazelcast version. I think it should re-form the cluster even if you use a Deployment and the cluster starts in a split-brain state.