We are deploying a four-instance Ignite cluster in Kubernetes, along with several microservices that rely upon it. The servers host several caches, some of which are read-through and/or write-behind. The Kubernetes service and the clients are configured for discovery as directed in the Ignite documentation.
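For context, the server-side discovery setup follows the documented Kubernetes IP finder pattern. A minimal sketch of what that looks like (the namespace and service name here are placeholders, not our actual values):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder;

public class ServerNode {
    public static void main(String[] args) {
        // The Kubernetes IP finder resolves peer addresses through the
        // headless service that fronts the Ignite pods, per the Ignite docs.
        TcpDiscoveryKubernetesIpFinder ipFinder = new TcpDiscoveryKubernetesIpFinder();
        ipFinder.setNamespace("ignite");            // placeholder namespace
        ipFinder.setServiceName("ignite-service");  // placeholder service name

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoSpi);

        Ignite ignite = Ignition.start(cfg);
    }
}
```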
Often, on server startup, at least one instance never connects to the rest of the cluster, and that instance may need to be deleted and replaced multiple times before it joins.
Server instances also segment out of the cluster, sometimes two at a time, for no recognizable reason. Kubernetes replaces those pods, with results similar to what happens on cluster startup.
Clients also suddenly become unable to find a server instance and cannot reconnect to the cluster until the entire server cluster is restarted.
The cluster may run for days without issue, then suddenly server instances can't see each other, then clients can't find servers, and we spend a day bouncing everything over and over until it works. We have tried several changes to the Kubernetes service, client, and server configs, but we have been unable to even diagnose, much less repair, this issue.
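To give a sense of what we have experimented with, here is a rough sketch of the kind of discovery and failure-detection settings we have been adjusting (the values shown are placeholders, not our production config, and we are not sure any of them are the right levers):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.plugin.segmentation.SegmentationPolicy;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class TuningSketch {
    public static IgniteConfiguration tune(IgniteConfiguration cfg, TcpDiscoverySpi discoSpi) {
        // Give nodes longer to find the cluster before giving up on join.
        discoSpi.setJoinTimeout(60_000);
        discoSpi.setNetworkTimeout(10_000);

        // Loosen failure detection so transient pod/network hiccups are
        // less likely to be treated as node failures.
        cfg.setFailureDetectionTimeout(30_000);
        cfg.setClientFailureDetectionTimeout(60_000);

        // If a node does get segmented, restart the JVM and let
        // Kubernetes recreate the pod instead of leaving it hanging.
        cfg.setSegmentationPolicy(SegmentationPolicy.RESTART_JVM);

        cfg.setDiscoverySpi(discoSpi);
        return cfg;
    }
}
```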
We do suspect this is specific to Kubernetes, since a similar configuration previously ran fine in a VM-based environment without these issues.
Help appreciated.