I'm trying to run Hazelcast on a Kubernetes cluster (provisioned with Kubespray's default configuration). The cluster assembles successfully but is unstable: members leave and rejoin for no apparent reason. If there is any load on the cluster it happens almost instantly; otherwise it takes seconds, sometimes minutes.
Here are the logs from one of the members:
2018-09-28T18:17:57.450214594Z WARNING: [10.233.90.5]:5701 [kpts-cluster] [3.10.4]
Member [10.233.71.21]:5701 - 5585d841-f30f-44e5-8290-4f777a9f9a5e is suspected to be dead for reason:
Suspecting Member [10.233.71.21]:5701 - 5585d841-f30f-44e5-8290-4f777a9f9a5e because it has not sent any heartbeats since 2018-09-28 18:16:55.957. Now: 2018-09-28 18:17:57.413, heartbeat timeout: 60000 ms, suspicion level: 1.00
2018-09-28T18:17:57.450814818Z Sep 28, 2018 6:17:57 PM com.hazelcast.nio.tcp.TcpIpConnection
2018-09-28T18:17:57.45082653Z INFO: [10.233.90.5]:5701 [kpts-cluster] [3.10.4]
Connection[id=4, /10.233.90.5:5701->/10.233.71.21:41018, endpoint=[10.233.71.21]:5701, alive=false, type=MEMBER] closed. Reason:
Suspecting Member [10.233.71.21]:5701 - 5585d841-f30f-44e5-8290-4f777a9f9a5e because it has not sent any heartbeats since 2018-09-28 18:16:55.957. Now: 2018-09-28 18:17:57.413, heartbeat timeout: 60000 ms, suspicion level: 1.00
2018-09-28T18:17:59.308401465Z Sep 28, 2018 6:17:59 PM com.hazelcast.internal.cluster.impl.ClusterHeartbeatManager
2018-09-28T18:17:59.3084277Z WARNING: [10.233.90.5]:5701 [kpts-cluster] [3.10.4] This node does not have a connection to Member [10.233.71.21]:5701 - 5585d841-f30f-44e5-8290-4f777a9f9a5e
2018-09-28T18:17:59.30843287Z Sep 28, 2018 6:17:59 PM com.hazelcast.internal.cluster.ClusterService
2018-09-28T18:17:59.308436765Z INFO: [10.233.90.5]:5701 [kpts-cluster] [3.10.4]
2018-09-28T18:17:59.308443787Z Members {size:4, ver:6} [
2018-09-28T18:17:59.308447427Z Member [10.233.97.132]:5701 - edec2d4b-2038-4d4e-a07a-d949c5eddb73
2018-09-28T18:17:59.308451285Z Member [10.233.75.68]:5701 - df4eefa7-5829-4da6-9cf5-0efcfe7aa1e7
2018-09-28T18:17:59.308455097Z Member [10.233.90.5]:5701 - a87ec39a-9df9-45b2-8be3-5a01d9c3e5a7 this
2018-09-28T18:17:59.308458882Z Member [10.233.71.21]:5701 - 5585d841-f30f-44e5-8290-4f777a9f9a5e
2018-09-28T18:17:59.308481465Z ]
2018-09-28T18:17:59.308488998Z Sep 28, 2018 6:17:59 PM com.hazelcast.nio.tcp.TcpIpConnector
2018-09-28T18:17:59.308492741Z INFO: [10.233.90.5]:5701 [kpts-cluster] [3.10.4] Connecting to /10.233.71.21:5701, timeout: 0, bind-any: true
2018-09-28T18:17:59.310401142Z Sep 28, 2018 6:17:59 PM com.hazelcast.transaction.TransactionManagerService
2018-09-28T18:17:59.310413971Z INFO: [10.233.90.5]:5701 [kpts-cluster] [3.10.4] Committing/rolling-back live transactions of [10.233.100.131]:5701, UUID: d386005a-40fc-4d2d-aeb9-f5f58216e55b
2018-09-28T18:17:59.310428599Z Sep 28, 2018 6:17:59 PM com.hazelcast.nio.tcp.TcpIpConnectionManager
2018-09-28T18:17:59.310433319Z INFO: [10.233.90.5]:5701 [kpts-cluster] [3.10.4] Established socket connection between /10.233.90.5:41613 and /10.233.71.21:5701
2018-09-28T18:17:59.376900798Z Sep 28, 2018 6:17:59 PM com.hazelcast.nio.tcp.TcpIpConnection
2018-09-28T18:17:59.376931621Z INFO: [10.233.90.5]:5701 [kpts-cluster] [3.10.4] Connection[id=5, /10.233.90.5:41613->/10.233.71.21:5701, endpoint=[10.233.71.21]:5701, alive=false, type=MEMBER] closed. Reason: Member left event received from master
2018-09-28T18:17:59.378512532Z Sep 28, 2018 6:17:59 PM com.hazelcast.transaction.TransactionManagerService
2018-09-28T18:17:59.378612815Z INFO: [10.233.90.5]:5701 [kpts-cluster] [3.10.4] Committing/rolling-back live transactions of [10.233.71.21]:5701, UUID: 5585d841-f30f-44e5-8290-4f777a9f9a5e
2018-09-28T18:17:59.378757175Z Sep 28, 2018 6:17:59 PM com.hazelcast.internal.cluster.ClusterService
2018-09-28T18:17:59.378948248Z INFO: [10.233.90.5]:5701 [kpts-cluster] [3.10.4]
2018-09-28T18:17:59.37937371Z Members {size:3, ver:7} [
2018-09-28T18:17:59.379381035Z Member [10.233.97.132]:5701 - edec2d4b-2038-4d4e-a07a-d949c5eddb73
2018-09-28T18:17:59.379475593Z Member [10.233.75.68]:5701 - df4eefa7-5829-4da6-9cf5-0efcfe7aa1e7
2018-09-28T18:17:59.379482891Z Member [10.233.90.5]:5701 - a87ec39a-9df9-45b2-8be3-5a01d9c3e5a7 this
2018-09-28T18:17:59.37948704Z ]
2018-09-28T18:18:00.978709605Z Sep 28, 2018 6:18:00 PM com.hazelcast.nio.tcp.TcpIpAcceptor
2018-09-28T18:18:00.978736307Z INFO: [10.233.90.5]:5701 [kpts-cluster] [3.10.4] Accepting socket connection from /10.233.71.21:45044
2018-09-28T18:18:00.978954156Z Sep 28, 2018 6:18:00 PM com.hazelcast.nio.tcp.TcpIpConnectionManager
2018-09-28T18:18:00.978964757Z INFO: [10.233.90.5]:5701 [kpts-cluster] [3.10.4] Established socket connection between /10.233.90.5:5701 and /10.233.71.21:45044
(Logs from 9/28/18 6:16 PM to 9/28/18 6:18 PM UTC.)
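The pattern is always the same: a member is suspected because no heartbeats arrived within the 60 s timeout, it is removed from the member list, and it rejoins moments later. I know the timeout itself is tunable via hazelcast.max.no.heartbeat.seconds; below is a minimal sketch of raising it (the class name is illustrative), but that would only mask whatever is stalling the heartbeats, so I'd rather find the root cause.

import com.hazelcast.config.Config;
import com.hazelcast.config.XmlConfigBuilder;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Sketch only: raising the member heartbeat timeout masks the symptom rather than fixing it.
public class HeartbeatTimeoutSketch {
    public static void main(String[] args) {
        // Picks up hazelcast.xml from the working directory or classpath.
        Config config = new XmlConfigBuilder().build();
        // Default is 60 seconds, i.e. the "heartbeat timeout: 60000 ms" shown in the log above.
        config.setProperty("hazelcast.max.no.heartbeat.seconds", "300");
        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
    }
}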
Here is the resource definition:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iga
spec:
  selector:
    matchLabels:
      app: iga-worker
  replicas: 4
  template:
    metadata:
      labels:
        app: iga-worker
    spec:
      containers:
        - name: iga-worker
          image: "kbhit/iga-adi-cl-worker:latest"
          imagePullPolicy: Always
          ports:
            - containerPort: 5701
          env:
            - name: JAVA_OPTS
              value: "
                -XX:+UnlockExperimentalVMOptions
                -XX:+UseCGroupMemoryLimitForHeap
                -XX:MaxRAMFraction=4
                -XshowSettings:vm
                -Dk8s.service.name=iga-adi-cl-workers
                -Dk8s.namespace=iga-adi-cl
                -Dmancenter.url=http://iga-management-center.iga-adi-cl:8080/hazelcast-mancenter
                -Dhazelcast.diagnostics.enabled=true
                -Dhazelcast.diagnostics.metric.level=info
                -Dhazelcast.diagnostics.invocation.sample.period.seconds=30
                -Dhazelcast.diagnostics.pending.invocations.period.seconds=30
                -Dhazelcast.diagnostics.slowoperations.period.seconds=30
                -Dhazelcast.log.state=true
                "
          resources:
            limits:
              cpu: 3
              memory: 5Gi
            requests:
              cpu: 3
              memory: 5Gi
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: iga-management-center
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: iga-management-center
    spec:
      containers:
        - name: hazelcast
          image: hazelcast/management-center
---
kind: Service
apiVersion: v1
metadata:
  name: iga-adi-cl-workers
spec:
  selector:
    app: iga-worker
  ports:
    - protocol: TCP
      port: 5701
      targetPort: 5701
---
kind: Service
apiVersion: v1
metadata:
  name: iga-management-center
spec:
  type: NodePort
  selector:
    app: iga-management-center
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
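The worker image just starts a plain Hazelcast member. I can't paste the whole entrypoint, but it boils down to the sketch below (class and resource names are illustrative); as far as I understand, the ${k8s.service.name}, ${k8s.namespace} and ${mancenter.url} placeholders in the XML are resolved from the -D system properties passed through JAVA_OPTS above.

import com.hazelcast.config.ClasspathXmlConfig;
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Illustrative entrypoint: the XML config builder substitutes ${...} variables
// from the JVM system properties, i.e. the -D flags set through JAVA_OPTS.
public class WorkerMain {
    public static void main(String[] args) {
        Config config = new ClasspathXmlConfig("hazelcast.xml");
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}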
And here is the complete Hazelcast config I'm using:
<?xml version="1.0" encoding="UTF-8"?>
<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.11.xsd"
xmlns="http://www.hazelcast.com/schema/config"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<properties>
<property name="hazelcast.discovery.enabled">true</property>
<property name="service-name">kpts-worker</property>
<property name="hazelcast.partition.count">271</property>
<property name="hazelcast.diagnostics.enabled">true</property>
</properties>
<group>
<name>kpts-cluster</name>
<password>kpts-cluster-pass</password>
</group>
<management-center enabled="true">${mancenter.url}</management-center>
<network>
<join>
<!-- deactivate normal discovery -->
<multicast enabled="false"/>
<tcp-ip enabled="false" />
<!-- activate the Kubernetes plugin -->
<discovery-strategies>
<discovery-strategy enabled="true"
class="com.hazelcast.kubernetes.HazelcastKubernetesDiscoveryStrategy">
<properties>
<property name="service-name">${k8s.service.name}</property>
<!--<property name="service-label-name">cluster01</property>-->
<!--<property name="service-label-value">true</property>-->
<property name="namespace">${k8s.namespace}</property>
</properties>
</discovery-strategy>
</discovery-strategies>
</join>
</network>
<partition-group enabled="false"/>
<executor-service name="default">
<pool-size>4</pool-size>
<queue-capacity>0</queue-capacity>
</executor-service>
<map name="commons">
<in-memory-format>BINARY</in-memory-format>
<backup-count>0</backup-count>
<async-backup-count>0</async-backup-count>
<near-cache>
<in-memory-format>OBJECT</in-memory-format>
<!--
Maximum size of the near cache. When max size is reached,
cache is evicted based on the policy defined.
Any integer between 0 and Integer.MAX_VALUE. 0 means
Integer.MAX_VALUE. Default is 0.
-->
<max-size>0</max-size>
<!--
Maximum number of seconds for each entry to stay in the near cache. Entries that are
older than <time-to-live-seconds> will get automatically evicted from the near cache.
Any integer between 0 and Integer.MAX_VALUE. 0 means infinite. Default is 0.
-->
<time-to-live-seconds>0</time-to-live-seconds>
<!--
Maximum number of seconds each entry can stay in the near cache as untouched (not-read).
Entries that are not read (touched) more than <max-idle-seconds> value will get removed
from the near cache.
Any integer between 0 and Integer.MAX_VALUE. 0 means
Integer.MAX_VALUE. Default is 0.
-->
<max-idle-seconds>0</max-idle-seconds>
<!--
Valid values are:
NONE (no extra eviction, <time-to-live-seconds> may still apply),
LRU (Least Recently Used),
LFU (Least Frequently Used).
NONE is the default.
Regardless of the eviction policy used, <time-to-live-seconds> will still apply.
-->
<eviction-policy>NONE</eviction-policy>
<!--
Should the cached entries get evicted if the entries are changed (updated or removed).
true or false. Default is true.
-->
<invalidate-on-change>true</invalidate-on-change>
<!--
You may also want local entries to be cached.
This is useful when the in-memory format of the near cache differs from the map's.
By default it is disabled.
-->
<cache-local-entries>true</cache-local-entries>
</near-cache>
</map>
<map name="vertices">
<!--
Data type that will be used for storing recordMap.
Possible values:
BINARY (default): keys and values will be stored as binary data
OBJECT : values will be stored in their object forms
NATIVE : values will be stored in non-heap region of JVM
-->
<in-memory-format>BINARY</in-memory-format>
<!--
Number of backups. If 1 is set as the backup-count for example,
then all entries of the map will be copied to another JVM for
fail-safety. 0 means no backup.
-->
<backup-count>0</backup-count>
<!--
Number of async backups. 0 means no backup.
-->
<async-backup-count>0</async-backup-count>
<!--
Maximum number of seconds for each entry to stay in the map. Entries that are
older than <time-to-live-seconds> and not updated for <time-to-live-seconds>
will get automatically evicted from the map.
Any integer between 0 and Integer.MAX_VALUE. 0 means infinite. Default is 0.
-->
<time-to-live-seconds>0</time-to-live-seconds>
<!--
Maximum number of seconds for each entry to stay idle in the map. Entries that are
idle(not touched) for more than <max-idle-seconds> will get
automatically evicted from the map. Entry is touched if get, put or containsKey is called.
Any integer between 0 and Integer.MAX_VALUE. 0 means infinite. Default is 0.
-->
<max-idle-seconds>0</max-idle-seconds>
<!--
Valid values are:
NONE (no eviction),
LRU (Least Recently Used),
LFU (Least Frequently Used).
NONE is the default.
-->
<eviction-policy>NONE</eviction-policy>
<!--
Maximum size of the map. When max size is reached,
map is evicted based on the policy defined.
Any integer between 0 and Integer.MAX_VALUE. 0 means
Integer.MAX_VALUE. Default is 0.
-->
<max-size policy="PER_NODE">0</max-size>
<!--
While recovering from split-brain (network partitioning),
map entries in the small cluster will merge into the bigger cluster
based on the policy set here. When an entry merges into the
cluster, there might already be an existing entry with the same key,
and the values of those entries might differ.
Which value should be kept for the key? The conflict is resolved by
the policy set here. The default policy is PutIfAbsentMapMergePolicy.
There are built-in merge policies such as
com.hazelcast.map.merge.PassThroughMergePolicy; entry will be overwritten if merging entry exists for the key.
com.hazelcast.map.merge.PutIfAbsentMapMergePolicy ; entry will be added if the merging entry doesn't exist in the cluster.
com.hazelcast.map.merge.HigherHitsMapMergePolicy ; entry with the higher hits wins.
com.hazelcast.map.merge.LatestUpdateMapMergePolicy ; entry with the latest update wins.
-->
<merge-policy>com.hazelcast.map.merge.PutIfAbsentMapMergePolicy</merge-policy>
<!--
Control caching of de-serialized values. Caching makes query evaluation faster, but it costs memory.
Possible Values:
NEVER: Never cache deserialized object
INDEX-ONLY: Caches values only when they are inserted into an index.
ALWAYS: Always cache deserialized values.
-->
<cache-deserialized-values>NEVER</cache-deserialized-values>
</map>
<serialization>
<data-serializable-factories>
<data-serializable-factory factory-id="1">
com.agh.iet.komplastech.solver.factories.HazelcastProductionFactory
</data-serializable-factory>
<data-serializable-factory factory-id="2">
com.agh.iet.komplastech.solver.factories.HazelcastGeneralFactory
</data-serializable-factory>
<data-serializable-factory factory-id="3">
com.agh.iet.komplastech.solver.factories.HazelcastProblemFactory
</data-serializable-factory>
</data-serializable-factories>
</serialization>
<services enable-defaults="true"/>
<lite-member enabled="false"/>
</hazelcast>
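I'm not configuring anything programmatically, but for reference my understanding is that the <join> block above corresponds to roughly this programmatic configuration (a sketch only; the property values come from the -D flags in JAVA_OPTS):

import com.hazelcast.config.Config;
import com.hazelcast.config.DiscoveryStrategyConfig;
import com.hazelcast.config.JoinConfig;

// Sketch: programmatic equivalent of the <join> section in the XML above.
public class JoinConfigSketch {
    public static Config build() {
        Config config = new Config();
        config.setProperty("hazelcast.discovery.enabled", "true");
        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false);   // normal discovery disabled
        join.getTcpIpConfig().setEnabled(false);
        DiscoveryStrategyConfig k8s = new DiscoveryStrategyConfig(
                "com.hazelcast.kubernetes.HazelcastKubernetesDiscoveryStrategy");
        k8s.addProperty("service-name", System.getProperty("k8s.service.name"));
        k8s.addProperty("namespace", System.getProperty("k8s.namespace"));
        join.getDiscoveryConfig().addDiscoveryStrategyConfig(k8s);
        return config;
    }
}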
And here are the client logs:
WARNING: hz.client_0 [kpts-cluster] [3.10.4] Error while fetching cluster partition table!
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=139, channel=NioChannel{/10.233.74.77:45202->/10.233.100.138:5701}, remoteEndpoint=[10.233.100.138]:5701, lastReadTime=2018-09-30 11:59:50.747, lastWriteTime=2018-09-30 12:00:48.140, closedTime=never, lastHeartbeatRequested=2018-09-30 12:00:47.934, lastHeartbeatReceived=2018-09-30 11:59:47.936, connected server version=3.10.4}
at com.hazelcast.client.spi.impl.ClientInvocationFuture.resolve(ClientInvocationFuture.java:73)
at com.hazelcast.spi.impl.AbstractInvocationFuture$1.run(AbstractInvocationFuture.java:250)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
Caused by: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=139, channel=NioChannel{/10.233.74.77:45202->/10.233.100.138:5701}, remoteEndpoint=[10.233.100.138]:5701, lastReadTime=2018-09-30 11:59:50.747, lastWriteTime=2018-09-30 12:00:48.140, closedTime=never, lastHeartbeatRequested=2018-09-30 12:00:47.934, lastHeartbeatReceived=2018-09-30 11:59:47.936, connected server version=3.10.4}
at com.hazelcast.client.spi.impl.AbstractClientInvocationService$CleanResourcesTask.notifyException(AbstractClientInvocationService.java:224)
at com.hazelcast.client.spi.impl.AbstractClientInvocationService$CleanResourcesTask.run(AbstractClientInvocationService.java:213)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
... 5 more
Caused by: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=139, channel=NioChannel{/10.233.74.77:45202->/10.233.100.138:5701}, remoteEndpoint=[10.233.100.138]:5701, lastReadTime=2018-09-30 11:59:50.747, lastWriteTime=2018-09-30 12:00:48.140, closedTime=never, lastHeartbeatRequested=2018-09-30 12:00:47.934, lastHeartbeatReceived=2018-09-30 11:59:47.936, connected server version=3.10.4}
at com.hazelcast.client.connection.nio.DefaultClientConnectionStrategy.onHeartbeatStopped(DefaultClientConnectionStrategy.java:117)
at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl.heartbeatStopped(ClientConnectionManagerImpl.java:730)
at com.hazelcast.client.connection.nio.HeartbeatManager.fireHeartbeatStopped(HeartbeatManager.java:139)
at com.hazelcast.client.connection.nio.HeartbeatManager.checkConnection(HeartbeatManager.java:98)
at com.hazelcast.client.connection.nio.HeartbeatManager.run(HeartbeatManager.java:85)
... 9 more
Sep 30, 2018 12:00:53 PM com.hazelcast.client.spi.ClientPartitionService
WARNING: hz.client_0 [kpts-cluster] [3.10.4] Error while fetching cluster partition table!
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=139, channel=NioChannel{/10.233.74.77:45202->/10.233.100.138:5701}, remoteEndpoint=[10.233.100.138]:5701, lastReadTime=2018-09-30 11:59:50.747, lastWriteTime=2018-09-30 12:00:48.140, closedTime=never, lastHeartbeatRequested=2018-09-30 12:00:47.934, lastHeartbeatReceived=2018-09-30 11:59:47.936, connected server version=3.10.4}
at com.hazelcast.client.spi.impl.ClientInvocationFuture.resolve(ClientInvocationFuture.java:73)
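The client-side failure mirrors the member-side one: heartbeats on the owner connection stop for more than 60 seconds. As with the members, I could raise the client heartbeat timeout; a sketch of that is below (the class name is illustrative, group name and password taken from my config), but again it would only hide the stall rather than explain it.

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;

// Sketch only: a longer client heartbeat timeout would hide the stall, not explain it.
public class ClientSketch {
    public static void main(String[] args) {
        ClientConfig clientConfig = new ClientConfig();
        clientConfig.getGroupConfig().setName("kpts-cluster").setPassword("kpts-cluster-pass");
        // Default is 60000 ms, matching the ~60 s gap between lastHeartbeatReceived and the timeout above.
        clientConfig.setProperty("hazelcast.client.heartbeat.timeout", "300000");
        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
    }
}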
The nodes can easily accommodate the requested resources, and no other applications are deployed on them. What might be the issue here?