An OpenShift router update from version 3.7 to version 3.9 caused hundreds of warnings in the Hazelcast logs:
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-2] WARN com.hazelcast.nio.tcp.TcpIpConnection - [x.x.19.150]:5701 [dev] [3.11.4] Connection[id=157132, /x.x.19.150:5701->/x.x.25.1:50370, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=157132, /x.x.19.150:5701->/x.x.25.1:50370, endpoint=null, alive=true, type=NONE], thread=hz._hzInstance_1_dev.IO.thread-in-2
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-0] WARN com.hazelcast.nio.tcp.TcpIpConnection - [x.x.31.153]:5701 [dev] [3.11.4] Connection[id=156553, /x.x.31.153:5701->/x.x.9.1:48700, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=156553, /x.x.31.153:5701->/x.x.9.1:48700, endpoint=null, alive=true, type=NONE], thread=hz._hzInstance_1_dev.IO.thread-in-0
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-2] WARN com.hazelcast.nio.tcp.TcpIpConnection - [x.x.3.34]:5701 [dev] [3.11.4] Connection[id=157179, /x.x.3.34:5701->/x.x.25.1:60596, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=157179, /x.x.3.34:5701->/x.x.25.1:60596, endpoint=null, alive=true, type=NONE]
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-1] WARN com.hazelcast.nio.tcp.TcpIpConnection - [x.x.10.75]:5701 [dev] [3.11.4] Connection[id=157171, /x.x.10.75:5701->/x.x.25.1:33826, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=157171, /x.x.10.75:5701->/x.x.25.1:33826, endpoint=null, alive=true, type=NONE]
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-1] WARN com.hazelcast.nio.tcp.TcpIpConnection - [x.x.27.206]:5701 [dev] [3.11.4] Connection[id=157157, /x.x.27.206:5701->/x.x.25.1:49578, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=157157, /x.x.27.206:5701->/x.x.25.1:49578, endpoint=null, alive=true, type=NONE]
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-1] WARN com.hazelcast.nio.tcp.TcpIpConnection - [x.x.31.153]:5701 [dev] [3.11.4] Connection[id=157127, /x.x.31.153:5701->/x.x.25.1:42506, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=157127, /x.x.31.153:5701->/x.x.25.1:42506, endpoint=null, alive=true, type=NONE]
The issue was temporarily "solved" by rolling back to version 3.7: no more warnings in the Hazelcast logs.
Current findings:
hazelcast.xml:
...
<properties>
    <property name="hazelcast.discovery.enabled">true</property>
    <property name="hazelcast.logging.type">slf4j</property>
</properties>
<network>
    <port port-count="1" auto-increment="false">5701</port>
    <reuse-address>true</reuse-address>
    <join>
        <multicast enabled="false"/>
        <kubernetes enabled="true">
            <namespace>project-name</namespace>
            <service-name>hazelcast-discovery</service-name>
        </kubernetes>
    </join>
</network>
Version details:
Misc:
Questions:
Would it help to set the following explicitly?
<property name="hazelcast.discovery.enabled">true</property>
<tcp-ip enabled="false"></tcp-ip>
(I couldn't find the default value of tcp-ip in the Hazelcast documentation; the official hazelcast.xml example sets tcp-ip to false explicitly.)
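For reference, a minimal sketch of how the join section would look with that change, assuming the kubernetes block stays exactly as it is today (same placeholder namespace/service names as above):

<network>
    <port port-count="1" auto-increment="false">5701</port>
    <reuse-address>true</reuse-address>
    <join>
        <!-- explicitly disable everything except the Kubernetes discovery plugin -->
        <multicast enabled="false"/>
        <tcp-ip enabled="false"/>
        <kubernetes enabled="true">
            <namespace>project-name</namespace>
            <service-name>hazelcast-discovery</service-name>
        </kubernetes>
    </join>
</network>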
Edit 24.10.2019. Deployment details:
Bash script for probes:
#!/bin/bash
# Probe script: query the local Hazelcast member's health-check endpoint
URL="http://127.0.0.1:5701/hazelcast/health/node-state"
HTTP_RESPONSE=$(curl -m 5 -sS "$URL" | head -1)
# The probe only passes when the member reports ACTIVE
if [ "_${HTTP_RESPONSE}" != "_ACTIVE" ]; then
  echo "failure on ${URL}, response: ${HTTP_RESPONSE}"
  exit 1
fi
exit 0
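For context: as far as I understand, the /hazelcast/health/node-state URL used above is the Hazelcast 3.x REST health-check endpoint, which only responds when the health check is switched on; in our case that is presumably done somewhere in the full hazelcast.xml (not in the snippet above), roughly like this:

<properties>
    <!-- assumption: a property along these lines (Hazelcast 3.8+) enables the /hazelcast/health/* endpoints -->
    <property name="hazelcast.http.healthcheck.enabled">true</property>
</properties>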
Edit 25.10.2019:
$ oc describe svc hazelcast-discovery
Name: hazelcast-discovery
Namespace: [project-name]
Labels: app=hazelcast
template=hazelcast-statefulset-template
Annotations: service.alpha.kubernetes.io/tolerate-unready-endpoints=true
Selector: name=hazelcast-node-cluster
Type: ClusterIP
IP: None
Port: 5701-tcp 5701/TCP
TargetPort: 5701/TCP
Endpoints: x.x.1.45:5701,x.x.12.144:5701,x.x.13.251:5701 and more...
Session Affinity: None
Events: <none>
Pods were restarted after the issue, so the IPs might differ from those in the logs. Could it be connected to tolerate-unready-endpoints=true?