Hazelcast discovery in OpenShift: connection reset warnings from router IPs

10/23/2019

An update of the OpenShift routers from version 3.7 to version 3.9 caused hundreds of warnings in the Hazelcast logs:

[timestamp] [hz._hzInstance_1_dev.IO.thread-in-2] WARN com.hazelcast.nio.tcp.TcpIpConnection - [x.x.19.150]:5701 [dev] [3.11.4] Connection[id=157132, /x.x.19.150:5701->/x.x.25.1:50370, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=157132, /x.x.19.150:5701->/x.x.25.1:50370, endpoint=null, alive=true, type=NONE], thread=hz._hzInstance_1_dev.IO.thread-in-2 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-0] WARN  com.hazelcast.nio.tcp.TcpIpConnection - [x.x.31.153]:5701 [dev] [3.11.4] Connection[id=156553, /x.x.31.153:5701->/x.x.9.1:48700, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=156553, /x.x.31.153:5701->/x.x.9.1:48700, endpoint=null, alive=true, type=NONE], thread=hz._hzInstance_1_dev.IO.thread-in-0 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-2] WARN  com.hazelcast.nio.tcp.TcpIpConnection - [x.x.3.34]:5701 [dev] [3.11.4] Connection[id=157179, /x.x.3.34:5701->/x.x.25.1:60596, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=157179, /x.x.3.34:5701->/x.x.25.1:60596, endpoint=null, alive=true, type=NONE]
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-1] WARN  com.hazelcast.nio.tcp.TcpIpConnection - [x.x.10.75]:5701 [dev] [3.11.4] Connection[id=157171, /x.x.10.75:5701->/x.x.25.1:33826, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=157171, /x.x.10.75:5701->/x.x.25.1:33826, endpoint=null, alive=true, type=NONE]
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-1] WARN  com.hazelcast.nio.tcp.TcpIpConnection - [x.x.27.206]:5701 [dev] [3.11.4] Connection[id=157157, /x.x.27.206:5701->/x.x.25.1:49578, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=157157, /x.x.27.206:5701->/x.x.25.1:49578, endpoint=null, alive=true, type=NONE]
[timestamp] [hz._hzInstance_1_dev.IO.thread-in-1] WARN  com.hazelcast.nio.tcp.TcpIpConnection - [x.x.31.153]:5701 [dev] [3.11.4] Connection[id=157127, /x.x.31.153:5701->/x.x.25.1:42506, endpoint=null, alive=false, type=NONE] closed. Reason: Exception in Connection[id=157127, /x.x.31.153:5701->/x.x.25.1:42506, endpoint=null, alive=true, type=NONE]

The issue was temporarily "solved" by rolling back to version 3.7: no more warnings appear in the Hazelcast logs.

Current findings:

  • All exceptions contain remote IPs that end in x.x.x.1; those are the OpenShift router IPs.
  • The remote-side ports vary: 50370, 48700, 60596, 39840, 35046, 59900, etc. (they look like ephemeral ports).
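A quick way to confirm this pattern (the log filename is an assumption) is to pull the remote peer out of each warning and count occurrences per IP:

```shell
# Extract the remote peer address (the part after "->") from each
# TcpIpConnection warning and count how often each remote IP occurs.
grep -o -- '->/[^,]*' hazelcast.log | cut -d/ -f2 | cut -d: -f1 | sort | uniq -c | sort -rn
```

In our logs every remote IP in the output ends in .1.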

hazelcast.xml:

...
<properties>
    <property name="hazelcast.discovery.enabled">true</property>
    <property name="hazelcast.logging.type">slf4j</property>
</properties>
<network>
    <port port-count="1" auto-increment="false">5701</port>
    <reuse-address>true</reuse-address>
    <join>
        <multicast enabled="false"/>
        <kubernetes enabled="true">
            <namespace>project-name</namespace>
            <service-name>hazelcast-discovery</service-name>
        </kubernetes>
    </join>
</network>

Versions details:

  • Hazelcast version: 3.11.4
  • OpenShift Master: v3.9.68 (Kubernetes Master: v1.9.1+a0ce1bc657), routers version 3.9

Misc:

  • The issue is not reproduced on another cluster (OpenShift 3.11, router 3.9, Hazelcast 3.11.4).
  • The issue was reproduced in the same cluster with Hazelcast version 3.10.

Questions:

  • What is the root cause of these warnings?
  • Can we tune our configuration to avoid such connections?
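Not an answer to the root cause, but since hazelcast.logging.type is set to slf4j, the noise itself could be suppressed at the logging layer. A minimal sketch, assuming logback is the slf4j backend:

```xml
<!-- Raise the threshold for the noisy connection logger only;
     all other Hazelcast logging keeps its configured level. -->
<logger name="com.hazelcast.nio.tcp.TcpIpConnection" level="ERROR"/>
```

This hides the symptom without explaining where the connections come from, so it is only a stopgap.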

Would any of the following help?

  • removing <property name="hazelcast.discovery.enabled">true</property>
  • adding <tcp-ip enabled="false"></tcp-ip> (I could not find the default value for tcp-ip in the Hazelcast documentation; the official hazelcast.xml example sets tcp-ip to false explicitly.)
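For reference, the join section with tcp-ip explicitly disabled would look like this (a sketch only; whether it changes anything is exactly the open question):

```xml
<join>
    <multicast enabled="false"/>
    <tcp-ip enabled="false"/>
    <kubernetes enabled="true">
        <namespace>project-name</namespace>
        <service-name>hazelcast-discovery</service-name>
    </kubernetes>
</join>
```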

Edit 24.10.2019. Deployment details:

  • The application and Hazelcast run in the same project. The application connects to Hazelcast by service name: hazelcast:5701.
  • We use a custom livenessProbe/readinessProbe that runs a bash script every 10s. See below.
  • We also have a route for Hazelcast. The endpoint /hazelcast/rest/cluster showed the correct number of members.
  • Here is the full Hazelcast config: https://gitlab.com/snippets/1907166 Are our service settings correct?

bash script for probes:

#!/bin/bash
# Probe script: query the local Hazelcast health endpoint and
# fail unless the node reports ACTIVE.
URL="http://127.0.0.1:5701/hazelcast/health/node-state"
HTTP_RESPONSE=$(curl -m 5 -sS "$URL" | head -1)
if [ "_${HTTP_RESPONSE}" != "_ACTIVE" ]; then
  echo "failure on ${URL}, response: ${HTTP_RESPONSE}"
  exit 1
fi
exit 0

Edit 25.10.2019:

$ oc describe svc hazelcast-discovery
Name:              hazelcast-discovery
Namespace:         [project-name]
Labels:            app=hazelcast
                   template=hazelcast-statefulset-template
Annotations:       service.alpha.kubernetes.io/tolerate-unready-endpoints=true

Selector:          name=hazelcast-node-cluster
Type:              ClusterIP
IP:                None
Port:              5701-tcp  5701/TCP
TargetPort:        5701/TCP
Endpoints:         x.x.1.45:5701,x.x.12.144:5701,x.x.13.251:5701 and more...
Session Affinity:  None
Events:            <none>

The pods were restarted after the issue, so the IPs may differ from those in the logs. Could this be related to tolerate-unready-endpoints=true?
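One experiment (an assumption, not a confirmed fix) would be to create the service without that annotation, so that pods failing the readinessProbe are dropped from the published endpoints and from discovery. A sketch, using the names from the oc describe output above:

```yaml
# Hypothetical variant of the hazelcast-discovery service with the
# tolerate-unready-endpoints annotation removed: only pods passing
# their readinessProbe would be published as endpoints.
apiVersion: v1
kind: Service
metadata:
  name: hazelcast-discovery
  labels:
    app: hazelcast
spec:
  clusterIP: None          # headless, as in the current setup
  selector:
    name: hazelcast-node-cluster
  ports:
    - name: 5701-tcp
      port: 5701
      targetPort: 5701
```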

-- idobr
Tags: hazelcast, kubernetes, openshift
