JBoss/WildFly clustering and Kubernetes

5/25/2018
  1. Current configuration:
    • 16 pods running a JBoss TCP-based cluster with JGroups GOOGLE_PING discovery. The container is deployed as a StatefulSet on the Kubernetes cluster.
  2. Without load, the initial cluster worked as expected without a single issue, but when the load increased the following behaviour was observed:
    • Some of the pods became unavailable while handling the initial load and, as a result, were restarted automatically.
    • After a restart, those pods come up with new IP addresses, but the same hosts stay in the JGroups discovery file with the old IPs. As a result, the discovery file contains hosts with multiple IP addresses:
aaa-ops-stage-0    b6418a02-4db3-0397-ba2b-5a4a3e274560    10.20.0.17:7800    F
aaa-ops-stage-1    d57dc7b7-997f-236e-eb9f-a1604ddafc8f    10.20.0.10:7800    F
aaa-ops-stage-1    63a54371-111e-f9e9-3de5-65c6f6ff9dcd    10.20.0.16:7800    F
aaa-ops-stage-1    2dfeb3d8-6cc4-03e0-719e-b4dbb8a63815    10.20.1.13:7800    T
aaa-ops-stage-0    8053ed47-ba1b-5bb1-fcd2-a2cffb154703    10.20.0.9:7800     F
aaa-ops-stage-0    7068cd6c-ff83-dd5d-1610-e5c03f089605    10.20.0.9:7800     F
aaa-ops-stage-0    6230152a-1bc7-30ed-0073-816224bcdc26    10.20.0.14:7800    F
  • When a pod is restarted in this situation, its boot is very slow because it tries to send a cluster message to every record in the discovery file above. Since aaa-ops-stage-0 now has only one, new IP, the messages to all of its stale entries simply time out. The more restarts pod 0 goes through, the more records accumulate in the discovery file; each restart adds another new IP, so the number of timeouts, and with it the boot time, keeps growing.
  • Readiness probes are implemented in the pod configuration to update the status of newly started pods, so the load balancer knows when a pod is ready to receive requests. Unfortunately, with the huge number of timeouts described above, a pod never fully boots, because it is restarted after 60 seconds of being unavailable (a sketch of such a probe section is shown below). Eventually all pods get stuck in a restart loop and the service stops completely.
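For reference, a minimal sketch of the probe section described above, as it might appear in the StatefulSet pod template; the health endpoint, port, and exact timings are assumptions, since the real values are not shown here. Note that in Kubernetes a readiness probe only removes a pod from the Service endpoints; the automatic restart comes from a liveness probe:

    # Hypothetical probe section of the StatefulSet pod template.
    # Endpoint path, port, and timings are assumptions based on the description above.
    readinessProbe:
      httpGet:
        path: /health          # assumed health endpoint of the JBoss/WildFly app
        port: 8080
      periodSeconds: 5         # gates when the load balancer sends traffic to the pod
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 60  # pod is restarted after ~60s of being unavailable
      periodSeconds: 10
      failureThreshold: 1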

I believe that if we could use sticky IPs, so that a pod that starts with 10.20.0.17 keeps this IP across restarts, we would avoid the behaviour described above and there would be no timeouts. No timeouts would eliminate the restarts triggered by the probes, and the service would stay up and running no matter the load we produce.

The question is whether it is possible to use static or sticky IP addresses for the running pods, so that those IPs persist across restarts. Any other suggestion is welcome as well!

-- Hugh Buitano
jboss
kubernetes

1 Answer

5/28/2018

There are a few ways to achieve your goal:

1. Use Kubernetes DNS names instead of IP addresses, as K.Nicholas wrote.
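With a StatefulSet you would pair the pods with a headless Service; each pod then gets a stable DNS name (e.g. aaa-ops-stage-0.<service>.<namespace>.svc.cluster.local) that survives restarts, so the JGroups initial host list can reference names instead of IPs. A minimal sketch, where the service name, namespace, label and port are assumptions:

    # Hypothetical headless Service governing the StatefulSet.
    apiVersion: v1
    kind: Service
    metadata:
      name: aaa-ops-stage        # must match .spec.serviceName of the StatefulSet
      namespace: default
    spec:
      clusterIP: None            # headless: gives each pod a stable per-pod DNS record
      selector:
        app: aaa-ops-stage       # assumed pod label
      ports:
        - name: jgroups
          port: 7800             # JGroups TCP port seen in the discovery file above

With this in place, aaa-ops-stage-0 is always reachable under the same DNS name no matter which IP it receives after a restart; newer JGroups versions also ship a DNS_PING discovery protocol that can query such a headless service directly.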

2. Use the Calico CNI plugin and its annotations:

    annotations:
      cni.projectcalico.org/ipAddrs: "[\"192.168.0.1\"]"

to specify the IP address for your pods. Information on how to configure Calico in your cluster can be found in its documentation.
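In your case the annotation would sit in the pod metadata; a minimal sketch, where the pod name, image, port, and the IP (which must belong to a Calico IP pool) are assumptions:

    # Hypothetical pod pinned to a fixed Calico-assigned address.
    apiVersion: v1
    kind: Pod
    metadata:
      name: aaa-ops-stage-0
      annotations:
        cni.projectcalico.org/ipAddrs: "[\"192.168.0.1\"]"  # must come from a Calico IP pool
    spec:
      containers:
        - name: jboss
          image: my-jboss-image:latest   # placeholder image
          ports:
            - containerPort: 7800        # JGroups TCP port

Note that a StatefulSet shares one pod template across all replicas, so a single fixed ipAddrs value cannot serve 16 pods; you would need per-pod manifests, which is part of why this approach does not scale well.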

By the way, using sticky IP addresses isn't considered good practice.

-- Nick Rak
Source: StackOverflow