We have deployed ActiveMQ Artemis v2.19.0 in an HA+Cluster configuration, hosted on Kubernetes (non-cloud), and use JGroups KUBE_PING for broker discovery. During regular operation, we have 2 primary and 2 replica brokers, and everything looks fine.
For testing, we removed the replica instances by scaling the corresponding StatefulSet to zero, i.e., a regular shutdown and no hard kill, so no replica Pods were left. We ended up with a weird cluster state: 2 primaries and 1 zombie replica connected to primary 1.
Restarting the replicas sometimes, but not always, brings the cluster back to a normal state.
According to the docs, the missing broker instances should be removed:
"If it has not received a broadcast from a particular server for a length of time it will remove that server's entry from its list."
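As far as we understand the docs, the "length of time" in that passage is the discovery group's refresh-timeout in broker.xml. The fragment below is only a sketch of what that section typically looks like, not our exact configuration: the group names and the numeric values are assumptions for illustration, while the channel and connector names are the ones that appear in the logs further down.

```xml
<!-- Sketch of a broker.xml fragment; names and values are illustrative assumptions -->
<broadcast-groups>
   <broadcast-group name="cluster-broadcast-group"> <!-- hypothetical name -->
      <!-- interval in ms between broadcasts of this broker's connector -->
      <broadcast-period>2000</broadcast-period>
      <jgroups-file>jgroups.xml</jgroups-file>
      <jgroups-channel>active_broadcast_channel</jgroups-channel>
      <connector-ref>artemis-tls-connector</connector-ref>
   </broadcast-group>
</broadcast-groups>

<discovery-groups>
   <discovery-group name="cluster-discovery-group"> <!-- hypothetical name -->
      <jgroups-file>jgroups.xml</jgroups-file>
      <jgroups-channel>active_broadcast_channel</jgroups-channel>
      <!-- time in ms without a broadcast before a server's entry is dropped -->
      <refresh-timeout>10000</refresh-timeout>
   </discovery-group>
</discovery-groups>
```

If we read the docs correctly, with values like these an entry that is no longer broadcast should expire within roughly ten seconds of its last broadcast.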
So the questions are: Why do we see the zombie broker (even after hours)? And how can we get back to a clean state without shutting down all instances?
Here is our jgroups.xml:
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">
    <TCP
        enable_diagnostics="true"
        bind_addr="match-interface:eth0,lo"
        bind_port="7800"
        recv_buf_size="20000000"
        send_buf_size="640000"
        max_bundle_size="64000"
        max_bundle_timeout="30"
        sock_conn_timeout="300"
        thread_pool.enabled="true"
        thread_pool.min_threads="2"
        thread_pool.max_threads="8"
        thread_pool.keep_alive_time="5000"
        thread_pool.queue_enabled="true"
        thread_pool.queue_max_size="10000"
        thread_pool.rejection_policy="run"
        oob_thread_pool.enabled="true"
        oob_thread_pool.min_threads="1"
        oob_thread_pool.max_threads="8"
        oob_thread_pool.keep_alive_time="5000"
        oob_thread_pool.queue_enabled="true"
        oob_thread_pool.queue_max_size="100"
        oob_thread_pool.rejection_policy="run"
    />
    <TRACE/>
    <org.jgroups.protocols.kubernetes.KUBE_PING
        namespace="${kubernetesNamespace:default}"
        labels="artemis-cluster=${clusterName:activemq-artemis}"
    />
    <MERGE3 min_interval="10000" max_interval="30000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    <BARRIER />
    <pbcast.NAKACK2 use_mcast_xmit="false" discard_delivered_msgs="true"/>
    <UNICAST3
        xmit_table_num_rows="100"
        xmit_table_msgs_per_row="1000"
        xmit_table_max_compaction_time="30000"
    />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true"/>
    <!-- <FC max_credits="2000000" min_threshold="0.10"/> -->
    <MFC max_credits="2M" min_threshold="0.4"/>
    <FRAG2 frag_size="60000" />
    <pbcast.STATE_TRANSFER/>
    <!-- <pbcast.FLUSH timeout="0"/> -->
</config>
We configured logging as advised by Domenico. This time, when we shut down the replica brokers, both continued to exist as zombie instances.
Here are the logs (shutdown of the replica instances started at 2021-12-09T13:03:43Z):
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: <null>, src: <null> (1 headers), size=0 bytes, flags=OOB|INTERNAL, transient_flags=DONT_LOOPBACK] (headers=NAKACK2: [HIGHEST_SEQNO, seqno=1727])
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: <null>, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=DONT_BUNDLE|INTERNAL] (headers=MERGE3: INFO: view_id=[ha-asa-activemq-artemis-primary-1-53544|5], logical_name=ha-asa-activemq-artemis-primary-1-53544, physical_addr=172.30.20.216:7800, TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
SET_PHYSICAL_ADDRESS, arg=ha-asa-activemq-artemis-primary-1-53544 : 172.30.20.216:7800
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: <null>, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=606 bytes] (headers=NAKACK2: [MSG, seqno=1733], TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11249,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"receiving 606","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11248,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"receiving 606","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11252,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Received nodeID caec362d-58dc-11ec-9bf0-d2725171aa2d with originatingID = caad7f99-58dc-11ec-867d-ce446123ae5c","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11254,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Received nodeID d444f140-58dc-11ec-9bf0-d2725171aa2d with originatingID = caad7f99-58dc-11ec-867d-ce446123ae5c","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11256,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Received 1 discovery entry elements","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11258,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Received 1 discovery entry elements","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11261,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"DiscoveryEntry[nodeID=caad7f99-58dc-11ec-867d-ce446123ae5c, connector=TransportConfiguration(name=artemis-tls-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?trustStorePassword=****&tcpReceiveBufferSize=1048576&port=61617&sslEnabled=true&host=ha-asa-activemq-artemis-primary-1-ha-asa-activemq-artemis-default-svc-bbscluster-hemisphere-local&trustStorePath=/var/lib/artemis/certs/truststore-jks&useEpoll=true&tcpSendBufferSize=1048576, lastUpdate=1639055241256]","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11260,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"DiscoveryEntry[nodeID=caad7f99-58dc-11ec-867d-ce446123ae5c, connector=TransportConfiguration(name=artemis-tls-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?trustStorePassword=****&tcpReceiveBufferSize=1048576&port=61617&sslEnabled=true&host=ha-asa-activemq-artemis-primary-1-ha-asa-activemq-artemis-default-svc-bbscluster-hemisphere-local&trustStorePath=/var/lib/artemis/certs/truststore-jks&useEpoll=true&tcpSendBufferSize=1048576, lastUpdate=1639055241256]","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11264,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"changed = false","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11266,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"changed = false","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11268,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Calling notifyAll","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11270,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Calling notifyAll","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11272,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.api.core.JGroupsBroadcastEndpoint","level":"TRACE","message":"Receiving Broadcast: clientOpened=true, channelOPen=true","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11274,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.api.core.JGroupsBroadcastEndpoint","level":"TRACE","message":"Receiving Broadcast: clientOpened=true, channelOPen=true","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
---------------- TRACE (received) ----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-0-4048, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat, TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-1-53544, src: <null> (1 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat ack)
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-1-53544, src: <null> (1 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat)
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-0-4048, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat ack, TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
{"timestamp":"2021-12-09T13:07:22.908Z","sequence":11276,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.api.core.JGroupsBroadcastEndpoint","level":"TRACE","message":"Broadcasting: BroadCastOpened=true, channelOPen=true","threadName":"Thread-1 (ActiveMQ-scheduled-threads)","threadId":85,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: <null>, src: ha-asa-activemq-artemis-primary-0-4048 (1 headers), size=606 bytes, transient_flags=DONT_LOOPBACK] (headers=NAKACK2: [MSG, seqno=1728])
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: <null>, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=OOB|INTERNAL] (headers=NAKACK2: [HIGHEST_SEQNO, seqno=1733], TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: <null>, src: <null> (1 headers), size=0 bytes, flags=OOB|INTERNAL, transient_flags=DONT_LOOPBACK] (headers=NAKACK2: [HIGHEST_SEQNO, seqno=1728])
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-0-4048, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat, TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-1-53544, src: <null> (1 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat ack)
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-1-53544, src: <null> (1 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat)
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-0-4048, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat ack, TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------