ActiveMQ Artemis: Zombie replica (slave) instance

12/9/2021

We have deployed ActiveMQ Artemis v2.19.0 in an HA+Cluster configuration, hosted on Kubernetes (non-cloud), and use JGroups KUBE_PING for broker discovery. During regular operation, we have 2 primary and 2 replica brokers and everything looks fine.

Artemis cluster in normal state

For testing, we removed the replica instances (no Pods left) and ended up with a weird cluster state: 2 primaries and 1 zombie replica connected to primary 1. The replica instances were shut down gracefully by scaling the corresponding StatefulSet to zero, i.e., there was no hard kill.

Artemis cluster with zombie replica broker

Restarting the replicas brings the cluster back to a normal state – sometimes.

According to the docs, the missing broker instances should be removed:

If it has not received a broadcast from a particular server for a length of time it will remove that server's entry from its list.

So the questions are: Why do we see the zombie broker (even after hours)? And how can we get back to a clean state without shutting down all instances?
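
For reference, the timeout in the quoted sentence should correspond, as far as we understand the docs, to the refresh-timeout of the discovery group in broker.xml. A minimal sketch of the relevant fragment is shown here. The group names and the timing values are assumptions for illustration only; the channel and connector names are the ones that appear in the logs further down. This is not a copy of our actual broker.xml.

```xml
<!-- broker.xml fragment (inside <core>); names and values are illustrative assumptions -->
<broadcast-groups>
  <broadcast-group name="cluster-broadcast-group">
    <!-- how often this broker announces its connector, in milliseconds -->
    <broadcast-period>2000</broadcast-period>
    <jgroups-file>jgroups.xml</jgroups-file>
    <jgroups-channel>active_broadcast_channel</jgroups-channel>
    <connector-ref>artemis-tls-connector</connector-ref>
  </broadcast-group>
</broadcast-groups>

<discovery-groups>
  <discovery-group name="cluster-discovery-group">
    <jgroups-file>jgroups.xml</jgroups-file>
    <jgroups-channel>active_broadcast_channel</jgroups-channel>
    <!-- an entry should be dropped if no broadcast from that broker arrives within this time, in milliseconds -->
    <refresh-timeout>10000</refresh-timeout>
  </discovery-group>
</discovery-groups>
```

With values like these, we would expect the entry of a stopped broker to disappear from the discovery list roughly ten seconds after its last broadcast, which is not what we observe.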

Here is our jgroups.xml:

<config xmlns="urn:org:jgroups"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">

  <TCP
    enable_diagnostics="true"
    bind_addr="match-interface:eth0,lo"
    bind_port="7800"
    recv_buf_size="20000000"
    send_buf_size="640000"
    max_bundle_size="64000"
    max_bundle_timeout="30"
    sock_conn_timeout="300"

    thread_pool.enabled="true"
    thread_pool.min_threads="2"
    thread_pool.max_threads="8"
    thread_pool.keep_alive_time="5000"
    thread_pool.queue_enabled="true"
    thread_pool.queue_max_size="10000"
    thread_pool.rejection_policy="run"

    oob_thread_pool.enabled="true"
    oob_thread_pool.min_threads="1"
    oob_thread_pool.max_threads="8"
    oob_thread_pool.keep_alive_time="5000"
    oob_thread_pool.queue_enabled="true"
    oob_thread_pool.queue_max_size="100"
    oob_thread_pool.rejection_policy="run"
  />

  <TRACE/>

  <org.jgroups.protocols.kubernetes.KUBE_PING
    namespace="${kubernetesNamespace:default}"
    labels="artemis-cluster=${clusterName:activemq-artemis}"
  />

  <MERGE3 min_interval="10000" max_interval="30000"/>
  <FD_SOCK/>
  <FD timeout="3000" max_tries="3" />
  <VERIFY_SUSPECT timeout="1500" />
  <BARRIER />
  <pbcast.NAKACK2 use_mcast_xmit="false" discard_delivered_msgs="true"/>
  <UNICAST3
    xmit_table_num_rows="100"
    xmit_table_msgs_per_row="1000"
    xmit_table_max_compaction_time="30000"
  />
  <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400000"/>
  <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true"/>
  <!-- <FC max_credits="2000000" min_threshold="0.10"/> -->
  <MFC max_credits="2M" min_threshold="0.4"/>
  <FRAG2 frag_size="60000" />
  <pbcast.STATE_TRANSFER/>
  <!-- <pbcast.FLUSH timeout="0"/> -->

</config>

Update

We configured logging as advised by Domenico. This time, when we shut down the replica brokers, both continued to exist as zombie instances:

Artemis cluster status

Artemis cluster topology with zombie replicas

Here are the logs (shutdown of the replica instances started at 2021-12-09T13:03:43Z):

------------------- TRACE (sent) -----------------------
MSG, arg=[dst: <null>, src: <null> (1 headers), size=0 bytes, flags=OOB|INTERNAL, transient_flags=DONT_LOOPBACK] (headers=NAKACK2: [HIGHEST_SEQNO, seqno=1727])
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: <null>, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=DONT_BUNDLE|INTERNAL] (headers=MERGE3: INFO: view_id=[ha-asa-activemq-artemis-primary-1-53544|5], logical_name=ha-asa-activemq-artemis-primary-1-53544, physical_addr=172.30.20.216:7800, TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
SET_PHYSICAL_ADDRESS, arg=ha-asa-activemq-artemis-primary-1-53544 : 172.30.20.216:7800
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: <null>, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=606 bytes] (headers=NAKACK2: [MSG, seqno=1733], TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11249,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"receiving 606","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11248,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"receiving 606","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11252,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Received nodeID caec362d-58dc-11ec-9bf0-d2725171aa2d with originatingID = caad7f99-58dc-11ec-867d-ce446123ae5c","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11254,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Received nodeID d444f140-58dc-11ec-9bf0-d2725171aa2d with originatingID = caad7f99-58dc-11ec-867d-ce446123ae5c","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11256,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Received 1 discovery entry elements","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.256Z","sequence":11258,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Received 1 discovery entry elements","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11261,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"DiscoveryEntry[nodeID=caad7f99-58dc-11ec-867d-ce446123ae5c, connector=TransportConfiguration(name=artemis-tls-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?trustStorePassword=****&tcpReceiveBufferSize=1048576&port=61617&sslEnabled=true&host=ha-asa-activemq-artemis-primary-1-ha-asa-activemq-artemis-default-svc-bbscluster-hemisphere-local&trustStorePath=/var/lib/artemis/certs/truststore-jks&useEpoll=true&tcpSendBufferSize=1048576, lastUpdate=1639055241256]","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11260,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"DiscoveryEntry[nodeID=caad7f99-58dc-11ec-867d-ce446123ae5c, connector=TransportConfiguration(name=artemis-tls-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?trustStorePassword=****&tcpReceiveBufferSize=1048576&port=61617&sslEnabled=true&host=ha-asa-activemq-artemis-primary-1-ha-asa-activemq-artemis-default-svc-bbscluster-hemisphere-local&trustStorePath=/var/lib/artemis/certs/truststore-jks&useEpoll=true&tcpSendBufferSize=1048576, lastUpdate=1639055241256]","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11264,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"changed = false","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11266,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"changed = false","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11268,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Calling notifyAll","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11270,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.core.cluster.DiscoveryGroup","level":"DEBUG","message":"Calling notifyAll","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11272,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.api.core.JGroupsBroadcastEndpoint","level":"TRACE","message":"Receiving Broadcast: clientOpened=true, channelOPen=true","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-892093608)","threadId":78,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
{"timestamp":"2021-12-09T13:07:21.257Z","sequence":11274,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.api.core.JGroupsBroadcastEndpoint","level":"TRACE","message":"Receiving Broadcast: clientOpened=true, channelOPen=true","threadName":"activemq-discovery-group-thread-cluster-discovery-group0 (DiscoveryGroup-176376157)","threadId":91,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
---------------- TRACE (received) ----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-0-4048, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat, TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-1-53544, src: <null> (1 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat ack)
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-1-53544, src: <null> (1 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat)
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-0-4048, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat ack, TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
{"timestamp":"2021-12-09T13:07:22.908Z","sequence":11276,"loggerClassName":"java.util.logging.Logger","loggerName":"org.apache.activemq.artemis.api.core.JGroupsBroadcastEndpoint","level":"TRACE","message":"Broadcasting: BroadCastOpened=true, channelOPen=true","threadName":"Thread-1 (ActiveMQ-scheduled-threads)","threadId":85,"mdc":{},"ndc":"","hostName":"ha-asa-activemq-artemis-primary-0","processName":"Artemis","processId":303}
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: <null>, src: ha-asa-activemq-artemis-primary-0-4048 (1 headers), size=606 bytes, transient_flags=DONT_LOOPBACK] (headers=NAKACK2: [MSG, seqno=1728])
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: <null>, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=OOB|INTERNAL] (headers=NAKACK2: [HIGHEST_SEQNO, seqno=1733], TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: <null>, src: <null> (1 headers), size=0 bytes, flags=OOB|INTERNAL, transient_flags=DONT_LOOPBACK] (headers=NAKACK2: [HIGHEST_SEQNO, seqno=1728])
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-0-4048, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat, TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-1-53544, src: <null> (1 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat ack)
--------------------------------------------------------
------------------- TRACE (sent) -----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-1-53544, src: <null> (1 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat)
--------------------------------------------------------
---------------- TRACE (received) ----------------------
MSG, arg=[dst: ha-asa-activemq-artemis-primary-0-4048, src: ha-asa-activemq-artemis-primary-1-53544 (2 headers), size=0 bytes, flags=INTERNAL] (headers=FD: heartbeat ack, TP: [cluster_name=active_broadcast_channel])
--------------------------------------------------------
-- Stephan
activemq-artemis
jgroups
kubernetes
service-discovery
