I am deploying a simple master/slave replication configuration within a Kubernetes environment. The brokers use static connectors. When I delete the master pod the slave successfully takes over duties, but when the master pod comes back up the slave does not shut down its live activation, so I end up with two live servers. When this happens I also notice they form an internal bridge. I've run the exact same configuration locally, outside of Kubernetes, and the slave successfully terminates its live activation and goes back to being a slave when the master comes back up. Any ideas as to why this is happening? I am using Artemis version 2.6.4.
master broker.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<configuration xmlns="urn:activemq" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">
   <jms xmlns="urn:activemq:jms">
      <queue name="jms.queue.acmadapter_to_acm_design">
         <durable>true</durable>
      </queue>
   </jms>
   <core xmlns="urn:activemq:core" xsi:schemaLocation="urn:activemq:core ">
      <acceptors>
         <acceptor name="netty-acceptor">tcp://0.0.0.0:61618</acceptor>
      </acceptors>
      <connectors>
         <connector name="netty-connector-master">tcp://artemis-service-0.artemis-service.falconx.svc.cluster.local:61618</connector>
         <connector name="netty-connector-backup">tcp://artemis-service2-0.artemis-service.falconx.svc.cluster.local:61618</connector>
      </connectors>
      <ha-policy>
         <replication>
            <master>
               <!-- we need this for auto failback -->
               <check-for-live-server>true</check-for-live-server>
            </master>
         </replication>
      </ha-policy>
      <cluster-connections>
         <cluster-connection name="my-cluster">
            <connector-ref>netty-connector-master</connector-ref>
            <static-connectors>
               <connector-ref>netty-connector-master</connector-ref>
               <connector-ref>netty-connector-backup</connector-ref>
            </static-connectors>
         </cluster-connection>
      </cluster-connections>
   </core>
</configuration>
slave broker.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<configuration xmlns="urn:activemq" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">
   <core xmlns="urn:activemq:core" xsi:schemaLocation="urn:activemq:core ">
      <acceptors>
         <acceptor name="netty-acceptor">tcp://0.0.0.0:61618</acceptor>
      </acceptors>
      <connectors>
         <connector name="netty-connector-backup">tcp://artemis-service2-0.artemis-service.test.svc.cluster.local:61618</connector>
         <connector name="netty-connector-master">tcp://artemis-service-0.artemis-service.test.svc.cluster.local:61618</connector>
      </connectors>
      <ha-policy>
         <replication>
            <slave>
               <allow-failback>true</allow-failback>
               <!-- not needed but tells the backup not to restart after failback as there will be > 0 backups saved -->
               <max-saved-replicated-journals-size>0</max-saved-replicated-journals-size>
            </slave>
         </replication>
      </ha-policy>
      <cluster-connections>
         <cluster-connection name="my-cluster">
            <connector-ref>netty-connector-backup</connector-ref>
            <static-connectors>
               <connector-ref>netty-connector-master</connector-ref>
               <connector-ref>netty-connector-backup</connector-ref>
            </static-connectors>
         </cluster-connection>
      </cluster-connections>
   </core>
</configuration>
It's likely that the master's journal is being lost when its pod is stopped. The journal (specifically the server.lock file) holds the node's unique identifier, which is shared with its replicated slave. If the journal is lost when the pod is deleted then the master has no way to pair with its slave when it comes back up, which would explain the behavior you're observing. Make sure the journal is on a persistent volume claim.
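For example, assuming the claim is mounted at /var/lib/artemis/data (a hypothetical path; use whatever mount path your StatefulSet actually defines), the broker's data directories in broker.xml can be pointed at it so that server.lock survives pod deletion. This is only a minimal sketch of the relevant <core> elements, not a full configuration:

   <core xmlns="urn:activemq:core">
      <!-- /var/lib/artemis/data is an assumed PVC mount point; adjust to your volumeMount path -->
      <paging-directory>/var/lib/artemis/data/paging</paging-directory>
      <bindings-directory>/var/lib/artemis/data/bindings</bindings-directory>
      <journal-directory>/var/lib/artemis/data/journal</journal-directory>
      <large-messages-directory>/var/lib/artemis/data/large-messages</large-messages-directory>
      <!-- acceptors, connectors, ha-policy, etc. as in your existing configuration -->
   </core>

The same applies to the slave: both brokers should keep their data on storage that outlives the pod (e.g. a volumeClaimTemplates entry in the StatefulSet rather than an emptyDir).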
Also, it's worth noting that a single master/slave pair is not recommended due to the risk of split brain. In general, you should have at least three master/slave pairs so that a proper quorum can be established.