Failed to join Apache Ignite topology when multiple server pods starts at the same time

10/18/2019

I am currently setting up a stateless Apache Ignite cluster in Kubernetes environment.

During disaster recovery test, I have restart multiple server Ignite nodes simultaneously and intentionally. Those Ignite server nodes was started at about the same time.

Ever since the Ignite server nodes recover, the whole Ignite cluster has gone haywire and the connection between servers and clients are lost and never been recovered.

The following line appears constantly in the Server node log:

Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=572, minorTopVer=0], node=f1f26b7e-5130-423a-b6c0-477ad58437ee]. Dumping pending objects that might be the cause: 

Edit: Added with more log showing nodes are trying to rejoin Ignite topology consistently

Added new node to topology: TcpDiscoveryNode [id=91be6833-9884-404b-8b20-afb004ce32a3, addrs=[100.64.32.153, 127.0.0.1], sockAddrs=[/100.64.32.153:0, /127.0.0.1:0], discPort=0, order=337, intOrder=212, lastExchangeTime=1571403600207, loc=false, ver=2.7.5#20190603-sha1:be4f2a15, isClient=true]
Topology snapshot [ver=337, locNode=98f9d085, servers=9, clients=78, state=ACTIVE, CPUs=152, offheap=2.3GB, heap=45.0GB]
Local node's value of 'java.net.preferIPv4Stack' system property differs from remote node's (all nodes in topology should have identical value) [locPreferIpV4=true, rmtPreferIpV4=null, locId8=98f9d085, rmtId8=4110272f, rmtAddrs=[securities-1-0-0-6d57b9989b-95wkn/100.64.0.31, /127.0.0.1], rmtNode=ClusterNode [id=4110272f-ca98-4a51-89e3-3478d87ff73e, order=338, addr=[100.64.0.31, 127.0.0.1], daemon=false]]
Added new node to topology: TcpDiscoveryNode [id=4110272f-ca98-4a51-89e3-3478d87ff73e, addrs=[100.64.0.31, 127.0.0.1], sockAddrs=[/127.0.0.1:0, /100.64.0.31:0], discPort=0, order=338, intOrder=213, lastExchangeTime=1571403600394, loc=false, ver=2.7.5#20190603-sha1:be4f2a15, isClient=true]
Topology snapshot [ver=338, locNode=98f9d085, servers=9, clients=79, state=ACTIVE, CPUs=153, offheap=2.3GB, heap=45.0GB]
Completed partition exchange [localNode=98f9d085-933a-435c-a09b-1846cf39c3b1, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=284, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=f3fb9b23-e3b0-47ab-98da-baf2421fb59a, addrs=[100.64.32.132, 127.0.0.1], sockAddrs=[/100.64.32.132:0, /127.0.0.1:0], discPort=0, order=66, intOrder=66, lastExchangeTime=1571377609149, loc=false, ver=2.7.5#20190603-sha1:be4f2a15, isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=284, minorTopVer=0], durationFromInit=104]
Finished exchange init [topVer=AffinityTopologyVersion [topVer=284, minorTopVer=0], crd=true]
Skipping rebalancing (obsolete exchange ID) [top=AffinityTopologyVersion [topVer=284, minorTopVer=0], evt=NODE_FAILED, node=f3fb9b23-e3b0-47ab-98da-baf2421fb59a]
Started exchange init [topVer=AffinityTopologyVersion [topVer=285, minorTopVer=0], mvccCrd=MvccCoordinator [nodeId=98f9d085-933a-435c-a09b-1846cf39c3b1, crdVer=1571377592872, topVer=AffinityTopologyVersion [topVer=117, minorTopVer=0]], mvccCrdChange=false, crd=true, evt=NODE_FAILED, evtNode=b4b25a6f-1d3c-411f-9d81-5593d52e9db1, customEvt=null, allowMerge=true]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=285, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=285, minorTopVer=0], err=null]
Local node's value of 'java.net.preferIPv4Stack' system property differs from remote node's (all nodes in topology should have identical value) [locPreferIpV4=true, rmtPreferIpV4=null, locId8=98f9d085, rmtId8=edc33f38, rmtAddrs=[transfer-1-0-0-846f8bf868-dnfjg/100.64.18.195, /127.0.0.1], rmtNode=ClusterNode [id=edc33f38-9c94-4c4d-a109-be722e918512, order=339, addr=[100.64.18.195, 127.0.0.1], daemon=false]]
Added new node to topology: TcpDiscoveryNode [id=edc33f38-9c94-4c4d-a109-be722e918512, addrs=[100.64.18.195, 127.0.0.1], sockAddrs=[/127.0.0.1:0, /100.64.18.195:0], discPort=0, order=339, intOrder=214, lastExchangeTime=1571403600468, loc=false, ver=2.7.5#20190603-sha1:be4f2a15, isClient=true]
Topology snapshot [ver=339, locNode=98f9d085, servers=9, clients=80, state=ACTIVE, CPUs=155, offheap=2.3GB, heap=46.0GB]
Completed partition exchange [localNode=98f9d085-933a-435c-a09b-1846cf39c3b1, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=285, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=b4b25a6f-1d3c-411f-9d81-5593d52e9db1, addrs=[100.64.19.98, 127.0.0.1], sockAddrs=[/127.0.0.1:0, /100.64.19.98:0], discPort=0, order=71, intOrder=71, lastExchangeTime=1571377609159, loc=false, ver=2.7.5#20190603-sha1:be4f2a15, isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=285, minorTopVer=0], durationFromInit=100]
Finished exchange init [topVer=AffinityTopologyVersion [topVer=285, minorTopVer=0], crd=true]
Skipping rebalancing (obsolete exchange ID) [top=AffinityTopologyVersion [topVer=285, minorTopVer=0], evt=NODE_FAILED, node=b4b25a6f-1d3c-411f-9d81-5593d52e9db1]
Started exchange init [topVer=AffinityTopologyVersion [topVer=286, minorTopVer=0], mvccCrd=MvccCoordinator [nodeId=98f9d085-933a-435c-a09b-1846cf39c3b1, crdVer=1571377592872, topVer=AffinityTopologyVersion [topVer=117, minorTopVer=0]], mvccCrdChange=false, crd=true, evt=NODE_FAILED, evtNode=c161e542-bad7-4f41-a973-54b6e6e7b555, customEvt=null, allowMerge=true]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=286, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=286, minorTopVer=0], err=null]
Completed partition exchange [localNode=98f9d085-933a-435c-a09b-1846cf39c3b1, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=286, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=c161e542-bad7-4f41-a973-54b6e6e7b555, addrs=[100.64.17.126, 127.0.0.1], sockAddrs=[/127.0.0.1:0, /100.64.17.126:0], discPort=0, order=38, intOrder=38, lastExchangeTime=1571377608515, loc=false, ver=2.7.5#20190603-sha1:be4f2a15, isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=286, minorTopVer=0], durationFromInit=20]
Finished exchange init [topVer=AffinityTopologyVersion [topVer=286, minorTopVer=0], crd=true]
Skipping rebalancing (obsolete exchange ID) [top=AffinityTopologyVersion [topVer=286, minorTopVer=0], evt=NODE_FAILED, node=c161e542-bad7-4f41-a973-54b6e6e7b555]
Started exchange init [topVer=AffinityTopologyVersion [topVer=287, minorTopVer=0], mvccCrd=MvccCoordinator [nodeId=98f9d085-933a-435c-a09b-1846cf39c3b1, crdVer=1571377592872, topVer=AffinityTopologyVersion [topVer=117, minorTopVer=0]], mvccCrdChange=false, crd=true, evt=NODE_FAILED, evtNode=0c16c5a7-6e3f-4fd4-8618-b6d8d8888af4, customEvt=null, allowMerge=true]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=287, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=287, minorTopVer=0], err=null]
Completed partition exchange [localNode=98f9d085-933a-435c-a09b-1846cf39c3b1, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=287, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=0c16c5a7-6e3f-4fd4-8618-b6d8d8888af4, addrs=[100.64.34.22, 127.0.0.1], sockAddrs=[/127.0.0.1:0, /100.64.34.22:0], discPort=0, order=25, intOrder=25, lastExchangeTime=1571377607690, loc=false, ver=2.7.5#20190603-sha1:be4f2a15, isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=287, minorTopVer=0], durationFromInit=52]
Finished exchange init [topVer=AffinityTopologyVersion [topVer=287, minorTopVer=0], crd=true]
Skipping rebalancing (obsolete exchange ID) [top=AffinityTopologyVersion [topVer=287, minorTopVer=0], evt=NODE_FAILED, node=0c16c5a7-6e3f-4fd4-8618-b6d8d8888af4]
Started exchange init [topVer=AffinityTopologyVersion [topVer=288, minorTopVer=0], mvccCrd=MvccCoordinator [nodeId=98f9d085-933a-435c-a09b-1846cf39c3b1, crdVer=1571377592872, topVer=AffinityTopologyVersion [topVer=117, minorTopVer=0]], mvccCrdChange=false, crd=true, evt=NODE_FAILED, evtNode=807333d7-0b71-4510-a35d-0ed41e068ac5, customEvt=null, allowMerge=true]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=288, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=288, minorTopVer=0], err=null]
Completed partition exchange [localNode=98f9d085-933a-435c-a09b-1846cf39c3b1, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=288, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=807333d7-0b71-4510-a35d-0ed41e068ac5, addrs=[100.64.32.231, 127.0.0.1], sockAddrs=[/127.0.0.1:0, /100.64.32.231:0], discPort=0, order=74, intOrder=74, lastExchangeTime=1571377609280, loc=false, ver=2.7.5#20190603-sha1:be4f2a15, isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=288, minorTopVer=0], durationFromInit=60]
Finished exchange init [topVer=AffinityTopologyVersion [topVer=288, minorTopVer=0], crd=true]
-- ho wing kent
ignite
kubernetes

1 Answer

10/19/2019

I have also faced the same problem. Deploying each node one by one is the only way to to resove this.,accor.to. As far as my experience with ignite .

-- shubhendra sen
Source: StackOverflow