Apache Ignite nodes deployed as pods discover each other using TcpDiscoveryKubernetesIpFinder but cannot communicate and therefore do not join the same cluster.
I set up a Kubernetes deployment on Azure for an Ignite-based application following the "official" tutorials. At this point the deployment is successful, but each pod always sees only one server in its topology. When I log on to a pod directly and try to connect to the other pod on port 47500, it does not work. More interestingly, port 47500 is only reachable on 127.0.0.1 on the current pod, not on its pod IP.
Here are the debug messages on pod/node 1. As you can see, TcpDiscoveryKubernetesIpFinder discovers the two Ignite pods/nodes, but the node cannot connect to the other Ignite node:
INFO [org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] (ServerService Thread Pool -- 5) Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
DEBUG [org.apache.ignite.internal.managers.communication.GridIoManager] (ServerService Thread Pool -- 5) Starting SPI: TcpCommunicationSpi [connectGate=null, connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@48ca2359, enableForcibleNodeKill=false, enableTroubleshootingLog=false, locAddr=null, locHost=0.0.0.0/0.0.0.0, locPort=47100, locPortRange=100, shmemPort=-1, directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=5000, maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768, msgQueueLimit=0, slowClientQueueLimit=0, nioSrvr=GridNioServer [selectorSpins=0, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=org.apache.ignite.internal.util.nio.GridDirectParser@30a29315, directMode=true], GridConnectionBytesVerifyFilter], closed=false, directBuf=true, tcpNoDelay=true, sockSndBuf=32768, sockRcvBuf=32768, writeTimeout=2000, idleTimeout=600000, skipWrite=false, skipRead=false, locAddr=0.0.0.0/0.0.0.0:47100, order=LITTLE_ENDIAN, sndQueueLimit=0, directMode=true, sslFilter=null, msgQueueLsnr=null, readerMoveCnt=0, writerMoveCnt=0, readWriteSelectorsAssign=false], shmemSrv=null, usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true, filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0, sockWriteTimeout=2000, boundTcpPort=47100, boundTcpShmemPort=-1, selectorsCnt=4, selectorSpins=0, addrRslvr=null, ctxInitLatch=java.util.concurrent.CountDownLatch@4186e275[Count = 1], stopping=false]
DEBUG [org.apache.ignite.internal.managers.communication.GridIoManager] (ServerService Thread Pool -- 5) Starting SPI implementation: org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi
DEBUG [org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] (ServerService Thread Pool -- 5) Using parameter [locAddr=null]
DEBUG [org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] (ServerService Thread Pool -- 5) Using parameter [locPort=47100]
DEBUG [org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi] Grid runnable started: tcp-disco-srvr
DEBUG [org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder] (ServerService Thread Pool -- 5) Getting Apache Ignite endpoints from: https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/default/endpoints/ignite
DEBUG [org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder] (ServerService Thread Pool -- 5) Added an address to the list: 10.244.0.93
DEBUG [org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder] (ServerService Thread Pool -- 5) Added an address to the list: 10.244.0.94
ERROR [org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi] (ServerService Thread Pool -- 5) Exception on direct send: Invalid argument (connect failed): java.net.ConnectException: Invalid argument (connect failed)
at java.net.PlainSocketImpl.socketConnect(Native Method)
I logged on to the pods directly and pinged the other node/pod, which works, BUT neither
echo > /dev/tcp/10.244.0.93/47500
nor
echo > /dev/tcp/10.244.0.94/47500
worked. On the other hand,
echo > /dev/tcp/127.0.0.1/47500
does, which leads me to think that Ignite is only listening on the local loopback address.
There are similar logs on pod/node 2.
Here is the Kubernetes configuration:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: pgdata
namespace: default
annotations:
volume.alpha.kubernetes.io/storage-class: default
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 1Gi
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: ignite
namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: ignite
namespace: default
rules:
- apiGroups:
- ""
resources:
- pods
- endpoints
verbs:
- get
- list
- watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
name: ignite
roleRef:
kind: ClusterRole
name: ignite
apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
name: ignite
namespace: default
---
apiVersion: v1
kind: Service
metadata:
name: ignite
namespace: default
spec:
clusterIP: None # custom value.
ports:
- port: 9042 # custom value.
selector:
type: processing-engine-node
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: database-tenant-1
namespace: default
spec:
replicas: 1
selector:
matchLabels:
app: database-tenant-1
template:
metadata:
labels:
app: database-tenant-1
spec:
containers:
- name: database-tenant-1
image: postgres:12
env:
- name: "POSTGRES_USER"
value: "admin"
- name: "POSTGRES_PASSWORD"
value: "admin"
- name: "POSTGRES_DB"
value: "tenant1"
volumeMounts:
- name: pgdata
mountPath: /var/lib/postgresql/data
subPath: postgres
ports:
- containerPort: 5432
readinessProbe:
exec:
command: ["psql", "-W", "admin", "-U", "admin", "-d", "tenant1", "-c", "SELECT 1"]
initialDelaySeconds: 15
timeoutSeconds: 2
livenessProbe:
exec:
command: ["psql", "-W", "admin", "-U", "admin", "-d", "tenant1", "-c", "SELECT 1"]
initialDelaySeconds: 45
timeoutSeconds: 2
volumes:
- name: pgdata
persistentVolumeClaim:
claimName: pgdata
---
apiVersion: v1
kind: Service
metadata:
name: database-tenant-1
namespace: default
labels:
app: database-tenant-1
spec:
type: NodePort
ports:
- port: 5432
selector:
app: database-tenant-1
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: processing-engine-master
namespace: default
spec:
replicas: 1
selector:
matchLabels:
app: processing-engine-master
template:
metadata:
labels:
app: processing-engine-master
type: processing-engine-node
spec:
serviceAccountName: ignite
initContainers:
- name: check-db-ready
image: postgres:12
command: ['sh', '-c',
'until pg_isready -h database-tenant-1 -p 5432;
do echo waiting for database; sleep 2; done;']
containers:
- name: xxxx-engine-master
image: shostettlerprivateregistry.azurecr.io/xxx/xxx-application:4.2.5
ports:
- containerPort: 8081
- containerPort: 11211 # REST port number.
- containerPort: 47100 # communication SPI port number.
- containerPort: 47500 # discovery SPI port number.
- containerPort: 49112 # JMX port number.
- containerPort: 10800 # SQL port number.
- containerPort: 10900 # Thin clients port number.
volumeMounts:
- name: config-volume
mountPath: /opt/project-postgres.yml
subPath: project-postgres.yml
volumes:
- name: config-volume
configMap:
name: pe-config
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: processing-engine-worker
namespace: default
spec:
replicas: 1
selector:
matchLabels:
app: processing-engine-worker
template:
metadata:
labels:
app: processing-engine-worker
type: processing-engine-node
spec:
serviceAccountName: ignite
initContainers:
- name: check-db-ready
image: postgres:12
command: ['sh', '-c',
'until pg_isready -h database-tenant-1 -p 5432;
do echo waiting for database; sleep 2; done;']
containers:
- name: xxx-engine-worker
image: shostettlerprivateregistry.azurecr.io/xxx/xxx-worker:4.2.5
ports:
- containerPort: 8081
- containerPort: 11211 # REST port number.
- containerPort: 47100 # communication SPI port number.
- containerPort: 47500 # discovery SPI port number.
- containerPort: 49112 # JMX port number.
- containerPort: 10800 # SQL port number.
- containerPort: 10900 # Thin clients port number.
volumeMounts:
- name: config-volume
mountPath: /opt/project-postgres.yml
subPath: project-postgres.yml
volumes:
- name: config-volume
configMap:
name: pe-config
and the Ignite config:
<bean id="tcpDiscoveryKubernetesIpFinder" class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder"/>
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="localPort" value="47500" />
<property name="localAddress" value="127.0.0.1" />
<property name="networkTimeout" value="10000" />
<property name="ipFinder">
<bean id="tcpDiscoveryKubernetesIpFinder" class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder"/>
</property>
</bean>
</property>
I expect the pods to be able to communicate and to end up with the following topology:
Topology snapshot [ver=1, locNode=a8e6a058, servers=2, clients=0, state=ACTIVE, CPUs=2, offheap=0.24GB, heap=1.5GB]
You configured the discovery SPI to bind to localhost:
<property name="localAddress" value="127.0.0.1" />
With this setting the discovery port 47500 is only reachable on the loopback interface, so nodes in different pods cannot join each other. Try removing this line from the configuration.
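For reference, a minimal sketch of the discovery section with the localAddress line removed (all other values taken from your configuration) would look like this:

<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <!-- No localAddress: by default the SPI binds to all available interfaces,
             so other pods can reach port 47500 via the pod IP. -->
        <property name="localPort" value="47500" />
        <property name="networkTimeout" value="10000" />
        <property name="ipFinder">
            <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder"/>
        </property>
    </bean>
</property>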