Apache Ignite nodes deployed in the same Kubernetes namespace do not join the same cluster

7/20/2019

Apache Ignite nodes deployed as pods discover each other using TcpDiscoveryKubernetesIpFinder but cannot communicate and therefore do not join the same cluster.

I set up a Kubernetes deployment on Azure for an Ignite-based application using the "official" tutorials. At this point, the deployment is successful, but there is always only one server in the topology for each pod. When I log on to a pod directly and try to connect to the other pod on port 47500, it does not work. More interestingly, port 47500 is only reachable on 127.0.0.1 on the current pod, not on its pod IP.

Here are the debug messages on pod/node 1. As you can see, the TcpDiscoveryKubernetesIpFinder discovers the two Ignite pods/nodes, but it cannot connect to the other Ignite node:

INFO  [org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] (ServerService Thread Pool -- 5) Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
DEBUG [org.apache.ignite.internal.managers.communication.GridIoManager] (ServerService Thread Pool -- 5) Starting SPI: TcpCommunicationSpi [connectGate=null, connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@48ca2359, enableForcibleNodeKill=false, enableTroubleshootingLog=false, locAddr=null, locHost=0.0.0.0/0.0.0.0, locPort=47100, locPortRange=100, shmemPort=-1, directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=5000, maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768, msgQueueLimit=0, slowClientQueueLimit=0, nioSrvr=GridNioServer [selectorSpins=0, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=org.apache.ignite.internal.util.nio.GridDirectParser@30a29315, directMode=true], GridConnectionBytesVerifyFilter], closed=false, directBuf=true, tcpNoDelay=true, sockSndBuf=32768, sockRcvBuf=32768, writeTimeout=2000, idleTimeout=600000, skipWrite=false, skipRead=false, locAddr=0.0.0.0/0.0.0.0:47100, order=LITTLE_ENDIAN, sndQueueLimit=0, directMode=true, sslFilter=null, msgQueueLsnr=null, readerMoveCnt=0, writerMoveCnt=0, readWriteSelectorsAssign=false], shmemSrv=null, usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true, filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0, sockWriteTimeout=2000, boundTcpPort=47100, boundTcpShmemPort=-1, selectorsCnt=4, selectorSpins=0, addrRslvr=null, ctxInitLatch=java.util.concurrent.CountDownLatch@4186e275[Count = 1], stopping=false]
DEBUG [org.apache.ignite.internal.managers.communication.GridIoManager] (ServerService Thread Pool -- 5) Starting SPI implementation: org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi
DEBUG [org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] (ServerService Thread Pool -- 5) Using parameter [locAddr=null]
DEBUG [org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] (ServerService Thread Pool -- 5) Using parameter [locPort=47100]
DEBUG [org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi]  Grid runnable started: tcp-disco-srvr
DEBUG [org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder] (ServerService Thread Pool -- 5) Getting Apache Ignite endpoints from: https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/default/endpoints/ignite
DEBUG [org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder] (ServerService Thread Pool -- 5) Added an address to the list: 10.244.0.93
DEBUG [org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder] (ServerService Thread Pool -- 5) Added an address to the list: 10.244.0.94
ERROR [org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi] (ServerService Thread Pool -- 5) Exception on direct send: Invalid argument (connect failed): java.net.ConnectException: Invalid argument (connect failed)
    at java.net.PlainSocketImpl.socketConnect(Native Method)

I logged on to the pods directly and tried a ping to the other node/pod, which works, BUT neither echo > /dev/tcp/10.244.0.93/47500 nor echo > /dev/tcp/10.244.0.94/47500 worked. On the other hand, echo > /dev/tcp/127.0.0.1/47500 does, which leads me to think that Ignite is only listening on the local loopback address.

There are similar logs on pod/node 2.

Here is the Kubernetes configuration:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pgdata
  namespace: default
  annotations:
    volume.alpha.kubernetes.io/storage-class: default
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ignite
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: ignite
  namespace: default
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - endpoints
  verbs:
  - get
  - list
  - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: ignite
roleRef:
  kind: ClusterRole
  name: ignite
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: ignite
  namespace: default
---
apiVersion: v1
kind: Service
metadata:
  name: ignite
  namespace: default
spec:
  clusterIP: None # custom value.
  ports:
    - port: 9042 # custom value.
  selector:
    type: processing-engine-node
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database-tenant-1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: database-tenant-1
  template:
    metadata:
      labels:
        app: database-tenant-1
    spec:
      containers:
      - name: database-tenant-1
        image: postgres:12
        env:
        - name: "POSTGRES_USER"
          value: "admin"
        - name: "POSTGRES_PASSWORD"
          value: "admin"
        - name: "POSTGRES_DB"
          value: "tenant1"
        volumeMounts:
        - name: pgdata
          mountPath: /var/lib/postgresql/data
          subPath: postgres
        ports:
        - containerPort: 5432
        readinessProbe:
          exec:
            command: ["psql", "-W", "admin", "-U", "admin", "-d", "tenant1", "-c", "SELECT 1"]
          initialDelaySeconds: 15
          timeoutSeconds: 2
        livenessProbe:
          exec:
            command: ["psql", "-W", "admin", "-U", "admin", "-d", "tenant1", "-c", "SELECT 1"]
          initialDelaySeconds: 45
          timeoutSeconds: 2
      volumes:
        - name: pgdata
          persistentVolumeClaim:
            claimName: pgdata
---
apiVersion: v1
kind: Service
metadata:
  name: database-tenant-1
  namespace: default
  labels:
    app: database-tenant-1
spec:
  type: NodePort
  ports:
   - port: 5432
  selector:
   app: database-tenant-1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: processing-engine-master
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: processing-engine-master
  template:
    metadata:
      labels:
        app: processing-engine-master
        type: processing-engine-node
    spec:
      serviceAccountName: ignite
      initContainers:
      - name: check-db-ready
        image: postgres:12
        command: ['sh', '-c', 
          'until pg_isready -h database-tenant-1 -p 5432; 
          do echo waiting for database; sleep 2; done;']
      containers:
      - name: xxxx-engine-master
        image: shostettlerprivateregistry.azurecr.io/xxx/xxx-application:4.2.5
        ports:
            - containerPort: 8081
            - containerPort: 11211 # REST port number.
            - containerPort: 47100 # communication SPI port number.
            - containerPort: 47500 # discovery SPI port number.
            - containerPort: 49112 # JMX port number.
            - containerPort: 10800 # SQL port number.
            - containerPort: 10900 # Thin clients port number.
        volumeMounts:
        - name: config-volume
          mountPath: /opt/project-postgres.yml
          subPath: project-postgres.yml
      volumes:
          - name: config-volume
            configMap:
              name: pe-config
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: processing-engine-worker
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: processing-engine-worker
  template:
    metadata:
      labels:
        app: processing-engine-worker
        type: processing-engine-node
    spec:
      serviceAccountName: ignite
      initContainers:
      - name: check-db-ready
        image: postgres:12
        command: ['sh', '-c', 
          'until pg_isready -h database-tenant-1 -p 5432; 
          do echo waiting for database; sleep 2; done;']
      containers:
      - name: xxx-engine-worker
        image: shostettlerprivateregistry.azurecr.io/xxx/xxx-worker:4.2.5
        ports:
            - containerPort: 8081
            - containerPort: 11211 # REST port number.
            - containerPort: 47100 # communication SPI port number.
            - containerPort: 47500 # discovery SPI port number.
            - containerPort: 49112 # JMX port number.
            - containerPort: 10800 # SQL port number.
            - containerPort: 10900 # Thin clients port number.

        volumeMounts:
        - name: config-volume
          mountPath: /opt/project-postgres.yml
          subPath: project-postgres.yml
      volumes:
          - name: config-volume
            configMap:
              name: pe-config

and the Ignite config:

<bean id="tcpDiscoveryKubernetesIpFinder" class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder"/>

<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="localPort" value="47500" />
        <property name="localAddress" value="127.0.0.1" />
        <property name="networkTimeout" value="10000" />
        <property name="ipFinder">
            <bean id="tcpDiscoveryKubernetesIpFinder" class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder"/>
        </property>
    </bean>
</property>

I expect the pods to be able to communicate and to end up with the following topology snapshot:

[ver=1, locNode=a8e6a058, servers=2, clients=0, state=ACTIVE, CPUs=2, offheap=0.24GB, heap=1.5GB]
-- Steve Hostettler
ignite
kubernetes

1 Answer

7/21/2019

You configured discovery to bind to localhost:

<property name="localAddress" value="127.0.0.1" />

This means that nodes in different pods will not be able to reach each other's discovery port and therefore cannot join the same cluster. Try removing this line from the configuration.
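For reference, a minimal sketch of the discovery section with the localAddress line removed. The namespace and serviceName properties shown here are optional in your case, since "default" and "ignite" (the headless service from your manifest) are already the IP finder's defaults:

<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="localPort" value="47500" />
        <property name="networkTimeout" value="10000" />
        <property name="ipFinder">
            <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
                <!-- Optional: these match the headless "ignite" service in the "default" namespace
                     that the IP finder is already querying, per your log output. -->
                <property name="namespace" value="default" />
                <property name="serviceName" value="ignite" />
            </bean>
        </property>
    </bean>
</property>

With localAddress unset, the discovery SPI binds to all interfaces (the same way your communication SPI log already shows locHost=0.0.0.0), so port 47500 becomes reachable on the pod IP and the nodes can join each other.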

-- Valentin Kulichenko
Source: StackOverflow