ECK master node not discovered on master node restart

8/12/2021

I have a simple Elasticsearch cluster running on a Kubernetes cluster, deployed with the Elasticsearch operator (ECK) version 1.7.

This is how my Elasticsearch object looks:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: sifter-elastic-data-factory
spec:
  version: 7.10.1
  nodeSets:
    - name: master
      count: 1
      config:
        node.roles: [ master ]

      podTemplate:
        spec:
          initContainers:
            - name: sysctl
              securityContext:
                privileged: true
              command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 8Gi
                  cpu: 3000m
                limits:
                  memory: 8Gi
                  cpu: 3000m
              env:
                - name: ES_JAVA_OPTS
                  value: -Xms6g -Xmx6g
                - name: cluster.initial_master_nodes
                  value: "sifter-elastic-data-factory-es-master-0"
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data-data-factory
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 50Gi
            storageClassName: ssd
    - name: data
      count: 3
      config:
        node.roles: [ data, ingest ]
      podTemplate:
        spec:
          initContainers:
            - name: sysctl
              securityContext:
                privileged: true
              command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 8Gi
                  cpu: 3000m
                limits:
                  memory: 8Gi
                  cpu: 3000m
              env:
                - name: ES_JAVA_OPTS
                  value: -Xms6g -Xmx6g
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data-data-factory
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 60Gi
            storageClassName: ssd
  http:
    service:
      spec:
        type: ClusterIP
    tls:
      selfSignedCertificate:
        disabled: true

This works fine if one of the data nodes is restarted: the Kubernetes StatefulSet brings the deleted pod back up, and the new pod discovers the current ES master and picks up from there.
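
For example, after deleting a data pod I can confirm that the restarted pod rejoins and still sees the same master. The service and secret names below follow the usual ECK naming conventions (<cluster-name>-es-http and <cluster-name>-es-elastic-user); adjust if yours differ.

# Port-forward the ECK-created HTTP service (TLS is disabled in my manifest)
kubectl port-forward service/sifter-elastic-data-factory-es-http 9200 &

# The 'elastic' user's password lives in the <cluster-name>-es-elastic-user secret
PW=$(kubectl get secret sifter-elastic-data-factory-es-elastic-user \
  -o go-template='{{.data.elastic | base64decode}}')

# Shows the current master node; after a data-pod restart this stays unchanged
curl -s -u "elastic:$PW" http://localhost:9200/_cat/master?v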

But if the master node dies (or is deleted), the new master node logs one of the following errors:

"Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid gsBEw4N2S-K31IxI4tu4-w than local cluster uuid QlL6zADsR_-8cF7mW4n9Og, rejecting",

or

{"type": "server", "timestamp": "2021-08-12T18:32:34,916Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "xxxxx-elastic-data-factory", "node.name": "xxxxx-elastic-data-factory-es-master-0", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and cluster.initial_master_nodes is empty on this node: have discovered {xxxxx-elastic-data-factory-es-master-0}{6ftRopASSq-jAh-Y7DOy_g}{vdTgG6vFSweeMmuTdCOkVw}{10.1.7.72}{10.1.7.72:9300}{lmr}{k8s_node_name=aks-npdev-10099729-vmss0000dv, ml.machine_memory=8589934592, xpack.installed=true, transform.node=false, ml.max_open_jobs=20}; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, ::1:9300, ::1:9301, ::1:9302, ::1:9303, ::1:9304, ::1:9305] from hosts providers and {xxxxx-elastic-data-factory-es-master-0}{6ftRopASSq-jAh-Y7DOy_g}{vdTgG6vFSweeMmuTdCOkVw}{10.1.7.72}{10.1.7.72:9300}{lmr}{k8s_node_name=aks-npdev-10099729-vmss0000dv, ml.machine_memory=8589934592, xpack.installed=true, transform.node=false, ml.max_open_jobs=20} from last-known cluster state; node term 0, last-accepted version 0 in term 0" }

I am not sure why I get different errors on different occasions.
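
While debugging the UUID error, I compared the cluster UUID reported by Elasticsearch with the one I believe the operator records as an annotation on the Elasticsearch resource. The annotation name below is my reading of the ECK internals, so it may differ on other versions. Reusing the port-forward and $PW from above:

# Cluster UUID as reported by the running cluster
curl -s -u "elastic:$PW" http://localhost:9200/ | grep cluster_uuid

# Cluster UUID recorded by ECK at bootstrap time (annotation name is my
# assumption from reading the operator code; verify on your version)
kubectl get elasticsearch sifter-elastic-data-factory \
  -o jsonpath='{.metadata.annotations.elasticsearch\.k8s\.elastic\.co/cluster-uuid}'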

What should I do to keep things running even if the master node goes down?
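
For reference, this is the direction I am considering, though I have not verified it: run three dedicated master nodes so the cluster keeps a quorum when one dies, and drop the manual cluster.initial_master_nodes env var, since my understanding is that ECK injects the bootstrap configuration itself. A sketch of the changed nodeSet:

nodeSets:
  - name: master
    count: 3   # a quorum of dedicated masters, so losing one pod is survivable
    config:
      node.roles: [ master ]
    # no cluster.initial_master_nodes here; my understanding is that the
    # operator manages cluster bootstrapping on its own

Is that the right direction?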

-- Ojas Kale
elastic-cloud
elasticsearch
kubernetes

0 Answers