RabbitMQ pod on Kubernetes stuck in pod initialization state

7/4/2021

I am running a 3-node RabbitMQ cluster on Kubernetes. The Kubernetes cluster runs on AWS spot instances, and one of the Kubernetes nodes on which a RabbitMQ pod was running got terminated unexpectedly. The pod was then scheduled onto another node, and since then it has been stuck in the pod initialization state.

Kubernetes event says "FailedPostStartHook".

Logs:

9m46s       Warning   FailedPostStartHook      pod/rabbitmq-0   Exec lifecycle hook ([/bin/sh -c until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_devops(c96c1a6e-bf9a-450d-828d-ed0e8a0ad949)" failed - error: command '/bin/sh -c until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running
In addition to the diagnostics info below:
 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local']

Kubernetes statefulset manifest:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: devops
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: rabbitmq
  serviceName: rabbitmq-service
  template:
    metadata:
      annotations:
      labels:
        app: rabbitmq
      name: rabbitmq
    spec:
      containers:
      - env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        - name: RABBITMQ_BASIC_AUTH
          valueFrom:
            secretKeyRef:
              key: password
              name: rabbitmq
        - name: RABBITMQ_NODENAME
          value: rabbit@$(HOSTNAME).rabbitmq-service.$(NAMESPACE).svc.cluster.local
        - name: K8S_SERVICE_NAME
          value: rabbitmq-service
        - name: RABBITMQ_DEFAULT_USER
          value: admin
        - name: RABBITMQ_DEFAULT_PASS
          valueFrom:
            secretKeyRef:
              key: password
              name: rabbitmq
        - name: RABBITMQ_ERLANG_COOKIE
          value: some-cookie
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: rabbitmq:3.8.1-management-alpine
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
        livenessProbe:
          exec:
            command:
            - rabbitmqctl
            - status
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        name: rabbitmq
        ports:
        - containerPort: 4369
          protocol: TCP
        - containerPort: 5672
          protocol: TCP
        - containerPort: 5671
          protocol: TCP
        - containerPort: 25672
          protocol: TCP
        - containerPort: 15672
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - rabbitmqctl
            - status
          failureThreshold: 3
          initialDelaySeconds: 20
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        resources:
          limits:
            cpu: "2"
            memory: 3Gi
          requests:
            cpu: "1"
            memory: 2Gi
        volumeMounts:
        - mountPath: /var/lib/rabbitmq/
          name: rabbitmq-data
        - mountPath: /etc/rabbitmq
          name: config
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /bin/bash
        - -euc
        - |
          rm -f /var/lib/rabbitmq/.erlang.cookie
          cp /rabbitmqconfig/rabbitmq.conf /etc/rabbitmq/rabbitmq.conf
          cp /rabbitmqconfig/enabled_plugins /etc/rabbitmq/enabled_plugins
        image: rabbitmq:3.8.1-management-alpine
        imagePullPolicy: Always
        name: copy-rabbitmq-config
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /rabbitmqconfig
          name: rabbitmq-configmap
        - mountPath: /etc/rabbitmq
          name: config
        - mountPath: /var/lib/rabbitmq
          name: rabbitmq-data
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: rabbitmq
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: rabbitmq.conf
            path: rabbitmq.conf
          - key: enabled_plugins
            path: enabled_plugins
          name: rabbitmq-configmap
        name: rabbitmq-configmap
      - emptyDir: {}
        name: config
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: rabbitmq-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: gp2
      volumeMode: Filesystem

Things I have tried:

  1. Logged into the stuck pod and executed the following (the command just hung without any response):

rabbitmqctl stop_app

  2. Tried deleting the pod forcefully, but no luck.

  3. Logged into the stuck pod and executed:

rabbitmqctl reset

  4. Logged into the stuck pod and executed:

rabbitmqctl force_boot

  5. Logged into the stuck pod and executed:

rm /var/log/rabbitmq/*

None of the above helped.
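For reference, the forced deletion mentioned above is typically issued like this (a sketch, assuming the pod name and `devops` namespace from the manifest):

```shell
# Bypass the grace period and remove the pod object immediately;
# the StatefulSet controller then recreates rabbitmq-0 from the template
kubectl delete pod rabbitmq-0 -n devops --grace-period=0 --force
```

Note that a force delete only removes the pod object from the API server; it does not clean up the node's data directory on the persistent volume, which is why the recreated pod can hit the same postStart failure.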

Please note that the other two RabbitMQ nodes are running fine, serving traffic, and reporting the failed node as up:

rabbitmq-2 rabbitmq 2021-07-04 12:19:07.233 [info] <0.490.0> node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local' up
rabbitmq-1 rabbitmq 2021-07-04 12:19:07.208 [info] <0.494.0> node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local' up 
-- Vaibhav Jain
docker
kubernetes
queue
rabbitmq
stateful

1 Answer

7/5/2021

Running a rollout restart of the StatefulSet worked for me:

kubectl rollout restart statefulset rabbitmq -n devops

After this command the RabbitMQ cluster came back up, and all three nodes joined the cluster without any issue.

Once this is done, the applications connecting to this RabbitMQ cluster need to be restarted.
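To confirm the recovery after the restart, the rollout and the cluster membership can be checked like this (a sketch, assuming the label, pod names, and `devops` namespace from the question's manifest):

```shell
# Watch the pods restart; with podManagementPolicy: OrderedReady
# they are replaced one at a time, in order
kubectl get pods -n devops -l app=rabbitmq -w

# Once all three pods are Running, verify that every node rejoined the cluster
kubectl exec -n devops rabbitmq-0 -- rabbitmqctl cluster_status
```

`cluster_status` should list all three `rabbit@rabbitmq-N.rabbitmq-service.devops.svc.cluster.local` nodes as running.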

-- Amjad Hussain Syed
Source: StackOverflow