RabbitMQ cannot start after upgrading Azure Kubernetes Service (AKS)

7/18/2021

I had the same problem as @Amir Soleimani, but the error output was a bit different. I tried all the solutions in that post, but none of them worked. I'm using Azure Kubernetes Service (AKS), and after upgrading from 1.13.xx to 1.18.xx I can't start RabbitMQ anymore.

UPDATED - Solution that worked for me (please weigh this approach carefully, as it may affect your existing queues):

Remove the current rabbitmq StatefulSet, including its persistent disks.
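
A minimal sketch of that removal, assuming the names from the manifest below (StatefulSet "rabbitmq", volumeClaimTemplate "volume", 3 replicas) and the current kubectl namespace. PVCs created from a volumeClaimTemplate are named <claim>-<statefulset>-<ordinal>. The DRY_RUN variable and the run helper are illustrative, not part of kubectl; by default the script only prints the commands. Deleting the PVCs discards the broker's stored data, and whether the underlying Azure disk is also removed depends on the StorageClass reclaim policy.

```shell
#!/bin/sh
# Names taken from the manifest in this question (assumptions if yours differ).
STS=rabbitmq
CLAIM=volume
REPLICAS=3

# Print one PVC name per replica: <claim>-<statefulset>-<ordinal>.
pvc_names() {
  i=0
  while [ "$i" -lt "$REPLICAS" ]; do
    printf '%s-%s-%s\n' "$CLAIM" "$STS" "$i"
    i=$((i + 1))
  done
}

# Hypothetical helper: with DRY_RUN=1 (the default) only echo the command;
# set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Delete the StatefulSet first, then the claims that hold the old node data.
run kubectl delete statefulset "$STS"
for pvc in $(pvc_names); do
  run kubectl delete pvc "$pvc"
done
```

Run it once as-is to review the printed commands, then again with DRY_RUN=0 once you are sure the queues can be lost.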

========

Here is my StatefulSet file:

apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-management
  labels:
    app: rabbitmq
spec:
  ports:
    - port: 80
      targetPort: 15672
      name: http
  selector:
    app: rabbitmq
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  ports:
    - port: 5672
      name: amqp
    - port: 4369
      name: epmd
    - port: 25672
      name: rabbitmq-dist
  clusterIP: None
  selector:
    app: rabbitmq
---
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-config
  namespace: default
type: Opaque
data:
  erlang.cookie: samplecookie==
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  serviceName: rabbitmq
  selector:
    matchLabels:
      app: rabbitmq
  replicas: 3
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
        - name: rabbitmq
          image: 'rabbitmq:3.6.6-management-alpine'
          lifecycle:
            postStart:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - >
                    if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
                      sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
                      cat /etc/resolv.conf.new > /etc/resolv.conf;
                      rm /etc/resolv.conf.new;
                    fi;
                    until rabbitmqctl node_health_check; do sleep 1; done;
                    if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
                      rabbitmqctl stop_app;
                      rabbitmqctl join_cluster rabbit@rabbitmq-0;
                      rabbitmqctl start_app;
                    fi;
                    rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
          env:
            - name: RABBITMQ_ERLANG_COOKIE
              valueFrom:
                secretKeyRef:
                  name: rabbitmq-config
                  key: erlang.cookie
            - name: RABBITMQ_DEFAULT_USER
              value: username
            - name: RABBITMQ_DEFAULT_PASS
              value: password
          ports:
            - containerPort: 5672
              name: amqp
            - containerPort: 15672
              name: amqp-management
          volumeMounts:
            - mountPath: /var/lib/rabbitmq
              name: volume
  volumeClaimTemplates:
    - metadata:
        name: volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi

Result of kubectl describe pod rabbitmq-0

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-0']

rabbit@rabbitmq-0:
  * connected to epmd (port 4369) on rabbitmq-0
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on rabbitmq-0
  * suggestion: start the node

current node details:
- node name: 'rabbitmq-cli-91@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==

Error: unable to connect to node 'rabbit@rabbitmq-0': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-0']

rabbit@rabbitmq-0:
  * connected to epmd (port 4369) on rabbitmq-0
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on rabbitmq-0
  * suggestion: start the node

current node details:
- node name: 'rabbitmq-cli-26@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==

Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}
Error: rabbit application is not running on node rabbit@rabbitmq-0.
 * Suggestion: start it with "rabbitmqctl start_app" and try again
, message: "Timeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nTimeout: 70.0 seconds ...\nChecking health of node 'rabbit@rabbitmq-0' ...\nError: unable to connect to node 'rabbit@rabbitmq-0': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@rabbitmq-0']\n\nrabbit@rabbitmq-0:\n  * connected to epmd (port 4369) on rabbitmq-0\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on rabbitmq-0\n  * suggestion: start the node\n\ncurrent node details:\n- node name: 'rabbitmq-cli-91@rabbitmq-0'\n- home dir: /var/lib/rabbitmq\n- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==\n\nError: unable to connect to node 'rabbit@rabbitmq-0': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@rabbitmq-0']\n\nrabbit@rabbitmq-0:\n  * connected to epmd (port 4369) on rabbitmq-0\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on rabbitmq-0\n  * suggestion: start the node\n\ncurrent node details:\n- node name: 'rabbitmq-cli-26@rabbitmq-0'\n- home dir: /var/lib/rabbitmq\n- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==\n\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: 
{aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: {aborted,{no_exists,[rabbit_vhost,[{{vhost,'$1','_'},[],['$1']}]]}}\nError: rabbit application is not running on node rabbit@rabbitmq-0.\n * Suggestion: start it with \"rabbitmqctl start_app\" and try again\n"
  Warning  FailedPostStartHook  23m  kubelet  Exec lifecycle hook ([/bin/sh -c if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
  cat /etc/resolv.conf.new > /etc/resolv.conf;
  rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
  rabbitmqctl stop_app;
  rabbitmqctl join_cluster rabbit@rabbitmq-0;
  rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_default(3ac91d73-de7b-4cde-81f6-c31bacd10252)" failed - error: command '/bin/sh -c if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
  cat /etc/resolv.conf.new > /etc/resolv.conf;
  rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
  rabbitmqctl stop_app;
  rabbitmqctl join_cluster rabbit@rabbitmq-0;
  rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to connect to node 'rabbit@rabbitmq-0': nodedown
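
A note on the "exited with 137" in the hook failure above: by shell convention an exit status above 128 means the process was terminated by signal (status - 128), so 137 corresponds to signal 9 (SIGKILL). The hook was probably killed because the container went down while the until loop was still polling, though that is an inference from the logs. A small sketch of the decoding (describe_status is a hypothetical helper name):

```shell
#!/bin/sh
# Hypothetical helper: explain an exit status the way a shell interprets it.
describe_status() {
  status=$1
  if [ "$status" -gt 128 ]; then
    # Statuses above 128 encode "terminated by signal (status - 128)".
    sig=$((status - 128))
    echo "killed by signal $sig ($(kill -l "$sig"))"
  else
    echo "exited with status $status"
  fi
}

describe_status 137
```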

Result of kubectl logs rabbitmq-0

=CRASH REPORT==== 18-Jul-2021::11:06:01 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.156.0>
    registered_name: []
    exception exit: {{timeout_waiting_for_tables,
                         [rabbit_user,rabbit_user_permission,rabbit_vhost,
                          rabbit_durable_route,rabbit_durable_exchange,
                          rabbit_runtime_parameters,rabbit_durable_queue]},
                     {rabbit,start,[normal,[]]}}
      in function  application_master:init/4 (application_master.erl, line 134)
    ancestors: [<0.155.0>]
    messages: [{'EXIT',<0.157.0>,normal}]
    links: [<0.155.0>,<0.31.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 98
  neighbours:

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: rabbit
    exited: {{timeout_waiting_for_tables,
                 [rabbit_user,rabbit_user_permission,rabbit_vhost,
                  rabbit_durable_route,rabbit_durable_exchange,
                  rabbit_runtime_parameters,rabbit_durable_queue]},
             {rabbit,start,[normal,[]]}}
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: amqp_client
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: rabbit_common
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: xmerl
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: os_mon
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: inets
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: asn1
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: syntax_tools
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: mnesia
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: crypto
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: ranch
    exited: stopped
    type: temporary

=INFO REPORT==== 18-Jul-2021::11:06:01 ===
    application: compiler
    exited: stopped
    type: temporary


BOOT FAILED
===========

Timeout contacting cluster nodes: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2'].

BACKGROUND
==========

This cluster node was shut down while other nodes were still running.
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2']

rabbit@rabbitmq-1:
  * unable to connect to epmd (port 4369) on rabbitmq-1: nxdomain (non-existing domain)

rabbit@rabbitmq-2:
  * unable to connect to epmd (port 4369) on rabbitmq-2: nxdomain (non-existing domain)


current node details:
- node name: 'rabbit@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==



=INFO REPORT==== 18-Jul-2021::11:06:01 ===
Timeout contacting cluster nodes: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2'].

BACKGROUND
==========

This cluster node was shut down while other nodes were still running.
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-1','rabbit@rabbitmq-2']

rabbit@rabbitmq-1:
  * unable to connect to epmd (port 4369) on rabbitmq-1: nxdomain (non-existing domain)

rabbit@rabbitmq-2:
  * unable to connect to epmd (port 4369) on rabbitmq-2: nxdomain (non-existing domain)


current node details:
- node name: 'rabbit@rabbitmq-0'
- home dir: /var/lib/rabbitmq
- cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==


{"init terminating in do_boot",timeout_waiting_for_tables}
init terminating in do_boot (timeout_waiting_for_tables)

Crash dump is being written to: erl_crash.dump...

What I tried that didn't work:

rabbitmqctl stop_app
rabbitmqctl force_boot
Removing the StatefulSet and re-installing it
Re-configuring the YAML file
-- Nguyen Thanh
azure
kubernetes
kubernetes-cluster
rabbitmq

1 Answer

7/21/2021

Please try a force boot in the postStart script:

...
fi;

if [[ "$HOSTNAME" == "rabbitmq-0" ]]; then
  rabbitmqctl stop_app;
  rabbitmqctl force_boot;
fi;

until rabbitmqctl node_health_check; do sleep 1; done; ...
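
The guard above hard-codes the pod name rabbitmq-0. That pod is presumably singled out because, with the default ordered pod management, a StatefulSet creates rabbitmq-0 first and its peers do not exist yet (hence the nxdomain entries in the question's log). An equivalent check on the pod ordinal could be sketched as follows; pod_ordinal and is_first_pod are hypothetical helper names, and the echo lines stand in for the real rabbitmqctl calls.

```shell
#!/bin/sh
# Hypothetical helper: extract the StatefulSet ordinal from a pod hostname
# such as "rabbitmq-0" by stripping everything up to the last "-".
pod_ordinal() {
  echo "${1##*-}"
}

# Succeeds only for the first pod (ordinal 0) of the StatefulSet.
is_first_pod() {
  [ "$(pod_ordinal "$1")" = "0" ]
}

if is_first_pod "${HOSTNAME:-}"; then
  echo "first pod: would run rabbitmqctl force_boot here"
else
  echo "not the first pod: skipping force_boot"
fi
```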

-- LarryX
Source: StackOverflow