k8s, RabbitMQ, and Peer Discovery

1/22/2019

We are trying to run an instance of the RabbitMQ chart with Helm from the helm/charts/stable/rabbitmq project. I had it running perfectly, but then I had to restart k8s for some maintenance. Now we are completely unable to launch the RabbitMQ chart in any way, shape, or form. I am not even trying to run the chart with any variables, i.e. just the default values.

Here is all I am doing:

helm install stable/rabbitmq

I have confirmed I can run the chart with the defaults on my local k8s, which I'm running with Docker for Desktop. But when we run the RabbitMQ chart on our shared k8s the exact same way as on the desktop, and the same way we did before the restart, the following error is thrown:

Failed to get nodes from k8s - 503
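
For context, the peer discovery plugin gets the node list by calling the Kubernetes API, so the 503 is coming back from that call. A rough way to replay it by hand, assuming the default service account token path and a placeholder release name, run from inside the RabbitMQ pod (or any pod using its service account), is:

# Sketch: replay the API call the k8s peer discovery plugin makes
# (<release> is a placeholder for the Helm release name)
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sk -H "Authorization: Bearer $TOKEN" \
  https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/default/endpoints/<release>-rabbitmq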

I have also posted an issue on the Helm charts repo on GitHub (helm/charts#10811).

We suspect DNS, but are unable to confirm anything yet. What is very frustrating is that after the restart every single other chart we had installed came back up perfectly, except RabbitMQ, which now will not start at all.
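
If it helps to rule DNS in or out, a quick in-cluster lookup (the busybox image and pod name here are just examples) would be:

# Resolve the API server's in-cluster name from a throwaway pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- \
  nslookup kubernetes.default.svc.cluster.local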

Does anyone know what I could do to get RabbitMQ's peer discovery to work? Has anyone seen an issue like this after restarting k8s?

-- mr haven
kubernetes
kubernetes-helm
rabbitmq

2 Answers

1/24/2019

I'm wondering what the reason is for using the rabbit_peer_discovery_k8s plugin, if values.yaml defaults to 1 replica (your manifest file does not override this setting)?
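
If clustering is actually intended, the replica count would need to be raised, for example along these lines (I'm assuming the chart's replicas value here; double-check the name against the chart's values.yaml):

# Run more than one node so peer discovery has peers to find
helm install stable/rabbitmq --set replicas=3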

I tried to reproduce your issue with the override values you provided (dev-server.yaml), as per the details in your GitHub issue #10811, but I somewhat failed. Here are my observations:

  1. If I install the RabbitMQ chart with your custom values, my rabbitmq-dev-default-0 pod gets stuck in the CrashLoopBackOff state. It's quite hard for me to troubleshoot further, as Bitnami's rabbitmq image containers, used by this RabbitMQ Helm chart, ship with a non-root account.
  2. On the other hand, if the rabbitmq chart is installed on my Kubernetes cluster (v1.13.2) in its simplest form:

helm install stable/rabbitmq

then I observe a similar issue. I mean that the rabbitmq server survives a simulated VM restart of all cluster nodes (including the master), but I cannot connect to it from outside:

Post VM restart, I'm getting the following error from my Python MQ client:

socket.gaierror: [Errno -2] Name or service not known

A few remarks here:

Yes, I did the port-forward(s) as per the instructions printed by the "helm status" command:
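
Roughly what those instructions suggest (the exact Service name comes from the helm status output; the one below is a placeholder):

# Forward the AMQP and management ports via the chart's Service
kubectl port-forward --namespace default svc/<release>-rabbitmq 5672:5672 15672:15672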

The readiness probe works fine:

curl -sS -f --user user:<my_pwd> 127.0.0.1:15672/api/healthchecks/node
{"status":"ok"}

Connectivity from rabbitmqctl to rabbitmq-server inside the container works fine too:

kubectl exec rabbitmq-dev-default-0 -- rabbitmqctl list_queues
warning: the VM is running with native name encoding of latin1 which may cause Elixir to malfunction as it expects utf8. Please ensure your locale is set to UTF-8 (which can be verified by running "locale" in your shell)
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name    messages
hello   11

From the moment I used kubectl port-forward to the pod instead of the service, connectivity to the rabbitmq server was restored:

kubectl port-forward --namespace default pod/rabbitmq-dev-default-0 5672:5672

$ python send.py
 [x] Sent 'Hello World!'
-- Nepomucen
Source: StackOverflow

1/25/2019

So I actually got RabbitMQ to run. It turns out my issue was that the k8s peer discovery could not connect over the default port 443; I had to use the external port 6443, because kubernetes.default.svc.cluster.local resolved to the public port and could not find the internal one. So yes, our config is messed up too.
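
For anyone hitting the same mismatch, a quick way to confirm it is to compare the port advertised by the in-cluster API Service with the port the apiserver endpoint actually listens on:

# The Service usually advertises 443, while the endpoint shows the apiserver's real port (e.g. 6443)
kubectl get svc kubernetes -n default
kubectl get endpoints kubernetes -n default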

It took me a while to realize that the setting below was not being overridden when I passed it with helm install . -f server-values.yaml.

rabbitmq:
  configuration: |-
    ## Clustering
    cluster_formation.peer_discovery_backend  = rabbit_peer_discovery_k8s
    cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
    cluster_formation.k8s.port = 6443
    cluster_formation.node_cleanup.interval = 10
    cluster_formation.node_cleanup.only_log_warning = true
    cluster_partition_handling = autoheal
    # queue master locator
    queue_master_locator=min-masters
    # enable guest user
    loopback_users.guest = false

I had to add cluster_formation.k8s.port = 6443 to the chart's main values.yaml file instead of my own. Once the port was changed specifically in that values.yaml, RabbitMQ started right up.
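
A way to catch this kind of silent non-override is to render the chart locally and check whether the setting lands in the generated config (a sketch, assuming Helm 2's template command and the same chart directory and override file as above):

# Render the templates and confirm the override made it into the ConfigMap
helm template . -f server-values.yaml | grep -n "cluster_formation.k8s.port"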

-- mr haven
Source: StackOverflow