Ansible AWX RabbitMQ container in Kubernetes: "Failed to get nodes from k8s" with nxdomain

7/9/2018

I am trying to install Ansible AWX on my Kubernetes cluster, but the RabbitMQ container keeps crashing with a "Failed to get nodes from k8s" error (nxdomain).

Below are the versions of the platforms I am using:

[node1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5", 
GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean", 
BuildDate:"2018-03-19T15:50:45Z", GoVersion:"go1.9.3", Compiler:"gc", 
Platform:"linux/amd64"}

Kubernetes is deployed via the kubespray playbook v2.5.0, and all the services and pods are up and running (CoreDNS, Weave, iptables).

I am deploying AWX via the 1.0.6 release using the 1.0.6 images for awx_web and awx_task.

I am using an external PostgreSQL database at v10.4 and have verified that AWX is creating its tables in the database.

Troubleshooting steps I have tried:

  • I deployed AWX 1.0.5 with the etcd pod to the same cluster, and it worked as expected.
  • I deployed a standalone RabbitMQ cluster in the same k8s cluster, mimicking the AWX rabbit deployment as closely as possible, and it works with the rabbit_peer_discovery_k8s backend.
  • I tried tweaking some of the rabbitmq.conf settings for AWX 1.0.6 with no luck; it just keeps throwing the same error.
  • I verified that the /etc/resolv.conf file has the kubernetes.default.svc.cluster.local entry.

Cluster Info

[node1 ~]# kubectl get all -n awx
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/awx   1         1         1            0           38m

NAME                DESIRED   CURRENT   READY     AGE
rs/awx-654f7fc84c   1         1         0         38m

NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/awx   1         1         1            0           38m

NAME                DESIRED   CURRENT   READY     AGE
rs/awx-654f7fc84c   1         1         0         38m

NAME                      READY     STATUS             RESTARTS   AGE
po/awx-654f7fc84c-9ppqb   3/4       CrashLoopBackOff   11         38m

NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                          AGE
svc/awx-rmq-mgmt   ClusterIP   10.233.10.146   <none>        15672/TCP                        1d
svc/awx-web-svc    NodePort    10.233.3.75     <none>        80:31700/TCP                     1d
svc/rabbitmq       NodePort    10.233.37.33    <none>        15672:30434/TCP,5672:31962/TCP   1d

AWX RabbitMQ error log

[node1 ~]# kubectl logs -n awx awx-654f7fc84c-9ppqb awx-rabbit
2018-07-09 14:47:37.464 [info] <0.33.0> Application lager started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application os_mon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application crypto started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application cowlib started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application xmerl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application mnesia started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application recon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application jsx started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application asn1 started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application public_key started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.897 [info] <0.33.0> Application ssl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch_proxy_protocol started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application rabbit_common started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.907 [info] <0.33.0> Application amqp_client started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.909 [info] <0.33.0> Application cowboy started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.957 [info] <0.33.0> Application inets started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.964 [info] <0.193.0>
 Starting RabbitMQ 3.7.4 on Erlang 20.1.7
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/

  ##  ##
  ##  ##      RabbitMQ 3.7.4. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2018-07-09 14:47:37.982 [info] <0.193.0>
 node           : rabbit@10.233.120.5
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : at619UOZzsenF44tSK3ulA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5
2018-07-09 14:47:39.649 [info] <0.201.0> Memory high watermark set to 11998 MiB (12581714329 bytes) of 29997 MiB (31454285824 bytes) total
2018-07-09 14:47:39.652 [info] <0.203.0> Enabling free disk space monitoring
2018-07-09 14:47:39.653 [info] <0.203.0> Disk free limit set to 50MB
2018-07-09 14:47:39.658 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-07-09 14:47:39.658 [info] <0.206.0> FHC read buffering:  OFF
2018-07-09 14:47:39.658 [info] <0.206.0> FHC write buffering: ON
2018-07-09 14:47:39.660 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-07-09 14:47:39.660 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-07-09 14:47:39.665 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
                 {inet,[inet],nxdomain}]}
2018-07-09 14:47:39.665 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-07-09 14:47:39.666 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n                 {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,793}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Kubernetes API service

[node1 ~]# kubectl describe service kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.233.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.237.34.19:6443,10.237.34.21:6443
Session Affinity:  ClientIP
Events:            <none>

nslookup from a busybox in the same kubernetes cluster

[node2 ~]# kubectl exec -it busybox -- sh
/ # nslookup  kubernetes.default.svc.cluster.local
Server:    10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local

Please let me know if I have left out anything that would help with troubleshooting.

-- kaylor
ansible-awx
ansible-tower
kubernetes
rabbitmq

1 Answer

7/10/2018

I believe the solution is to omit the explicit Kubernetes host from the RabbitMQ configuration. I can't think of any good reason to specify the Kubernetes API host from inside the cluster.

If for some terrible reason the RMQ plugin requires it, then try swapping in the Service IP (assuming your SSL cert for the master has its Service IP in the SANs list).
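
If you do go the Service IP route, a minimal sketch of the change follows. It assumes the peer discovery plugin takes its API host from the cluster_formation.k8s.host key in rabbitmq.conf, and that you are editing a local copy of that file before redeploying. set_k8s_host is a hypothetical helper name, and 10.233.0.1 is the ClusterIP shown in the question's kubectl describe output.

```shell
# Hypothetical helper: point the k8s peer discovery plugin at a given API host.
# Replaces an existing cluster_formation.k8s.host line, or appends one if absent.
set_k8s_host() {
  conf_file="$1"
  new_host="$2"
  if grep -q '^cluster_formation\.k8s\.host' "${conf_file}"; then
    # Write to a temp file and move it into place (avoids non-portable sed -i).
    sed "s|^cluster_formation\.k8s\.host.*|cluster_formation.k8s.host = ${new_host}|" \
      "${conf_file}" > "${conf_file}.tmp" && mv "${conf_file}.tmp" "${conf_file}"
  else
    echo "cluster_formation.k8s.host = ${new_host}" >> "${conf_file}"
  fi
}
```

Usage would look like set_k8s_host ./rabbitmq.conf 10.233.0.1, followed by redeploying whatever ConfigMap or template carries the file. This only helps if the API server certificate lists that IP among its SANs, as noted above.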


As for why the lookup is failing at all, the only explanation I can think of is that the RMQ PodSpec has somehow ended up with a dnsPolicy other than ClusterFirst. If you truly wish to troubleshoot the RMQ Pod, you can give the container an explicit command: that first runs some debugging shell commands to interrogate the state of the container at launch, and then runs exec /launch.sh to resume booting RMQ (as they do).
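
A rough sketch of such a debugging wrapper is below. It assumes the image ships a POSIX shell and getent, which I have not verified for the AWX RabbitMQ image; dump_dns_state is a hypothetical helper name. It prints the resolver configuration the container actually received and then attempts the same lookup that fails in the log.

```shell
# Hypothetical debug helper: show the resolver config and try the failing lookup.
dump_dns_state() {
  resolv_file="${1:-/etc/resolv.conf}"
  api_host="${2:-kubernetes.default.svc.cluster.local}"

  echo "--- ${resolv_file} ---"
  cat "${resolv_file}"

  echo "--- lookup ${api_host} ---"
  # getent goes through the libc resolver, so it uses the nameserver and
  # search settings printed above.
  if getent hosts "${api_host}"; then
    echo "lookup OK"
  else
    echo "lookup FAILED"
  fi
}

# Intended use as the container command (not executed here):
#   dump_dns_state; exec /launch.sh
```

With ClusterFirst in effect, the nameserver line should match the cluster DNS Service (10.233.0.3 in the busybox output from the question). The dnsPolicy itself can be read from outside the pod with kubectl get pod -n awx awx-654f7fc84c-9ppqb -o jsonpath='{.spec.dnsPolicy}'.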

-- mdaniel
Source: StackOverflow