K
Q

Question

Rook Ceph Operator hangs when checking for cluster status

6/11/2019

I've setup a k8s cluster on digital ocean Ubuntu 18.04 LTS droplets using calico on top of wireguard vpn, and was able to setup nginx-ingress with traefik as external LB. I'm now on the step of setting up distributed storage using rook ceph, by following the quickstart at https://rook.io/docs/rook/master/ceph-quickstart.html, but it seems like the monitors never reach a quorum (even when its just one). Actually, monitor a reaches by itself, but neither the operator or any other monitors seem to know that, and the operator hangs when trying to check the status.

I've tried troubleshooting network issues, all the way from wireguard, calico and ufw. I've even set ufw to temporarily allow all traffic by default just to make sure I wasn't allowing one port but the traffic was on another interface (i have wg0, eth1, tunl0 and the calico interfaces).

The I followed the ceph troubleshooting guide unsuccessfully: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#recovering-a-monitor-s-broken-monmap

I've been 4 days at this and I'm out of solutions.

Here's how I setup the storage cluster

cd cluster/examples/kubernetes/ceph
kubectl apply -f common.yaml
kubectl apply -f operator.yaml
kubectl apply -f cluster-test.yaml

Running kubectl get pods returns

NAME                                      READY   STATUS    RESTARTS   AGE
pod/rook-ceph-agent-9ws2p                 1/1     Running   0          24s
pod/rook-ceph-agent-v6v9n                 1/1     Running   0          24s
pod/rook-ceph-agent-x2jv4                 1/1     Running   0          24s
pod/rook-ceph-mon-a-74cc6db5c8-8s5l5      1/1     Running   0          9s
pod/rook-ceph-operator-7cd5d8bd4c-pclxp   1/1     Running   0          25s
pod/rook-discover-24cfj                   1/1     Running   0          24s
pod/rook-discover-6xsnp                   1/1     Running   0          24s
pod/rook-discover-hj4tc                   1/1     Running   0          24s

However, when I try to check the status of the monitors, from the operator pod I get:

#This hangs forever
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp ceph status

#This hangs foverer
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp ceph ping mon.a

#This returns [errno 2] error calling ping_monitor
#Which I guess should, becasue mon.b does/should not exist
#But I expected a response such as mon.b does not exist
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp ceph ping mon.b