Kubernetes pods can't connect between machines

1/6/2016

I used the node.yaml and master.yaml files from http://kubernetes.io/v1.1/docs/getting-started-guides/coreos/coreos_multinode_cluster.html to create a multi-node cluster on three bare-metal machines running CoreOS. However, pods on different nodes can’t communicate with each other. I’m at a loss; I’d appreciate any pointers or suggestions.

I have three pods running rabbitmq:

thuey:~ thuey$ kbg pods | grep rabbitmq
rabbitmq-bootstrap     1/1       Running   0          3h
rabbitmq-jz2q7         1/1       Running   0          3h
rabbitmq-mrnfc         1/1       Running   0          3h

Two of the pods are on one machine:

kbd node jolt-server-3 | grep rabbitmq
thuey               rabbitmq-bootstrap      0 (0%)      0 (0%)      0 (0%)      0 (0%)
thuey               rabbitmq-jz2q7          0 (0%)      0 (0%)      0 (0%)      0 (0%)

And the other pod is on another machine:

thuey:~ thuey$ kbd node jolt-server-4 | grep rabbitmq
thuey               rabbitmq-mrnfc          0 (0%)      0 (0%)      0 (0%)      0 (0%)

I can successfully ping from rabbitmq-bootstrap to rabbitmq-jz2q7:

root@rabbitmq-bootstrap:/# ping 172.17.0.5
PING 172.17.0.5 (172.17.0.5) 56(84) bytes of data.
64 bytes from 172.17.0.5: icmp_seq=1 ttl=64 time=0.058 ms
64 bytes from 172.17.0.5: icmp_seq=2 ttl=64 time=0.035 ms
64 bytes from 172.17.0.5: icmp_seq=3 ttl=64 time=0.064 ms
64 bytes from 172.17.0.5: icmp_seq=4 ttl=64 time=0.055 ms
^C
--- 172.17.0.5 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms
rtt min/avg/max/mdev = 0.035/0.053/0.064/0.010 ms

But I can't ping rabbitmq-mrnfc:

root@rabbitmq-bootstrap:/# ping 172.17.0.8
PING 172.17.0.8 (172.17.0.8) 56(84) bytes of data.
From 172.17.0.2 icmp_seq=1 Destination Host Unreachable
From 172.17.0.2 icmp_seq=2 Destination Host Unreachable
From 172.17.0.2 icmp_seq=3 Destination Host Unreachable
From 172.17.0.2 icmp_seq=4 Destination Host Unreachable
^C
--- 172.17.0.8 ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4000ms
pipe 4
-- thuey
Tags: coreos, kubernetes

2 Answers

1/6/2016

It turns out the problem was that Docker was starting before flannel. As a result, Docker's bridge IP (--bip) was left at its default 172.17.x.x range, while flannel was allocating pod subnets out of a 10.x.x.x range, so cross-host traffic never went over the flannel overlay. To fix it, I added a dependency on flannel to docker.service in my cloud-config (sketch below). Some helpful links:

https://groups.google.com/forum/#!topic/coreos-user/KKnV1lA-ULs
https://github.com/coreos/flannel/issues/246
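
For reference, a minimal sketch of the kind of cloud-config drop-in I mean (the flannel unit name, flanneld.service, and the drop-in filename are assumptions; adjust to match your setup):

#cloud-config
coreos:
  units:
    - name: docker.service
      drop-ins:
        - name: 40-flannel.conf
          content: |
            [Unit]
            # Ensure flannel is up before Docker configures its bridge
            Requires=flanneld.service
            After=flanneld.service

After rebooting, you can check that Docker picked up flannel's subnet by comparing the FLANNEL_SUBNET in /run/flannel/subnet.env with the address on docker0 (ip addr show docker0); they should match instead of the default 172.17.0.1/16.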

-- thuey
Source: StackOverflow

1/6/2016

The guide you used doesn't include networking setup for bare-metal machines. You need an overlay network (e.g., flannel, Calico) that implements the Kubernetes networking model. You can check the table of solutions in the getting-started guides for the different IaaS/OS/networking combinations.
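
If you go with flannel, the gist is that flanneld on each node needs a cluster-wide network config stored in etcd, and Docker must then use the per-node subnet flannel hands out. A rough sketch, assuming etcd2-era etcdctl and flannel's default key (the 10.2.0.0/16 CIDR is only an example and must not overlap your host network):

# One-time: tell flannel which overlay network to carve pod subnets from
etcdctl set /coreos.com/network/config '{ "Network": "10.2.0.0/16" }'

# On each node, flanneld writes its lease here; Docker should start after
# this file exists and use the subnet it contains (e.g., via --bip)
cat /run/flannel/subnet.env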

-- Yu-Ju Hong
Source: StackOverflow