I have a network issue on my cluster. At first I thought it was a routing problem, but I've since discovered that outgoing packets from the cluster may not be getting rewritten with the node IP when they leave the node.
Background: I have two clusters. I set up the first one (months ago) manually using this guide and it worked great. The second one I have built multiple times as I created/debugged Ansible scripts to automate what I did by hand for the first cluster.
Cluster2 is the one with the network issue: I can get to pods on other nodes, but I can't get to anything on my regular network. I tcpdump'd the physical interface on node0 in cluster2 while pinging from a busybox pod, and the 172.16.0.x internal pod IP shows up at that interface as the source IP - and my network outside the node has no idea what to do with it. On cluster1 the same test shows the node IP in place of the pod IP, which is how I assume it should work.
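For reference, this is roughly the test I was running (the interface name and the target IP here are just examples - adjust for your setup):

# on the node, watch the physical interface for ICMP and check the source IP
tcpdump -ni eth0 icmp

# from a throwaway busybox pod, ping something outside the cluster
kubectl run busybox --rm -it --image=busybox -- ping -c 3 192.168.1.1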
My question is: how can I troubleshoot this? Any ideas would be great as I have been at this for several days now. Even the obvious is welcome, since I can no longer see the forest for the trees... i.e. both clusters look the same everywhere I know how to check :)
Caveat to "my clusters are the same": cluster1 is running kubectl 1.16, cluster2 is running 1.18.
----edit after @Matt dropped some kube-proxy knowledge on me----
I did not know that the kube-proxy rules could just be read with the iptables command! Awesome!
I think my problem is those 10.x addresses in the broken cluster. I don't even know where they came from, as they are not in any of my Ansible config scripts or kubeadm init files... I use all 172's in my configs.
I do pull some configs direct from source (flannel and the CSI/CPI stuff), so I'll pull those down and inspect them to see if the 10's are in there... Hopefully it's in the flannel defaults or something and I can just change that yml file!
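Something like this is what I mean by pulling it down and inspecting it (the coreos/flannel URL is just the usual location for kube-flannel.yml - substitute whatever your guide references):

# grab the flannel manifest and look for any 10.x subnets
curl -sO https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
grep -n "10\." kube-flannel.yml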
cluster1 working:
[root@k8s-master ~]# iptables -t nat -vnL| grep POSTROUTING -A5
Chain POSTROUTING (policy ACCEPT 22 packets, 1346 bytes)
pkts bytes target prot opt in out source destination
6743K 550M KUBE-POSTROUTING all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */
0 0 MASQUERADE all -- * !docker0 172.17.0.0/16 0.0.0.0/0
3383K 212M RETURN all -- * * 172.16.0.0/16 172.16.0.0/16
117K 9002K MASQUERADE all -- * * 172.16.0.0/16 !224.0.0.0/4
0 0 RETURN all -- * * !172.16.0.0/16 172.16.0.0/24
0 0 MASQUERADE all -- * * !172.16.0.0/16 172.16.0.0/16
cluster2 - not working:
[root@testvm-master ~]# iptables -t nat -vnL | grep POSTROUTING -A5
Chain POSTROUTING (policy ACCEPT 1152 packets, 58573 bytes)
pkts bytes target prot opt in out source destination
719K 37M KUBE-POSTROUTING all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */
0 0 RETURN all -- * * 10.244.0.0/16 10.244.0.0/16
0 0 MASQUERADE all -- * * 10.244.0.0/16 !224.0.0.0/4
131K 7849K RETURN all -- * * !10.244.0.0/16 172.16.0.0/24
0 0 MASQUERADE all -- * * !10.244.0.0/16 10.244.0.0/16
Boom! @Matt's advice for the win.
Using iptables to verify the NAT rules that flannel was applying did the trick. I was able to find the 10.244 subnet in the flannel config that was referenced in the guide I was using.
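For anyone else hitting this, the subnet is defined in the net-conf.json ConfigMap inside kube-flannel.yml - it looks roughly like this (exact contents may differ by flannel version):

grep -A6 net-conf.json kube-flannel.yml
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }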
I had two options: 1. download and alter the flannel yaml before deploying the CNI, or 2. make my kubeadm init subnet declaration match what flannel has.
I went with option 2 because I don't want to alter the flannel config every time... I just want to pull down their latest file and run with it. This worked quite nicely to resolve my issue.
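Concretely, option 2 just means telling kubeadm to use flannel's default pod network at init time (any other flags/config you already use stay the same):

# init the control plane with flannel's default pod network
kubeadm init --pod-network-cidr=10.244.0.0/16
# or, if you use a kubeadm config file, set networking.podSubnet: 10.244.0.0/16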