Source IP address translation for intra-cluster traffic

10/26/2018

I'm trying to dive into the K8s networking model, and I think I have a pretty good understanding of it so far, but there is one thing I can't get my head around. In the Cluster Networking guide, the following is mentioned:

Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):

  • all containers can communicate with all other containers without NAT
  • all nodes can communicate with all containers (and vice-versa) without NAT
  • the IP that a container sees itself as is the same IP that others see it as

The second bullet point specifies that cross-node container communication should be possible without NAT. This is however not true when kube-proxy runs in iptables mode. This is the dump of the iptables from one of my nodes:

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
KUBE-POSTROUTING  all  --  anywhere             anywhere             /* kubernetes postrouting rules */

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination         
MASQUERADE  all  --  anywhere             anywhere             /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000

/* sample target pod chain being marked for MASQ */
Chain KUBE-SEP-2BKJZA32HM354D5U (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  xx.yyy.zzz.109       anywhere             /* kube-system/heapster: */
DNAT       tcp  --  anywhere             anywhere             /* kube-system/heapster: */ tcp to:xx.yyy.zzz.109:8082

Chain KUBE-MARK-MASQ (156 references)
target     prot opt source               destination         
MARK       all  --  anywhere             anywhere             MARK or 0x4000
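If I model what these two chains do in a few lines of Python (my own simplification; the 0x4000 constant comes from the dump above, the function names are mine), the interplay looks like this:

```python
# Hypothetical sketch of the KUBE-MARK-MASQ / KUBE-POSTROUTING interplay:
# KUBE-MARK-MASQ ORs 0x4000 into the packet's fwmark, and the POSTROUTING
# rule masquerades only packets whose mark matches 0x4000/0x4000.
KUBE_MARK = 0x4000

def mark_masq(fwmark: int) -> int:
    """KUBE-MARK-MASQ: MARK or 0x4000."""
    return fwmark | KUBE_MARK

def needs_snat(fwmark: int) -> bool:
    """KUBE-POSTROUTING: MASQUERADE if mark match 0x4000/0x4000."""
    return (fwmark & KUBE_MARK) == KUBE_MARK

fwmark = 0x0                # packet enters with no mark
fwmark = mark_masq(fwmark)  # traverses KUBE-MARK-MASQ
print(needs_snat(fwmark))   # True: this packet will be masqueraded
```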

Looks like K8s is changing the source IP of marked outbound packets to the node's IP (for a ClusterIP service). And they even explicitly mention this in Source IP for Services with Type=ClusterIP:

Packets sent to ClusterIP from within the cluster are never source NAT’d if you’re running kube-proxy in iptables mode, which is the default since Kubernetes 1.2. If the client pod and server pod are in the same node, the client_address is the client pod’s IP address. However, if the client pod and server pod are in different nodes, the client_address is the client pod’s node flannel IP address.

This starts by saying packets within the cluster are never SNAT'd but then proceeds to say packets sent to pods on other nodes are in fact SNAT'd. I'm confused about this - am I misinterpreting the all nodes can communicate with all containers (and vice-versa) without NAT requirement somehow?

-- PoweredByOrange
kube-proxy
kubernetes

1 Answer

10/27/2018

If you read point 2:

Pod-to-Pod communications: this is the primary focus of this document.

This still applies to all the containers and pods running in your cluster, because all of them are in the PodCidr:

  • all containers can communicate with all other containers without NAT
  • all nodes can communicate with all containers (and vice-versa) without NAT
  • the IP that a container sees itself as is the same IP that others see it as

Basically, all pods have unique IP addresses, live in the same flat address space, and can talk to each other at the IP layer.

Also, if you look at the routes on one of your Kubernetes nodes you'll see something like this for Calico, where the podCidr is 192.168.0.0/16:

default via 172.0.0.1 dev ens5 proto dhcp src 172.0.1.10 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.31.0.0/20 dev ens5 proto kernel scope link src 172.0.1.10
172.31.0.1 dev ens5 proto dhcp scope link src 172.0.1.10 metric 100
blackhole 192.168.0.0/24 proto bird
192.168.0.42 dev calixxxxxxxxxxx scope link
192.168.0.43 dev calixxxxxxxxxxx scope link
192.168.4.0/24 via 172.0.1.6 dev tunl0 proto bird onlink
192.168.7.0/24 via 172.0.1.55 dev tunl0 proto bird onlink
192.168.8.0/24 via 172.0.1.191 dev tunl0 proto bird onlink
192.168.9.0/24 via 172.0.1.196 dev tunl0 proto bird onlink
192.168.11.0/24 via 172.0.1.147 dev tunl0 proto bird onlink

You can see that packets with a 192.168.x.x destination are forwarded directly to a tunnel interface connected to the other nodes, so no NAT there.
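The route lookup the kernel does for those entries can be sketched with the stdlib ipaddress module (a simplification I wrote for illustration; the routes and next hops are taken from the table above, the `lookup` helper is hypothetical):

```python
import ipaddress

# Sketch of the longest-prefix match performed against the Calico routes
# above: a pod IP in 192.168.4.0/24 resolves to the tunl0 route, so the
# packet is tunneled to the peer node without NAT.
routes = {
    ipaddress.ip_network("0.0.0.0/0"): "ens5 (default)",
    ipaddress.ip_network("192.168.0.0/24"): "blackhole (local pod CIDR)",
    ipaddress.ip_network("192.168.4.0/24"): "tunl0 via 172.0.1.6",
    ipaddress.ip_network("192.168.7.0/24"): "tunl0 via 172.0.1.55",
}

def lookup(dst: str) -> str:
    ip = ipaddress.ip_address(dst)
    # The most specific (longest prefix) matching route wins.
    best = max((n for n in routes if ip in n), key=lambda n: n.prefixlen)
    return routes[best]

print(lookup("192.168.4.10"))  # tunl0 via 172.0.1.6 -- no NAT involved
print(lookup("8.8.8.8"))       # ens5 (default)
```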

Now, when you are connecting from outside the PodCidr, your packets are definitely NATed, say through services or through an external host. You will also see iptables rules like this:

# Generated by iptables-save v1.6.1 on Sat Oct 27 00:22:39 2018
*nat
:PREROUTING ACCEPT [65:5998]
:INPUT ACCEPT [1:60]
:OUTPUT ACCEPT [28:1757]
:POSTROUTING ACCEPT [61:5004]
:DOCKER - [0:0]
:KUBE-MARK-DROP - [0:0]
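To answer the original confusion directly: a minimal sketch of the client_address a server pod observes for ClusterIP traffic (my simplification of the behavior described in the docs quoted in the question, not actual kube-proxy logic) would be:

```python
# Hypothetical summary of the source IP a server pod observes for ClusterIP
# traffic: a same-node client keeps its pod IP, while a cross-node client is
# SNAT'd to the sending node's (e.g. flannel) IP.
def observed_client_address(client_pod_ip: str, client_node_ip: str,
                            same_node: bool) -> str:
    if same_node:
        return client_pod_ip   # no SNAT needed, pod IP is preserved
    return client_node_ip      # masqueraded to the node IP

print(observed_client_address("192.168.0.42", "172.0.1.10", same_node=True))
# -> 192.168.0.42
print(observed_client_address("192.168.0.42", "172.0.1.10", same_node=False))
# -> 172.0.1.10
```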
-- Rico
Source: StackOverflow