I looked into the iptables rules used by kube-dns, and I'm a little confused by the sub chain KUBE-SEP-V7KWRXXOBQHQVWAT. Its content is below:

# iptables -t nat -L KUBE-SEP-V7KWRXXOBQHQVWAT
Chain KUBE-SEP-V7KWRXXOBQHQVWAT (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 172.18.1.5 anywhere /* kube-system/kube-dns:dns-tcp */
DNAT tcp -- anywhere anywhere /* kube-system/kube-dns:dns-tcp */ tcp to:172.18.1.5:53

My question is: why do we need the KUBE-MARK-MASQ target when the source IP address (172.18.1.5) is the kube-dns pod's address? Per my understanding, the kube-dns pod's address 172.18.1.5 should appear as the destination IP address, not the source, because all the DNS queries come from other addresses (services); the DNS queries cannot originate from the pod itself.
Here is the full chain information:
# iptables -t nat -L KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-MARK-MASQ tcp -- !172.18.1.0/24 10.0.62.222 /* kube-system/metrics-server cluster IP */ tcp dpt:https
KUBE-SVC-QMWWTXBG7KFJQKLO tcp -- anywhere 10.0.62.222 /* kube-system/metrics-server cluster IP */ tcp dpt:https
KUBE-MARK-MASQ tcp -- !172.18.1.0/24 10.0.213.2 /* kube-system/healthmodel-replicaset-service cluster IP */ tcp dpt:25227
KUBE-SVC-WT3SFWJ44Q74XUPR tcp -- anywhere 10.0.213.2 /* kube-system/healthmodel-replicaset-service cluster IP */ tcp dpt:25227
KUBE-MARK-MASQ tcp -- !172.18.1.0/24 10.0.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- anywhere 10.0.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-MARK-MASQ udp -- !172.18.1.0/24 10.0.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- anywhere 10.0.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-MARK-MASQ tcp -- !172.18.1.0/24 10.0.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- anywhere 10.0.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-NODEPORTS all -- anywhere anywhere /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
# iptables -t nat -L KUBE-SVC-ERIFXISQEP7F7OF4
Chain KUBE-SVC-ERIFXISQEP7F7OF4 (1 references)
target prot opt source destination
KUBE-SEP-V7KWRXXOBQHQVWAT all -- anywhere anywhere /* kube-system/kube-dns:dns-tcp */ statistic mode random probability 0.50000000000
KUBE-SEP-BWCLCJLZ5KI6FXBW all -- anywhere anywhere /* kube-system/kube-dns:dns-tcp */
# iptables -t nat -L KUBE-SEP-V7KWRXXOBQHQVWAT
Chain KUBE-SEP-V7KWRXXOBQHQVWAT (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 172.18.1.5 anywhere /* kube-system/kube-dns:dns-tcp */
DNAT tcp -- anywhere anywhere /* kube-system/kube-dns:dns-tcp */ tcp to:172.18.1.5:53
You can think of Kubernetes service routing in iptables as the following steps (a concrete sketch of the rules behind the last two steps follows the list):

1. Loop through the chain holding all Kubernetes services (KUBE-SERVICES).
2. If you hit a matching service IP and port, jump to that service's chain (KUBE-SVC-...).
3. The service chain randomly selects an endpoint from the list of endpoints (using the statistic probabilities).
4. If the selected endpoint has the same IP as the source address of the traffic, mark the packet for MASQUERADE later; this is the KUBE-MARK-MASQ you are asking about. In other words, if a pod talks to a service IP and that service IP "resolves" back to the pod itself, the packet needs to be marked for MASQUERADE later (the actual MASQUERADE target lives in the POSTROUTING chain, because it is only allowed to happen there).
5. DNAT to the selected endpoint IP and port. This happens regardless of whether step 4 matched.
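To make steps 4 and 5 concrete, here is roughly what kube-proxy programs for this endpoint, shown in iptables-save form. This is a sketch reconstructed from the listings above; the exact comments, chain layout, and the 0x4000 mark value vary by kube-proxy version and configuration:

# Step 4: if the source is the selected endpoint itself (hairpin case), mark the packet
-A KUBE-SEP-V7KWRXXOBQHQVWAT -s 172.18.1.5/32 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-MARK-MASQ
# Step 5: DNAT anything reaching this endpoint chain to the pod IP and port
-A KUBE-SEP-V7KWRXXOBQHQVWAT -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination 172.18.1.5:53
# KUBE-MARK-MASQ itself only sets a firewall mark; nothing is rewritten at this point
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000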
If you look at iptables -t nat -L POSTROUTING, there will be a rule looking for those marked packets; that is where the MASQUERADE actually happens.
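On a typical kube-proxy setup that looks roughly like this (the chain layout is standard, but reference counts, extra rules, and the exact mark value depend on the cluster and kube-proxy version):

# iptables -t nat -L POSTROUTING
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- anywhere anywhere /* kubernetes postrouting rules */
...
# iptables -t nat -L KUBE-POSTROUTING
Chain KUBE-POSTROUTING (1 references)
target prot opt source destination
MASQUERADE all -- anywhere anywhere /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000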
The reason the KUBE-MARK-MASQ rule has to exist is hairpin NAT. The details are somewhat involved, but here is my best attempt:
If MASQUERADE didn't happen, traffic would leave the pod's network namespace as (pod IP, source port -> virtual IP, virtual port), be DNAT'd to (pod IP, source port -> pod IP, pod port), and be sent straight back to the same pod. The traffic would therefore arrive at the serving pod with a source of (pod IP, source port), so when the pod replies, it replies to (pod IP, source port). But the client side of the connection (the kernel's connection tracking, really) is expecting the reply to come back from the same IP and port it originally sent the traffic to, which is (virtual IP, virtual port), and so the reply gets dropped on the way back.
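To make that concrete, here is the address flow for the hairpin case, using the kube-dns pod 172.18.1.5 and the cluster IP 10.0.0.10 from the listings above. The ephemeral port 50000 and the node/bridge IP used after MASQUERADE are illustrative placeholders, not values from this cluster:

Without the MASQUERADE mark:
  pod sends:    172.18.1.5:50000 -> 10.0.0.10:53
  after DNAT:   172.18.1.5:50000 -> 172.18.1.5:53
  pod replies:  172.18.1.5:53 -> 172.18.1.5:50000   (the client expects a reply from 10.0.0.10:53, so this is dropped)

With the MASQUERADE mark set (source == selected endpoint):
  pod sends:    172.18.1.5:50000 -> 10.0.0.10:53
  after DNAT:   172.18.1.5:50000 -> 172.18.1.5:53
  after SNAT:   <node/bridge IP>:port -> 172.18.1.5:53
  pod replies:  172.18.1.5:53 -> <node/bridge IP>:port, and conntrack reverses both NATs, so the client sees the reply as 10.0.0.10:53 -> 172.18.1.5:50000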