How does kube-proxy handle persistent connections to a Service between pods?

12/9/2020

I've seen scenarios where requests from one workload, sent to a ClusterIP service for another workload with no session affinity configured, only get routed to a subset of the associated pods. The Endpoints object for this service does show all of the pod IPs.

I did a little experiment to figure out what is happening.

Experiment

I set up minikube to have a "router" workload with 3 replicas sending requests to a "backend" workload also with 3 pods. The router just sends a request to the service name like http://backend.

I sent 100 requests to the router service via http://$MINIKUBE_IP:$NODE_PORT, since it's exposed as a NodePort service. Then I observed which backend pods actually handled requests. I repeated this test multiple times.
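For reference, a test loop like this can be sketched in Go. This is illustrative rather than the exact harness I used: it assumes each backend replies with its own pod name, and it reads MINIKUBE_IP and NODE_PORT from the environment.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Hit the router's NodePort repeatedly and tally which backend pod answered,
	// assuming each backend echoes its own pod name in the response body.
	url := fmt.Sprintf("http://%s:%s", os.Getenv("MINIKUBE_IP"), os.Getenv("NODE_PORT"))
	counts := map[string]int{}
	for i := 0; i < 100; i++ {
		resp, err := http.Get(url)
		if err != nil {
			panic(err)
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		counts[string(body)]++
	}
	fmt.Println(counts) // e.g. map[backend-abc:54 backend-def:46]
}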

In most cases, only 2 backend pods handled any requests, with the occasional run where all 3 did. I didn't see any runs where every request went to a single backend in these experiments, though I have seen that happen before while running other tests in AKS.

This led me to the theory that each router keeps a persistent connection to whichever backend pod it first connects to. Given 3 routers and 3 backends, there's an 11% chance that all 3 routers "stick" to a single backend, a 67% chance that between them they stick to exactly 2 of the backends, and a 22% chance that each router sticks to a different backend pod (1-to-1).

Here's one possible combination of router-to-backend connections (out of 27 possible): three routers sticking to 2 backends
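Those percentages come from enumerating the 3^3 = 27 equally likely ways the three routers can each stick to one of the three backends: 3 assignments use a single backend (11%), 18 use exactly two (67%), and 6 are one-to-one (22%). A quick Go sketch to verify the counts:

package main

import "fmt"

func main() {
	// Count the 27 router-to-backend assignments by how many distinct backends they use.
	counts := map[int]int{}
	for a := 0; a < 3; a++ {
		for b := 0; b < 3; b++ {
			for c := 0; c < 3; c++ {
				used := map[int]bool{a: true, b: true, c: true}
				counts[len(used)]++
			}
		}
	}
	fmt.Println(counts) // map[1:3 2:18 3:6]
}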

Disabling HTTP Keep-Alive

If I use a Transport that disables HTTP keep-alives in the router's HTTP client, then the requests I make to the router are distributed uniformly across the different backends on every test run, as desired.

// Open a fresh TCP connection for every request instead of reusing
// persistent (keep-alive) connections from the Transport's pool.
client := http.Client{
	Transport: &http.Transport{
		DisableKeepAlives: true,
	},
}
resp, err := client.Get("http://backend")

So the theory seems accurate. But here's my question:

  • How does the router using HTTP KeepAlive / persistent connections actually result in a single connection between one router pod and one backend pod?
    • There is a kube-proxy in the middle, so I'd expect any persistent connections to be between the router pod and kube-proxy as well as between kube-proxy and the backend pods.
    • Also, when the router does a DNS lookup, it's going to find the Cluster IP of the backend service every time, so how can it "stick" to a Pod if it doesn't know the Pod IP?

Using Kubernetes 1.17.7.

-- Andrew D.
kubernetes

1 Answer

12/9/2020

This excellent article covers your question in detail.
Kubernetes Services indeed do not load balance long-lived TCP connections.

Under the hood, Services (in most cases) use iptables to distribute connections between pods. But iptables wasn't designed as a load balancer; it's a firewall, and it isn't capable of high-level load balancing.
As a rough substitute, iptables can create (or not create) a connection to a certain target with some probability, and can therefore be used as an L3/L4 balancer. This mechanism is what kube-proxy employs to approximate load balancing.

Does iptables use round-robin?

No, iptables is primarily used for firewalls, and it is not designed to do load balancing.
However, you could craft a smart set of rules that could make iptables behave like a load balancer.
And this is precisely what happens in Kubernetes.

If you have three Pods, kube-proxy writes rules to this effect (an illustrative set of the actual iptables rules follows the list):

  • select Pod 1 as the destination with a probability of 33%. Otherwise, move to the next rule
  • select Pod 2 as the destination with a probability of 50%. Otherwise, move to the following rule
  • select Pod 3 as the destination (no probability)
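For a concrete picture, the NAT rules kube-proxy (in iptables mode) writes for a three-endpoint Service look roughly like the sketch below. The chain names, port and pod IP are placeholders, but -m statistic --mode random --probability is the real mechanism. Note that the cascading probabilities give an even split overall: Pod 2 is chosen with (1 - 1/3) × 1/2 = 1/3, and Pod 3 gets the remaining 1/3.

# Illustrative only: chain names, ports and IPs are made up
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.33333333333 -j KUBE-SEP-POD1
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-POD2
-A KUBE-SVC-EXAMPLE -j KUBE-SEP-POD3
# Each KUBE-SEP-* chain then DNATs to its pod, e.g.:
-A KUBE-SEP-POD1 -p tcp -m tcp -j DNAT --to-destination 10.244.0.11:8080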

What happens when you use keep-alive with a Kubernetes Service?

Let's imagine that front-end and backend support keep-alive.
You have a single instance of the front-end and three replicas for the backend.
The front-end makes the first request to the backend and opens a TCP connection.
The request reaches the Service, and one of the Pods is selected as the destination.
The backend Pod replies and the front-end receives the response.
But instead of closing the TCP connection, it is kept open for subsequent HTTP requests.
What happens when the front-end issues more requests?
They are sent to the same Pod.

Isn't iptables supposed to distribute the traffic?
It is.
There is a single open TCP connection, and the iptables rules were only invoked when it was first established.
One of the three Pods was selected as the destination.
Since all subsequent requests are channelled through the same TCP connection, iptables isn't invoked anymore.
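You can observe this from the client side with Go's net/http/httptrace: with keep-alives enabled, every request after the first reports a reused connection. A minimal sketch, using http://backend as in the question. Note that the remote address printed is the Service's ClusterIP; the DNAT to a particular pod happens in the kernel and is invisible to the client, which is also why the client can "stick" to a pod without ever knowing its IP.

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
)

func main() {
	for i := 0; i < 5; i++ {
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) {
				// reused=true on every request after the first means the same
				// TCP connection (and therefore the same pod) is serving them all.
				fmt.Printf("reused=%v remote=%v\n", info.Reused, info.Conn.RemoteAddr())
			},
		}
		req, _ := http.NewRequest("GET", "http://backend", nil)
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}
		// Drain and close the body so the connection goes back into the pool.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
}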

Also, it's not quite correct to say that kube-proxy is in the middle.
It isn't: kube-proxy by itself doesn't handle any traffic.
All it does is create the iptables rules.
It's iptables, in the kernel, that actually intercepts connections, picks a destination and does the DNAT, etc.
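If you want to see this on a node, both the rules kube-proxy wrote and the connection-tracking entry that pins an established connection to one pod are visible from the kernel side. A rough sketch, assuming shell access to the node and the conntrack tool installed (the grep patterns are placeholders for your Service):

# NAT rules kube-proxy generated for the Service
iptables-save -t nat | grep backend
# conntrack entry holding the DNAT for an established connection
conntrack -L | grep <cluster-ip>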

Similar question here.

-- Olesya Bolobova
Source: StackOverflow