In Kubernetes, services talk to each other via a service IP. With iptables (or something similar), each TCP connection is transparently routed to one of the pods backing the called service. If the calling service does not close the TCP connection (e.g. because it uses TCP keepalive or a connection pool), it stays connected to a single pod and never uses the other pods that may be spawned.
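For illustration, something like this reproduces the issue (Go; backend-svc is just a placeholder Service name): the default client keeps the connection alive, so every request after the first keeps hitting whichever pod iptables happened to pick:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Minimal sketch; "backend-svc" stands in for the ClusterIP Service name.
	// The default transport keeps connections alive, so after the first request
	// the client keeps talking to the same pod the iptables rules selected.
	client := &http.Client{}
	for i := 0; i < 100; i++ {
		resp, err := client.Get("http://backend-svc/work")
		if err != nil {
			fmt.Println(err)
			continue
		}
		io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
		resp.Body.Close()
	}
}
```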
What is the correct way to handle such a situation?
My own unsatisfying ideas:
I could disable keep-alive so that every request opens a fresh connection, but am I making every call slower only to be able to distribute requests to different pods? Doesn't feel right.
I could force the caller to open multiple connections (assuming it would then distribute the requests across these connections), but how many should it open? The caller has no idea (and probably should have no idea) how many pods there are.
I could limit the resources of the called service so that it becomes slow under concurrent requests and the caller opens more connections (hopefully to other pods). Again, I don't like the idea of arbitrarily slowing down requests, and this would only work for CPU-bound services.
The keep-alive behavior can be tuned via options specified in the Keep-Alive general header, e.g.:
Connection: Keep-Alive
Keep-Alive: max=10, timeout=60
Thus, you can re-open a TCP connection after a specific timeout or after a maximum number of HTTP transactions, rather than for each API request.
Keep in mind that timeout and max are not guaranteed.
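If you control the client, you can get a similar effect by letting pooled connections expire. A sketch using Go's http.Transport (the service name and timeouts are assumptions): each time an idle connection is dropped, the next request opens a new one, which kube-proxy balances again:

```go
package main

import (
	"io"
	"net/http"
	"time"
)

func main() {
	// Sketch only; "backend-svc" and the timeout values are placeholders.
	transport := &http.Transport{
		MaxIdleConnsPerHost: 10,               // keep a small pool of reusable connections
		IdleConnTimeout:     60 * time.Second, // drop (and later re-balance) idle connections after 60s
	}
	client := &http.Client{Transport: transport, Timeout: 5 * time.Second}

	resp, err := client.Get("http://backend-svc/work")
	if err != nil {
		return
	}
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
}
```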
EDIT:
Note that if you use a k8s Service you can choose between two LB modes:
iptables proxy mode (by default, kube-proxy in iptables mode chooses a backend at random)
IPVS proxy mode, which provides more options for balancing traffic to backend Pods:
rr: round-robin
lc: least connection (smallest number of open connections)
dh: destination hashing
sh: source hashing
sed: shortest expected delay
nq: never queue
Check the Kubernetes documentation on Service proxy modes for more details.
One mechanism to do this might be to load balance in a layer underneath the TCP connection termination. For example, you could split your service into two: a microservice (let's call it frontend-svc) that does connection handling and maybe some authn/z, and a separate service that does your business logic/processing.
clients <---persistent connection---> frontend-svc <----GRPC----> backend-svc
frontend-svc can then make calls to your backend in a more granular fashion, for example using gRPC, and really load balance among the workers in the layer below. This means the pods that are part of frontend-svc aren't doing much work and are completely stateless (and therefore have less need for load balancing), which means you can also control them with an HPA, provided you have some draining logic to ensure you don't terminate existing connections.
This is a common approach used by SSL-terminating proxies and similar systems to deal with connection termination separately from load balancing.
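As a concrete sketch of the frontend-svc -> backend-svc hop, the gRPC client can resolve all backend pods via a headless Service and spread RPCs across them with the round_robin policy. The names backend-svc-headless and port 50051 are assumptions:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "backend-svc-headless:50051" is an assumption: a headless Service, so DNS
	// returns the individual pod IPs and the gRPC resolver sees every backend.
	conn, err := grpc.Dial(
		"dns:///backend-svc-headless:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Spread RPCs round-robin across all resolved backend pods instead of
		// pinning every request to a single connection.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... create generated client stubs on conn and issue RPCs as usual.
}
```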