We have a number of different REST-based services running in a Kubernetes (version 1.9.6) cluster on Azure.
Two of the services, let's say A and B, need to communicate with each other using REST calls. Typically, something like the following:
Client calls A (original request)
A calls B (request 1)
B calls A (request 2)
A responds to B (request 2)
B responds to A (request 1)
A responds to the original request
The above is a typical intertwined microservices architecture. Running the Docker containers manually works perfectly on our local test servers.
The moment we run this in Kubernetes on Azure, we get intermittent timeouts (60+ seconds) on the microservices calling each other through Kubernetes' networking. After a timeout, repeating the request often gives a correct response within microseconds.
I am stuck at this point as I have no idea what could be causing this. Could it be the dynamic routing? The virtualised network? Kubernetes configuration?
Any ideas?
As you describe it, it's probably not a Docker or Kubernetes issue. Instead, check whether B is responding to A (request 1) before A has responded to B (request 2), and if so, check whether A then fails to respond to the original request.
You could set up logging to see if this is happening, or debug it if you can reproduce it on your machine.
So I ran into this as well.
Basically, there is some sort of network timeout on AKS that cuts all connections out of a pod. As you mentioned, this results in seemingly random errors that are difficult to troubleshoot, since you only see them once (hitting the same service again gives the expected correct behavior).
More details on my question here: What Azure Kubernetes (AKS) 'Time-out' happens to disconnect connections in/out of a Pod in my Cluster?
In my case, AKS (or potentially Kubernetes) was severing my Ghost blog's connection to its database after some time, without notifying the service. That resulted in strange errors, because the service did not realize it had been disconnected and kept trying to use a connection it expected to still be available / open.
That's not a solution, just more background!
I am debating whether to open a ticket on the Azure AKS GitHub repo (and through my support subscription) to request more information. If I hear back, I will update this answer!
Finally figured this out.
Azure Load Balancers / Public IP addresses have a default 4-minute idle connection timeout.
Essentially anything running through a Load Balancer (whether created by an Azure AKS Kubernetes Ingress or otherwise) has to abide by this. While you CAN change the timeout, there is no way to eliminate it entirely (the longest idle timeout possible is 30 minutes).
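If your service is exposed through a Kubernetes `LoadBalancer` Service on AKS, the idle timeout can be raised (though not removed) with an annotation on the Service. A minimal sketch, assuming the `service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout` annotation supported by the Azure cloud provider (value in minutes; check that your AKS / cloud-provider version honors it, and the service name/ports here are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service            # illustrative name
  annotations:
    # Raise the Azure LB idle timeout from the default 4 minutes
    # to the 30-minute maximum mentioned above (value in minutes).
    service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout: "30"
spec:
  type: LoadBalancer
  selector:
    app: my-service
  ports:
    - port: 80
      targetPort: 8080
```

Even at 30 minutes the timeout still exists, so long-lived connections (such as database connections) still need to be kept alive or recycled.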
For that reason, it makes sense to implement a connection pooling / monitoring solution that tracks how long each connection (through the load balancer / public IP) has been idle and disconnects / re-creates any connection that gets close to the 4-minute cutoff.
We ended up running PgBouncer (https://github.com/pgbouncer/pgbouncer) as an additional container in our Azure AKS / Kubernetes cluster, following the excellent directions here: https://github.com/edoburu/docker-pgbouncer/tree/master/examples/kubernetes/singleuser
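For reference, a stripped-down sketch of that sidecar approach, assuming the `edoburu/pgbouncer` image and a `DATABASE_URL` environment variable as in the linked example (the deployment name, app image, ports and secret below are illustrative; the linked repo has the authoritative manifests):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                     # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:latest     # illustrative; your application image
          env:
            # The app talks to PgBouncer on localhost instead of opening
            # its own long-lived connection through the load balancer.
            - name: DB_HOST
              value: "127.0.0.1"
            - name: DB_PORT
              value: "5432"
        - name: pgbouncer
          image: edoburu/pgbouncer  # image used in the linked example
          env:
            # Connection string to the real database,
            # e.g. postgres://user:pass@db-host:5432/dbname
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials   # illustrative secret
                  key: database-url
          ports:
            - containerPort: 5432
```

PgBouncer then owns the server-side connections and can recycle them before the idle cutoff, so the application never sees a silently dropped connection.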
Overall I can see the need for the timeout, but MAN was it hard to troubleshoot. Hope this saves you guys some time!
More details can be found on my full post over here: What Azure Kubernetes (AKS) 'Time-out' happens to disconnect connections in/out of a Pod in my Cluster?