I recently started to experience a lot of failed connections between pods on my Kubernetes cluster (v1.8.3-gke.0). Under load (400+ requests per second), requests to a service backed by 200 pods, spread across machines with enough resources, have a failure rate between 1 and 10 percent, which is clearly problematic. The HTTP requests don't fail with a 4xx or 5xx status; they are simply dropped or refused at some point.
Note that the pods are far from maximum capacity: their CPU usage is rarely over 200 millicores.
Even when the cluster is not under heavy load, my monitoring shows many requests failing randomly on services other than the one above, so I suspect an issue at the cluster level (docker? kubernetes? kernel?).
I ran some curl benchmarks to measure failure rates. When an HTTP request fails while running curl in a loop, the displayed error is: curl: (7) Failed to connect to 10.x.x.x port 80: Connection refused
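For anyone who wants to reproduce the measurement without curl, here is a rough Python sketch of the same idea. It is not the exact benchmark I ran; it only opens plain TCP connections (no HTTP request is sent) and tallies how each attempt ends, which is enough to catch "Connection refused" at the connect stage. The names `probe` and `failure_rate` are made up for this example.

```python
import errno
import socket
from collections import Counter


def probe(host, port, attempts=100, timeout=2.0):
    """Open `attempts` TCP connections one after another and tally the outcomes."""
    outcomes = Counter()
    for _ in range(attempts):
        try:
            # A successful connect is closed immediately by the context manager.
            with socket.create_connection((host, port), timeout=timeout):
                outcomes["ok"] += 1
        except ConnectionRefusedError:
            # Same condition curl reports as "(7) ... Connection refused".
            outcomes["refused"] += 1
        except socket.timeout:
            outcomes["timeout"] += 1
        except OSError as exc:
            # Any other socket-level error, keyed by its errno name if known.
            outcomes[errno.errorcode.get(exc.errno, "other")] += 1
    return outcomes


def failure_rate(outcomes):
    """Fraction of attempts that did not end in a successful connect."""
    total = sum(outcomes.values())
    return 0.0 if total == 0 else 1.0 - outcomes["ok"] / total
```

From inside a pod this could be called as `failure_rate(probe("svc", 80, attempts=1000))`, where `svc` stands in for the service name, as in the error message from our production code. Splitting refused connections from timeouts may help tell an endpoint or routing problem apart from packet loss.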
Our production code reports a similar error message, although most requests succeed: Cannot connect to host svc:80 ssl:False [Connect call failed ('10.x.x.x', 80)]
Do you have any idea what is going wrong, or how I can track this issue down?