EDIT: Rebuilding the cluster from scratch resolved the issue. If I switch traffic back to the old cluster, the issue remains.
So it now appears that I had a broken Kubernetes cluster, and I am not sure how to detect or diagnose this if it happens again. Possible causes: a failing EC2 network adapter, a failed service on one node or the master, failing kube-proxy pods, or some other kube-system issue...
If anyone has suggestions for how to detect the source of intermittent refused connections between a Pod and a ClusterIP, I would love to find the root cause.
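In case it helps anyone suggest something, the checks I assume would be the starting point look roughly like this (the pod name is a placeholder; assuming kube-proxy runs as pods in the kube-system namespace):

    # Is kube-proxy healthy, and is it logging errors on any node?
    # (assuming kube-proxy runs as pods in kube-system)
    kubectl get pods -n kube-system -o wide | grep kube-proxy
    kubectl logs -n kube-system <kube-proxy-pod-name> | grep -i error

    # Any kernel-level networking complaints on the node hosting the nginx pod,
    # e.g. a full conntrack table?
    sudo dmesg | grep -i conntrack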
I have set up an nginx Kubernetes service (behind an external AWS load balancer provisioned by Kubernetes). This nginx service proxies to several internal services by ClusterIP.
It looks like this:
ELB ---> Nginx Proxy (PodA) ---> Internal Service (PodB)
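For context, the nginx side is nothing special; a simplified sketch of the proxy config (the service name and hostnames are placeholders for my real ones):

    # Simplified sketch of the nginx config in PodA; "internal-api" and the
    # server_name are placeholders
    server {
        listen 80;
        server_name api.acme.com;

        location / {
            # the service DNS name resolves to the ClusterIP
            # (100.67.38.167 in the error log below)
            proxy_pass http://internal-api.default.svc.cluster.local;
            proxy_set_header Host $host;
        }
    }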
I see random 502 Connection refused errors for one of these (very busy) services. The nginx error logs look like this:
2019/10/15 19:27:57 [error] 13#0: *71355 connect() failed (111: Connection refused) while connecting to upstream, client: X.X.X.X, server: api.acme.com, request: "GET /XYZ HTTP/1.1", upstream: "http://100.67.38.167:80/XYZ", host: "acme.com"
I get about 10-15 of these 502 errors per second (maybe 1% of total requests).
My internal service (at ClusterIP 100.67.38.167) did not appear to be over capacity, but I tried scaling it anyway and doubled the number of pods. The number of 502 errors remained exactly the same.
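For reference, the scaling was done roughly like this, and I assume checking the service's endpoints is a reasonable sanity check too (the deployment/service name is a placeholder):

    # Placeholder names; substitute the real deployment/service
    kubectl scale deployment internal-api --replicas=8

    # Confirm every ready pod is actually listed behind the ClusterIP;
    # a refused connection can come from a stale or missing endpoint
    kubectl get endpoints internal-api
    kubectl describe service internal-api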
Just to be certain there was no issue with the internal service, I added a second container to its pod, running its own nginx proxy. So the connection looks like:
ELB ---> Nginx Proxy1 (PodA) ---> Nginx Proxy2 (PodB) ---> Internal Service (PodB)
The second nginx proxy sees zero errors, so all of the errors appear to happen between the first nginx proxy and the ClusterIP address. Does this mean kube-proxy is over capacity?
Is there any way to determine if this is the case?
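My best guess for how to check is to look at conntrack statistics and at the iptables rules kube-proxy writes for this service, roughly like this (assuming kube-proxy runs in iptables mode and the conntrack CLI is installed on the node):

    # On the node where the nginx proxy pod is scheduled
    sudo conntrack -S    # per-CPU counters; non-zero insert_failed/drop is suspicious
    sudo conntrack -C    # current number of tracked connections
    cat /proc/sys/net/netfilter/nf_conntrack_max

    # The iptables rules kube-proxy generated for this ClusterIP; each ready
    # pod should appear as a KUBE-SEP-* target behind the KUBE-SVC-* chain
    sudo iptables-save | grep 100.67.38.167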
Note: This is a 14-node cluster, and adding a node (and moving some pods over to it) doesn't reduce the 502s either.
I have also tried:
(1) adding and removing keep-alive timeouts for the nginx proxy_pass (sketch below);
(2) resolving the internal service by IP address and by DNS name;
(3) scaling kube-dns up.
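Roughly what the keep-alive variant in (1) looked like (the upstream name and service hostname are placeholders):

    # Sketch of the keep-alive variant of the proxy config
    upstream internal_api {
        server internal-api.default.svc.cluster.local:80;
        keepalive 32;               # pool of idle connections to the upstream
        keepalive_timeout 60s;      # the timeout I added/removed while testing
    }

    server {
        listen 80;
        location / {
            proxy_pass http://internal_api;
            proxy_http_version 1.1;          # required for upstream keepalive
            proxy_set_header Connection "";  # don't forward "Connection: close"
        }
    }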
Does anyone know what could be happening?