Intermittent 502 Bad Gateway error in Kubernetes pods

9/24/2021

We are running Kubernetes on AWS, deployed with kops, and using NGINX as our ingress controller. It worked fine for almost two years, but recently we started getting 502 Bad Gateway errors from multiple pods at random.

The ingress controller logs show the 502s:

[23/Sep/2021:10:53:43 +0000] "GET /service HTTP/2.0" 502 559 "https://mydomain/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36" 4691 0.040 [default-myservice-80] 100.96.13.157:80, 100.96.13.157:80, 100.96.13.157:80 0, 0, 0 0.000, 0.000, 0.000 502, 502, 502 258a09eaaddef85cae2a0c2f706ce06b
..
[error] 1050#1050: *1352377 connect() failed (111: Connection refused) while connecting to upstream, client: CLIENT_IP_HERE , server: my.domain.com , request: "GET /index.html HTTP/2.0", upstream: "http://POD_IP:8080/index.html", host: "my.domain.com", referrer: "https://my.domain/index.html"
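
To relate these log entries to the Service, something like the following can show whether 100.96.13.157 was still registered as a ready endpoint when a 502 happened (the Service name "myservice" and the "default" namespace are guesses based on the "default-myservice-80" backend name in the access log; adjust to the real names):

kubectl -n default get endpoints myservice -o wide   # is 100.96.13.157 still listed?
kubectl get pods -A -o wide | grep 100.96.13.157     # which pod owns that IP, and is it Ready?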

From the ingress controller pod, we tried connecting directly to the pod IP that was returning 502:

www-data@nginx-ingress-controller-664f488479-7cp57:/etc/nginx$ curl 100.96.13.157
curl: (7) Failed to connect to 100.96.13.157 port 80: Connection refused

The connection was refused.
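
Since the connection is refused rather than timing out, it may be that nothing is bound to port 80 inside the pod at that moment. A rough way to check (the pod name below is a placeholder, and the container image needs ss or netstat installed):

kubectl -n default exec <myservice-pod-name> -- ss -tlnp                                    # what is actually listening inside the pod
kubectl -n default get pod <myservice-pod-name> -o jsonpath='{.spec.containers[*].ports}'   # declared containerPorts

This would also confirm which port each backend container really serves on, since the error log shows an upstream on :8080 while the access log entry shows :80.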

We captured traffic with tcpdump on the node hosting the pod that returned the 502:

root@node-ip:/home/admin# tcpdump -i cbr0 dst 100.96.13.157
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on cbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:39:16.779950 ARP, Request who-has 100.96.13.157 tell 100.96.13.22, length 28
17:39:16.780207 IP 100.96.13.22.57610 > 100.96.13.157.http: Flags [S], seq 2263585697, win 26883, options [mss 8961,sackOK,TS val 1581767928 ecr 0,nop,wscale 9], length 0
17:39:21.932839 ARP, Reply 100.96.13.22 is-at 0a:58:64:60:0d:16 (oui Unknown), length 28


root@node-ip:/home/admin# ping 100.96.13.157
PING 100.96.13.157 (100.96.13.157) 56(84) bytes of data.
64 bytes from 100.96.13.157: icmp_seq=1 ttl=64 time=0.309 ms
64 bytes from 100.96.13.157: icmp_seq=2 ttl=64 time=0.042 ms
64 bytes from 100.96.13.157: icmp_seq=3 ttl=64 time=0.044 ms

It looks like the pods can reach each other and ping works (though ICMP is answered by the kernel, so ping can succeed even when nothing is listening on TCP port 80). We also captured traffic coming from the pod:

root@node-ip:/home/admin# tcpdump -i cbr0 src 100.96.13.157
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on cbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:39:16.780076 ARP, Reply 100.96.13.157 is-at 0a:58:64:60:0d:9d (oui Unknown), length 28
17:39:16.780175 ARP, Reply 100.96.13.157 is-at 0a:58:64:60:0d:9d (oui Unknown), length 28
17:39:16.780238 IP 100.96.13.157.http > 100.96.13.22.57610: Flags [R.], seq 0, ack 2263585698, win 0, length 0
17:39:21.932808 ARP, Request who-has 100.96.13.22 tell 100.96.13.157, length 28

Here the ingress controller sends the SYN, but the pod immediately resets the connection (flag [R.] = RST-ACK in tcpdump), so the HTTP request is never served.
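
A SYN that is answered with RST-ACK usually means no process was listening on that port in the pod's network namespace at that instant (for example, the container was restarting or still starting up). A rough way to correlate this (the pod name is a placeholder and the "default" namespace is assumed):

kubectl -n default describe pod <myservice-pod-name>      # restart count, Events, readiness/liveness probe failures
kubectl -n default logs <myservice-pod-name> --previous   # previous container's log, if it restarted

For the next occurrence, a single capture of both directions may also be easier to line up than separate dst/src captures:

tcpdump -i cbr0 -nn host 100.96.13.157 and tcp port 80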

We don't know where this connection is getting lost. We checked our Service and pod labels, and everything appears to be configured correctly. Most of the time my.domain.com is accessible, so the issue looks intermittent. Is there any other place we should check for logs, or has anyone experienced the same issue? Thanks in advance.
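
For reference, the other logs we know of to check around the timestamp of a 502 are kube-proxy and kubelet (the kube-proxy label selector below is an assumption about how kops labels those pods; adjust as needed):

kubectl -n kube-system logs -l k8s-app=kube-proxy --since=1h
journalctl -u kubelet --since "2021-09-23 10:45" --until "2021-09-23 11:00"   # on the node hosting the failing pod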

-- Shreyank Sharma
kube-proxy
kubernetes

0 Answers