Kubernetes Bridge networking issue

2/3/2022

We are running our applications in a Kubernetes (1.11) cluster installed through kops. (It is our DEV/QA cluster, inherited from an employee who is no longer with the company.)

Mostly everything works fine, but sometimes after a deployment the pods return connection refused errors. We noticed because Nginx was reporting 502 errors from the backend.

Sometimes the pod starts working again on its own and then fails again later. Restarting the pod resolves the issue until the next deployment, when it happens again.

We compared the syslog with that of another cluster, but everything looks similar.

tcpdump output for the pod's IP:

11:06:47.387766 IP 100.96.13.22.57778 > 100.96.12.137.http-alt: Flags [S], seq 1515889791, win 26883, options [mss 8961,sackOK,TS val 132113284 ecr 0,nop,wscale 9], length 0
11:06:47.387775 IP 100.96.13.22.57778 > 100.96.12.137.http-alt: Flags [S], seq 1515889791, win 26883, options [mss 8961,sackOK,TS val 132113284 ecr 0,nop,wscale 9], length 0
11:06:47.387777 IP 100.96.13.22.57778 > 100.96.12.137.http-alt: Flags [S], seq 1515889791, win 26883, options [mss 8961,sackOK,TS val 132113284 ecr 0,nop,wscale 9], length 0
11:06:47.387781 IP 100.96.12.137.http-alt > 100.96.13.22.57778: Flags [R.], seq 0, ack 1515889792, win 0, length 0
11:06:47.387781 IP 100.96.12.137.http-alt > 100.96.13.22.57778: Flags [R.], seq 0, ack 1, win 0, length 0
11:06:47.387785 IP 100.96.12.137.http-alt > 100.96.13.22.57778: Flags [R.], seq 0, ack 1, win 0, length 0

As seen in the capture, the ingress-nginx pod (100.96.13.22) tries to connect to the webapp pod (100.96.12.137), but the connection is immediately reset.
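
To see the SYN/RST pattern at a glance, the flags in a saved capture can be tallied. This is a minimal sketch that assumes the tcpdump text format shown above; count_flags is a hypothetical helper name, not part of tcpdump.

```shell
# Hypothetical helper: tally TCP flag fields from tcpdump text output.
# stdin: tcpdump lines containing "Flags [..]"; stdout: "<flags> <count>" per line.
count_flags() {
    sed -n 's/.*Flags \[\([^]]*\)\].*/\1/p' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Usage on a saved capture (the file name is only an example):
#   count_flags < capture.txt
```

An equal number of S and R. entries, with no S. (SYN-ACK) entries, would match the behaviour described above: every connection attempt is reset instead of accepted.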

Our Investigation:

We did some reading on how Kubernetes networking works (bridge networking, veth pairs):

https://medium.com/practo-engineering/networking-with-kubernetes-1-3db116ad3c98
https://stackoverflow.com/questions/37860936/find-out-which-network-interface-belongs-to-docker-container
https://www.digitalocean.com/community/tutorials/how-to-inspect-kubernetes-networking#finding-and-entering-pod-network-namespaces

According to these, the pod's interface is connected through a veth pair to the node's bridge interface. Any traffic to or from the pod goes through this bridge (in our case cbr0).

Troubleshooting:

We got the affected pod's container ID by running:

docker ps

Get the container's process ID:

docker inspect --format '{{ .State.Pid }}' container-ID

Get the pod's network details:

nsenter -t container-pid -n ip addr
output:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 inet 127.0.0.1/8 scope host lo
 valid_lft forever preferred_lft forever
 inet6 ::1/128 scope host 
 valid_lft forever preferred_lft forever
3: eth0@if380852: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default 
 link/ether 0a:58:64:60:0d:27 brd ff:ff:ff:ff:ff:ff link-netnsid 0
 inet 100.96.13.39/24 scope global eth0
 valid_lft forever preferred_lft forever
 inet6 fe80::e4cd:81ff:fe96:2914/64 scope link 
 valid_lft forever preferred_lft forever

eth0@if380852 is the pod's network interface.

380852 is the interface index of the veth peer on the node.

0a:58:64:60:0d:27 is the pod's MAC address.
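
Reading the peer index off the eth0@if... line can be scripted. This is a minimal sketch that assumes the pod's interface is named eth0, as in the output above; peer_ifindex is a hypothetical helper name.

```shell
# Hypothetical helper: print the peer interface index of eth0.
# stdin: `ip addr` (or `ip link`) output taken inside the pod's network namespace.
peer_ifindex() {
    sed -n 's/^[0-9][0-9]*: eth0@if\([0-9][0-9]*\):.*/\1/p'
}

# Usage on the node (container-pid as obtained from docker inspect):
#   nsenter -t container-pid -n ip addr | peer_ifindex
```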

Get the pod's VETH pair details

ip addr | grep 380852
output:
380852: vethd49cda8b@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master cbr0 state UP group default

Here vethd49cda8b is the node-side veth interface of the pod.
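
A plain grep for the index can also match unrelated lines that happen to contain the same digits. This is a minimal sketch of an exact match on the index field, assuming the "index: name@peer:" line format shown above; veth_by_ifindex is a hypothetical helper name.

```shell
# Hypothetical helper: print the interface name that has a given index.
# $1: interface index; stdin: `ip addr` or `ip -o link` output from the node.
veth_by_ifindex() {
    awk -v idx="$1" -F': ' '$1 == idx { sub(/@.*/, "", $2); print $2 }'
}

# Usage on the node:
#   ip addr | veth_by_ifindex 380852
```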

Next we checked the bridge details.

Get the bridge mac table:

brctl showmacs cbr0 | grep 0a:58:64:60:0d:27
output:
27 0a:58:64:60:0d:27 no 1.44

Here 27 is the bridge port number on which this MAC address was learned.

Check which veth interface owns that port:

brctl showstp cbr0 | grep "(27)"
output:
veth488082e8 (27)

We can see that port 27 belongs to a different veth interface. The expected output would be the pod's veth interface with the port from the bridge MAC table:

vethd49cda8b (27)

Let's get the actual port of the pod's veth interface:

brctl showstp cbr0 | grep vethd49cda8b
output:
vethd49cda8b (52)

We can see that traffic to the pod is getting lost because of the wrong port in the bridge MAC table.

The bridge MAC table should show port 52 for the container's MAC address, but it shows 27.

After some time it shows the correct port again, which resolves our connection issue.
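
The manual comparison above could be scripted so it can be rerun after each deployment. This is a minimal sketch that assumes the brctl output formats shown above; fdb_port_for_mac, stp_port_for_iface and compare_ports are hypothetical helper names that only parse text.

```shell
# stdin: `brctl showmacs <bridge>` output; $1: MAC address.
# Prints the port number of the learned (non-local) entry for that MAC.
fdb_port_for_mac() {
    awk -v mac="$1" 'tolower($2) == tolower(mac) && $3 == "no" { print $1 }'
}

# stdin: `brctl showstp <bridge>` output; $1: interface name.
# Prints the port number from the "<iface> (<port>)" header line.
stp_port_for_iface() {
    awk -v ifc="$1" '$1 == ifc && $2 ~ /^\([0-9]+\)$/ { gsub(/[()]/, "", $2); print $2 }'
}

# $1: port from the MAC table, $2: port the veth is actually attached to.
compare_ports() {
    if [ "$1" = "$2" ]; then
        echo "ok: MAC table points at port $2"
    else
        echo "mismatch: MAC table says port $1, veth is on port $2"
    fi
}

# Usage on the node, with the values from this post:
#   fdb=$(brctl showmacs cbr0 | fdb_port_for_mac 0a:58:64:60:0d:27)
#   actual=$(brctl showstp cbr0 | stp_port_for_iface vethd49cda8b)
#   compare_ports "$fdb" "$actual"
```

If the iproute2 bridge tool is installed, `bridge fdb show br cbr0` should list the device name next to each MAC address, which would avoid the port-number lookup; that variant is untested here.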

We don't know why the bridge MAC table ends up with the wrong entry or what causes it.

Has anyone faced a similar issue?

Is there a fix for this, or how can we troubleshoot it further?

Thanks in advance

-- Abhishek Jayaram
docker
kops
kubernetes
linux
networking
