Pod-to-pod communication where the 2 pods are on the same node fails sporadically (EKS 1.13)

10/24/2019

The symptom

Requests to applications sporadically return an HTTP 504 or hang for a long time (a multiple of 12s).

We see the problem on pod-to-pod communication where the two pods are on the same node in Kubernetes.

E.g.:
- from an nginx ingress pod to an application pod on the same node
- from an application pod to another application pod on the same node
- from an application pod to a rabbitmq eventbus pod on the same node

Our infrastructure

EKS with classic ELBs (both internal and external, not network load balancers) in front of the nginx ingress Service. The load balancer Services use externalTrafficPolicy: Local. EKS 1.13 with node version 1.13.8 (EKS-optimized AMI).
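For context, such a load balancer Service looks roughly like the following. This is a minimal sketch with placeholder names, labels and ports, not our actual manifest:

    # Hypothetical nginx ingress Service with externalTrafficPolicy: Local;
    # name, namespace, selector and ports are placeholders.
    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-ingress-controller
      namespace: kube-system
    spec:
      type: LoadBalancer
      # Only route external traffic to endpoints on the local node and keep client source IPs
      externalTrafficPolicy: Local
      selector:
        app: nginx-ingress
      ports:
      - name: http
        port: 80
        targetPort: 80
      - name: https
        port: 443
        targetPort: 443
    EOF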

TCPDUMP

The following is a useful tcpdump output from an application pod trying to connect to the eventbus, which fails. It succeeds after a couple of retries most of the time (usually after 12s):

13:44:46.744764 IP customer-reports-service-5b4d8c48b-vj4db.35196 > eventbus-rabbitmq.kube-system.svc.cluster.local.5672: Flags [S], seq 1434468571, win 26883, options [mss 8961,sackOK,TS val 4064032250 ecr 0,nop,wscale 7], length 0

13:44:46.751000 IP ip-10-0-161-173.eu-west-1.compute.internal > customer-reports-service-5b4d8c48b-vj4db: ICMP time exceeded in-transit, length 68

Info on this tcpdump:
1. The application pod makes a request to the eventbus pod on the same node.
2. The node sends an ICMP time exceeded back to the application pod.

The request probably never reaches the eventbus.
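For anyone who wants to reproduce such a capture: it was taken from inside the application pod. Something along these lines works, assuming tcpdump is available in the container image (or in an attached debug container); the namespace is an example:

    # Capture SYNs to the eventbus port plus any ICMP from inside the application pod.
    # Namespace and port are examples; the pod name is the one from the dump above.
    kubectl exec -n default customer-reports-service-5b4d8c48b-vj4db -- \
      tcpdump -nn -i eth0 'port 5672 or icmp'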

Possible workaround

Use pod anti-affinity to make sure that each eventbus pod, each nginx ingress pod and each API gateway pod runs on a different node than our application services.
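A rough sketch of what that could look like on the eventbus deployment. The tier=application label (assumed to be on all application pods), the rabbitmq image and the ports are illustrative assumptions, not our real manifests:

    # Hypothetical example: keep the eventbus off any node that already runs an application pod.
    kubectl apply -f - <<'EOF'
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: eventbus-rabbitmq
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: eventbus-rabbitmq
      template:
        metadata:
          labels:
            app: eventbus-rabbitmq
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              # Do not schedule on a node that has a pod labeled tier=application
              - labelSelector:
                  matchExpressions:
                  - key: tier
                    operator: In
                    values: ["application"]
                topologyKey: kubernetes.io/hostname
          containers:
          - name: rabbitmq
            image: rabbitmq:3.7
            ports:
            - containerPort: 5672
    EOF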

But I'm looking for an actual solution to the problem.

Other related reference

https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#a-pod-cannot-reach-itself-via-service-ip Hairpin mode in my EKS setup is hairpin-veth. The docs give the following instruction: ensure the kubelet has permission to operate in /sys on the node. But I'm not sure how to do that, as on EKS the cbr0 interface is not there; it uses eni interfaces.
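As a hypothetical check of what the node is actually configured with, something like this can be run on the worker node (via SSH/SSM). The sysfs paths are the standard Linux bridge-port locations; on EKS with the VPC CNI there is no cbr0 bridge, so the second command may return nothing:

    # Show the kubelet's hairpin-mode flag, if it is set on the command line.
    ps -ef | grep [k]ubelet | tr ' ' '\n' | grep -- --hairpin-mode
    # Show hairpin flags on any bridge ports; likely empty with the ENI-based CNI.
    grep -H . /sys/class/net/*/brport/hairpin_mode 2>/dev/null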

-- timv
amazon-eks
aws-eks
eks
kubernetes

1 Answer

10/24/2019

OK, right after posting the question, AWS provided me with a solution to the problem:

Issue: https://github.com/aws/amazon-vpc-cni-k8s/issues/641

Downgrade the VPC CNI plugin to v1.5.3 until 1.5.5 is released: update the daemonset and restart all nodes.
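In practice the downgrade boils down to pointing the aws-node DaemonSet at the v1.5.3 image and recycling the nodes. A sketch, assuming the standard EKS image registry; adjust the account/region for your cluster and prefer the exact steps from the issue above:

    # Downgrade the VPC CNI by updating the aws-node DaemonSet image
    # (registry account/region must match your cluster; eu-west-1 shown as an example).
    kubectl -n kube-system set image daemonset/aws-node \
      aws-node=602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.5.3

    # Wait for the rollout, then drain/replace (or reboot) each node so existing pods
    # come back with fresh networking.
    kubectl -n kube-system rollout status daemonset/aws-node
    kubectl drain <node-name> --ignore-daemonsets --delete-local-data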

-- timv
Source: StackOverflow