We're running a web API app on Kubernetes (1.9.3) in AWS (set with KOPS). The app is a Deployment and represented by a Service (type: LoadBalancer) which is actually an ELB (v1) on AWS. This generally works - except that some packets (fragments of HTTP requests) are "delayed" somewhere between the client <-> app container. (In both HTTP and HTTPS which terminates on ELB).
From the node side:
( Note: Almost all packets on server-side arrive duplicated 3 times )
We use keep-alive so the tcp socket is open and requests arrive and return pretty fast. Then the problem happens:
From the client side:
I've run some traffic from my computer, recording it on the client side to see the other end of the problem, but not 100% sure it represents the real client side.
This is only happening under load.
To my understanding, this is somewhere between the ELB and the Kube-Proxy, but I'm clueless and desperate for help.
This is the arguments Kube-Proxy runs with:
Commands: /bin/sh -c mkfifo /tmp/pipe; (tee -a /var/log/kube-proxy.log < /tmp/pipe & ) ; exec /usr/local/bin/kube-proxy --cluster-cidr=100.96.0.0/11 --conntrack-max-per-core=131072 --hostname-override=ip-10-176-111-91.ec2.internal --kubeconfig=/var/lib/kube-proxy/kubeconfig --master=https://api.internal.prd.k8s.local --oom-score-adj=-998 --resource-container="" --v=2 > /tmp/pipe 2>&1
And we use Calico as a CNI:
So far I've tried:
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
- the issue remained.E0801 04:10:57.269475 1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Endpoints: Get https://api.internal.prd.k8s.local/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp: lookup api.internal.prd.k8s.local on 10.176.0.2:53: no such host
...and...
E0801 04:09:48.075452 1 proxier.go:1667] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 7 failed ) I0801 04:09:48.075496 1 proxier.go:1669] Closing local ports after iptables-restore failure
I couldn't find anything describing such issue and will appreciate any help. Ideas on how to continue and troubleshoot are welcome.
Best, A