I’m trying to debug an issue that is solved by using hostNetwork: true. The k8s installation uses kubenet, and the k8s version is 1.9.8.
The installation is done with kops on AWS, using m4.xlarge and c4.xlarge instances.
The problem is the following:
When we migrated this application to Kubernetes, the response time (95th percentile) for a certain endpoint increased by about 20-30%.
This issue is solved, though, when using hostNetwork: true in the yaml. The performance is then the same as it was on VMs for this endpoint, i.e. the 95th percentile of the response time is the same.
I asked about this in the Kubernetes office hours on July 18th (yeah, a while ago!) and the hostNetwork: true workaround came up there.
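For context, the workaround is just one field in the pod spec. A minimal sketch (the pod name and image here are placeholders, not our actual app):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp              # placeholder name
spec:
  hostNetwork: true        # the pod shares the node's network namespace (no kubenet bridge/NAT)
  containers:
  - name: app
    image: myapp:latest    # placeholder image
    ports:
    - containerPort: 8080  # placeholder port
```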
Please note that all the kube-proxy stuff can be discarded, as the increased response time is seen when measuring in the app itself: the Ruby app measures the time it takes to handle each request and sends it to the log collector. That time, measured from when the app starts processing the request until it finishes, already shows the degraded performance. So kube-proxy and friends are out of the equation.
The pod has 3 containers; the same applications also run in the VM setup.
What I tried:
- Benchmarked the endpoint with ab -c 1 -n 1000 'https://...
- Tried the same on EKS, where the difference with hostNetwork: true is less, about 10%. Please note that EKS does not use kubenet; it uses its own network overlay based on some open source project.
- Tried an endpoint that just returns a big string ("Die" * 10 * 1024 * 1024), and the issue does not happen there either.

So, I’m trying to debug this issue to understand what it is and, hopefully, stop using hostNetwork: true. There seem to be a few paths to dig further:
- Try other CNIs (EKS showed less performance degradation) to see if the performance changes; see the kops networking sketch after this list.
- See what this endpoint does and how it interacts with unicorn and the whole stack. One big difference is that unicorn is one process per request (synchronous) while nodejs is not.
- Try newer instance types (m5/c5) to see if they mitigate the performance hit. But, as the issue is not present when using the current instance types as plain VMs, it seems that even if this helps, it would only hide the problem.
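For the first path, if I understand the kops docs correctly, switching the CNI should only require changing the networking section of the cluster spec. A sketch (calico/amazonvpc are just examples of options kops supports, not a recommendation):

```yaml
# kops cluster spec fragment (sketch): replacing kubenet with another CNI.
# Only one networking key should be set at a time.
spec:
  networking:
    # kubenet: {}          # current setup
    calico: {}             # or e.g. amazonvpc: {}, which is close to what EKS runs
```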
The endpoint that has the perf problem is a Ruby endpoint that reads a database and returns a JSON. The database, the host and the network all seem fine (monitoring CPU, disk IO, swap, etc. with vmstat, our regular tools, the AWS console, checking kern.log, syslog and that kind of stuff).
By any chance, did you have a similar experience? Or do you have any other ideas on how to continue to debug this issue?
Any ideas or any kind of help is more than welcome!
Rodrigo
Sounds like the overhead you're experiencing is due to Docker's NAT. hostNetwork: true exposes the host's network to the pod/container(s) instead of going through NAT, providing better performance... but reducing security.
Hope this helps!
The problem seems to be https://github.com/kubernetes/kubernetes/issues/56903 and the workarounds mentioned there (like dnsPolicy: Default) solve the issue for me.
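In case it helps someone else, the change is just the dnsPolicy field on the pod spec. A minimal sketch (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp              # placeholder
spec:
  dnsPolicy: Default       # use the node's resolv.conf, bypassing kube-dns and its DNAT
  containers:
  - name: app
    image: myapp:latest    # placeholder
```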
These two posts explain the problem in detail: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts and https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/
And also provide some workarounds.
Long story short: there is a race condition in netfilter's conntrack that affects connectionless protocols (like UDP) when doing DNAT/SNAT. The Weave folks have sent a patch upstream that fixes most of the races. To work around it you can use an external DNS server (i.e. not kube-dns, as it is exposed via a service and therefore uses DNAT), set resolver flags for glibc (they don't work for musl), add a minimal delay with tc, etc.
Note: using dnsPolicy: Default does the trick because it points at an external DNS server (i.e. one not hosted in Kubernetes and not accessed via a service, so no DNAT is involved).
I'll test the glibc flags for my cluster: the dnsPolicy: Default trick does solve the issue for me, but we rely on k8s DNS service resolution in some apps, so it can't be applied everywhere.
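For reference, the glibc flag discussed in those posts is the single-request-reopen resolver option. A sketch of setting it per pod via dnsConfig (assuming the cluster supports pod dnsConfig, which on 1.9 may still be behind a feature gate):

```yaml
# Pod spec fragment (sketch): add the glibc resolver option to the pod's resolv.conf.
spec:
  dnsConfig:
    options:
    - name: single-request-reopen   # glibc only; has no effect with musl-based images
```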