External internet DNS resolution failures from Windows containers

9/20/2017

I've got an ACS Windows cluster set up using k8s that is generally running well. I've deployed ASP.NET Core webapi and worker app containers. These two containers work fine locally and generally in ACS as well. I can scale them out and back, deploy new versions, etc.

They are functional and working, yet they suddenly start generating DNS resolution errors when trying to access external internet resources. I'm seeing exceptions that include:

System.Net.Http.WinHttpException: The server name or address could not be resolved

The resources they are trying to access resolve fine and then suddenly stop resolving. Then after some indeterminate time (a few minutes, 20 minutes, or even a few hours it seems) they start resolving again, so it's clearly quite intermittent. Note that these external resources are CosmosDB, Azure Queues, and a third-party logging service called Loggly (the point being they are all big web properties and are not at fault here). Also note that the two containers do not necessarily lose DNS at the same time.

I've tried opening a command shell inside the container:

kubectl exec -it {podname} -- powershell

And then using powershell to request a site:

invoke-webrequest -uri www.google.com -outfile test.txt

get-content test.txt

...and it works fine; I can access google.com. So I have no idea how to debug this. Are there known issues with k8s on ACS that might be in play here?

I've deployed the same containers to a simple Server 2016 host and do not see the problem at all. So it seems to revolve around either k8s or the ACS cluster itself. I've rebuilt the ACS cluster 4 or 5 times in different regions (which use different k8s versions) and see exactly the same problem.

This is a major blocker for me. External internet access is obviously very basic and core functionality. My webapi and worker app are completely broken without it.

-- BrettRobi
azure-container-service
kubernetes

2 Answers

10/1/2017

While I suspect some weirdness in the Windows container networking code (it's been... problematic in the past), you could probably set up a small container running a DNS resolver and add it as an upstream server in the kube-dns ConfigMap. If the issue is related to kube-dns going out to an external nameserver, a local cache could help.
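As a rough sketch, assuming you already have a caching resolver running in the cluster behind a hypothetical ClusterIP of 10.2.0.10, the kube-dns ConfigMap in kube-system would point at it like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  # Hypothetical ClusterIP of the caching resolver; kube-dns forwards queries
  # for external names here instead of the node's default upstream servers.
  upstreamNameservers: |
    ["10.2.0.10"]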

Another option would be to hardcode the IPs, either with a Service without selectors, as in https://kubernetes.io/docs/concepts/services-networking/service/#services-without-selectors, or by adding entries to the pod's hosts file with hostAliases, as described in https://kubernetes.io/docs/concepts/services-networking/add-entries-to-pod-etc-hosts-with-host-aliases/. Both approaches are sketched below.
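A rough sketch of both, using hypothetical names and the placeholder IP 203.0.113.10 standing in for the external dependency's real address:

# Option 1: a Service without selectors plus a manually managed Endpoints
# object, so the app talks to a stable in-cluster name instead of resolving
# the external hostname itself.
apiVersion: v1
kind: Service
metadata:
  name: external-logging
spec:
  ports:
    - protocol: TCP
      port: 443
      targetPort: 443
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-logging      # must match the Service name
subsets:
  - addresses:
      - ip: 203.0.113.10      # placeholder IP of the external service
    ports:
      - port: 443

# Option 2: hostAliases in the pod spec, which writes fixed entries into the
# container's hosts file and bypasses DNS entirely for those names.
spec:
  hostAliases:
    - ip: "203.0.113.10"
      hostnames:
        - "my-dependency.example.com"   # placeholder hostname

The obvious downside of either approach is that the pinned IPs have to be kept in sync by hand if the external service's address ever changes.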

Hope it helps - intermittent DNS problems can sometimes make you feel like you're going crazy (I still think that one unresolved case in my experience was broken hardware somewhere).

-- p_l
Source: StackOverflow

11/1/2017

I have had indirect contact with the Windows DNS team at Microsoft and was offered a temporary fix to this problem.

Add the two commands below to the Dockerfile of any pods that are exhibiting the problem:

Set-Service dnscache -StartupType disabled
Stop-Service dnscache
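In a Windows Dockerfile that might look something like this (a sketch, assuming the base image includes PowerShell):

# Disable and stop the Windows DNS Client (dnscache) service so that every
# lookup goes straight to the configured DNS server instead of the local cache.
RUN powershell -Command "Set-Service dnscache -StartupType Disabled; Stop-Service dnscache"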

Redeploy and you should have better luck. I've been running for 2 days now and have seen zero failures, whereas previously I'd see failures within a few hours. You might notice higher latency in DNS resolution due to the lack of caching, but for me this is WAY better than outright failures. Also note this is NOT a recommended strategy for production use.

-- BrettRobi
Source: StackOverflow