Recently our domain was down for some reason, but it was only the domain name; the Kubernetes cluster itself wasn't changed at all.
Now the pods cannot communicate via domains and sub-domains; by IP they work fine, e.g. curl ip-to-any-pod
is OK, but curl sub-domain.domain.com
won't work. It says curl: (6) Could not resolve host: sub-domain.domain.com
What's strange is that it works sometimes and sometimes it doesn't.
I have gone through every related issue I could find online but can't find anything specific, and the logs, events, etc. don't tell me anything either.
I restarted my pods and the Calico network pods, but nothing has changed.
I got this message once while restarting one of my pods:
Warning FailedCreatePodSandBox 45s kubelet, ip-xxx-xx-xx-xx.ap-south-1.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "db2249c98d0b8b4bbef79ac5cd7e5c36c957f3929637093268670e7002c2467f" network for pod "web-6576f9fcdc-kt9xw": NetworkPlugin cni failed to set up pod "web-6576f9fcdc-kt9xw_hc" network: dial tcp: lookup etcd-a.internal.cluster.xxxx.xx on xxx.xx.x.x:53: no such host, failed to clean up sandbox container "db2249c98d0b8b4bbef79ac5cd7e5c36c957f3929637093268670e7002c2467f" network for pod "web-6576f9fcdc-kt9xw": NetworkPlugin cni failed to teardown pod "web-6576f9fcdc-kt9xw_hc" network: dial tcp: lookup etcd-a.internal.cluster.xx.xx on xxx.xx.x.x:53: no such host]
Often when setting up a domain it takes time for it to propagate, and it propagates non-uniformly. It's common to see that immediately after creating the record you will not be able to resolve it at all, then a little later it'll be flaky, and eventually it will stabilize. Sometimes DNS takes tens of hours to propagate.
There are various articles online, easily found with an Internet search, which explain why DNS propagation can take so much time. There are also neat tools like DNS Checker which can give you a sense of how well your DNS records have propagated globally.
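If you want a quick check yourself, one option is to query a couple of public resolvers directly and compare the answers (sub-domain.domain.com is just a placeholder for your record):
# Ask Google and Cloudflare public DNS directly; differing or empty answers suggest the record is still propagating
dig +short sub-domain.domain.com @8.8.8.8
dig +short sub-domain.domain.com @1.1.1.1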
As you confirmed in the comments, your problems went away by the next day.
In my opinion your question is quite complex and can't be answered so simply.
Please refer to:
Cluster-specific issues, e.g. the Kubernetes 1.15.3 release notes (you can verify these settings in your environment):
The default TTL for DNS records in the kubernetes zone has been changed from 5s to 30s, to stay consistent with the old dnsmasq-based kube-dns. The TTL can be customized with the command below; a sketch of where this setting lives in the Corefile follows these notes.
kubectl edit -n kube-system configmap/coredns
Reverted the CoreDNS version to 1.3.1 for kubeadm cluster-dns
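Regarding the TTL mentioned above: it is set via the ttl option inside the kubernetes block of the Corefile opened by that command. The snippet below is only a sketch of a typical default kubeadm Corefile; your cluster's zones and plugins may differ:
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        ttl 30          # TTL for records served from the cluster zone
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
}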
First, please start debugging your cluster and verify whether your problem is related to your external domain settings or is a cluster-internal issue: Debugging DNS Resolution
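As a starting point, that guide suggests running a small test pod and checking whether an in-cluster service name resolves, something like:
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
kubectl exec -i -t dnsutils -- nslookup kubernetes.default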
Please verify the local DNS configuration in /etc/resolv.conf inside your pod.
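For example, reusing the dnsutils test pod from above (the nameserver listed should be your cluster DNS Service IP, e.g. 10.96.0.10 on a default kubeadm setup):
kubectl exec -ti dnsutils -- cat /etc/resolv.conf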
Please check for errors in the DNS/CoreDNS pods.
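The CoreDNS pods in kube-system carry the k8s-app=kube-dns label, so you can inspect them with:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns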
To obtain more information about DNS resolution you can use different tools like nslookup, dig, or traceroute.
For example:
nslookup -type=a [domain.com]
Or against a specific name server:
nslookup -type=a [domain.com] [ns server]
Using those tools you can also get information about non-authoritative or authoritative answers.
An authoritative name server is a name server that holds the original zone files for a domain.
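If you want an authoritative answer, one way is to look up the zone's name servers first and then query one of them directly (ns1.example-provider.com below is just a placeholder for whatever the first command returns):
# List the authoritative name servers for the zone, then ask one of them directly
dig +short NS domain.com
dig sub-domain.domain.com @ns1.example-provider.com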
Because it's very important in a production environment, try to reproduce the issue so that you can keep your services healthy in the future.
Hope this helps.